This can be very difficult. We wish we could push a magic button and take care of it for you, but the truth is that the scalability and performance of your site is much more under your control than ours. When sites encounter performance issues (such as 503 errors) under load, we almost always find that it is some problem with the site's programming that is responsible. Based on our extensive experience with such situations, we have prepared some general guidelines to help you design a site that is fast and scalable and remains available under heavy load.
First, if you are running WordPress, this information will not be much help. Basic WordPress installs are hopelessly incapable of surviving traffic surges; it takes extensive, specialized tuning to keep a WordPress site available under load. We recommend that you read and follow our Advanced WordPress Configuration guide, but even that is only a good start.
For applications you are developing yourself, the following information may be helpful.
- The basic principle of scalability is that however fast requests come in, they must be completed at the same rate. If they are not, a backlog forms, response times climb, and the site eventually fails. For example, if requests arrive at fifty per second but your site can only finish forty per second, ten more requests join the queue every second until something gives way. This is the web's equivalent of the law of gravity: what goes in must come out. You cannot get around it.
- Take advantage of our network to cache whatever you can. If you set Cache-Control: or Expires: headers for scripted content, our system can store copies in RAM and serve that content orders of magnitude faster than if you generate the same content over and over from a script for each request; see the header sketch after this list. (This is helpful primarily for content that is shared between users. If you personalize a page for each visitor -- such as by using cookies -- caching is unlikely to help. Keep this in mind when designing your site, and try to avoid using cookies or sessions where they are not needed, especially on top-level or high-traffic pages.)
- Consider the availability, response times, and usage limits of any external resources your site depends on, such as APIs. The 100-requests-per-minute limit on that free message queueing web service (or mapping API, or geo IP lookup service, or whatever) you're using seems like plenty until you wind up on the front page of Reddit. Likewise, if your site depends on an external resource and that resource goes down, requests to your site will hang waiting for it; those long waits tie up your processes and can quickly collapse the entire site. (A sketch of putting a ceiling on that wait appears after this list.)
- Don't do anything that causes incoming requests to be serialized; that will make the site work fine normally (or in testing) but collapse completely under even trivial load. Nobody serializes requests intentionally; it's usually the side effect of something that seemed like a good idea at the time. Making every request obtain an exclusive lock on the same file, like a log file, is one common way to do this (sketched after this list). (N.B. if you use mod_rewrite and enable logging, it does exactly this.)
- Your database can also be a source of performance problems. Although SQLite performs extremely well under low concurrency, it uses file locking and can have major scalability problems if any writes are performed. Don't use SQLite for an application you intend to scale unless it is read-only; use MySQL instead. MySQL's base performance is slower, but it handles high concurrency much better. It is not a magic bullet, however: to get good performance from MySQL, you should pay careful attention to indexing and locking and monitor your slow queries (see the EXPLAIN sketch after this list). MySQL also has performance thresholds where crossing a limit causes a dramatic slowdown. One example is temporary tables: small temporary tables are kept in RAM, but large ones are written to disk, so performance falls off a cliff the moment a table exceeds the in-RAM threshold.
- If your site uses caching or temporary files (including PHP's default file-backed session handling), make sure the files are well organized and routinely pruned; see the session-tuning sketch after this list. Directories containing tens of thousands of files that must be traversed for every request can saturate your available I/O, resulting in extremely poor performance.
- If part or all of your application requires very high concurrency, particularly for long-lived requests, don't use PHP. PHP's one-process-per-request model is not well suited to large numbers of concurrent requests, especially if they are mostly idle. Something like Node.JS is much more effective for that purpose and can scale to thousands of simultaneous connections, whereas PHP will typically choke at dozens. (Or, if it does not, you will likely choke when you see the bill; Node.JS not only handles high concurrency much better, it also handles it much more cost-effectively.)
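To make the caching point above concrete, here is a minimal sketch of a PHP script that marks its own output as cacheable. The five-minute lifetime and the page content are invented for illustration; the essential part is sending the Cache-Control header before any output.

```php
<?php
// Sketch only: tell caches this response may be stored and reused for up
// to five minutes. The 300-second lifetime is an arbitrary example value.
header('Cache-Control: public, max-age=300');

// Avoid session_start() and setcookie() on pages you want cached; cookies
// usually make a response per-user and therefore uncacheable.
echo '<h1>Latest headlines</h1>';
echo '<p>Generated at ' . gmdate('H:i') . " UTC; cached copies may be up to five minutes old.</p>\n";
```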
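One practical defense for the external-resources point is to cap how long your site will wait on someone else's service, so an outage elsewhere cannot pile up requests on your end. A hedged sketch using PHP's cURL extension; the URL and the two-second limits are placeholders, not recommendations.

```php
<?php
// Sketch: put a hard ceiling on how long a request will wait for an
// external service. The URL and the two-second limits are placeholders.
$ch = curl_init('https://geoip.example.com/lookup?ip=203.0.113.7');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2); // seconds to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 2);        // seconds for the entire request

$response = curl_exec($ch);
curl_close($ch);

if ($response === false) {
    // The service is down, slow, or throttling us: fall back to a default
    // (or a stale cached copy) instead of making every visitor wait.
    $response = null;
}
```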
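The log-file case from the serialization point looks something like the following anti-pattern (the path and message are invented). The problem is that one shared exclusive lock turns many parallel requests into a single-file line.

```php
<?php
// Anti-pattern sketch (file name invented): every request contends for an
// exclusive lock on the same log file, so under load requests queue up
// behind one another waiting for their turn to write.
$fp = fopen('/home/private/site.log', 'a');
if ($fp !== false) {
    flock($fp, LOCK_EX);   // every concurrent request waits here for its turn
    fwrite($fp, gmdate('c') . ' ' . $_SERVER['REQUEST_URI'] . "\n");
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Less serializing alternatives: skip per-request logging on hot pages,
// batch it, or use error_log(), which appends to the server's error log
// without a lock you manage yourself.
error_log('handled ' . $_SERVER['REQUEST_URI']);
```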
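On the MySQL point, the most common indexing check is EXPLAIN: it shows whether a query can use an index or must scan the whole table. A rough sketch using PDO, with invented table, column, and credential names:

```php
<?php
// Sketch with invented table, column, and credential names: use EXPLAIN to
// see how MySQL will run a hot query, and add an index if it is scanning
// the whole table.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'example_user', 'example_password');

// A "type" of ALL in the plan means a full table scan, which gets slower
// as the table grows.
$plan = $pdo->query('EXPLAIN SELECT * FROM posts WHERE author_id = 42')
            ->fetchAll(PDO::FETCH_ASSOC);
print_r($plan);

// If the plan shows a full scan, indexing the filtered column usually turns
// the query into a cheap lookup.
$pdo->exec('CREATE INDEX idx_posts_author ON posts (author_id)');
```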
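For the temporary-files point, PHP's default file-backed sessions are a frequent offender: every visitor who gets a session leaves a sess_* file behind, and the garbage-collection settings control how long those files accumulate. A hedged sketch of settings worth reviewing; the values and path are placeholders, and on shared hosting some of these may belong in php.ini or .user.ini rather than in code.

```php
<?php
// Placeholder values throughout; set these before session_start().

// Keep session files in a directory you control (and can prune), instead
// of one giant shared temp directory.
ini_set('session.save_path', '/home/private/sessions');

// Let idle sessions expire after 30 minutes so files do not pile up forever.
ini_set('session.gc_maxlifetime', 1800);

// Run garbage collection on roughly 1% of requests so expired session files
// actually get deleted. (Some systems disable this and use a cron job instead.)
ini_set('session.gc_probability', 1);
ini_set('session.gc_divisor', 100);

// Finally, only call session_start() on pages that actually need a session.
```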
Building scalable web sites is a high art, and the tips shared here only scratch the surface. But hopefully they offer a place to start on the long, hard journey to building a site that can handle hundreds of thousands of simultaneous users and dozens of requests per second. (And yes, we host those types of sites.)