Site Reliability Team Lead at News UK

24 November 2011

Case Study: Optimising A Cloud Application

by Mike

I was recently brought in to examine the infrastructure of a small startup. This wasn’t anything really special; I do it quite often for various reasons. What was different was that they didn’t particularly have issues with scaling out - they had that working well with their shared-nothing web application and MongoDB backend. What they were having issues with was their infrastructure costs.

I normally work through a six-step process that has been built up over time -

  1. Monitor and gather stats/measure the problem,
  2. Standardise on a reference architecture,
  3. Add configuration management and version control,
  4. Start to define a playbook of how to do things (like up/downscale or provision new machines and clusters) and start to automate them,
  5. Bring everything to reference architecture/consolidate underutilised servers and eliminate unused infrastructure,
  6. Consider architecture changes to make it more efficient.
  7. ...and repeat

I will take you through a case study showing how this process was used to lower their monthly costs. Names and details have been changed in places to protect the guilty… ;)

Monitor and Gather Stats / Measure the Problem

Unless you know what the servers are doing it is hard to know what to adjust or fix when things go wrong. Proper instrumentation is essential for any non-trivial application.

The architecture at this point roughly consisted of Varnish load balancers in front of shared-nothing PHP application servers, backed by MongoDB (with MySQL holding the sessions).

The biggest problem they were having was that they just didn’t know what their infrastructure was doing. They had server and service monitoring with Nagios, but beyond knowing whether a service was up or down they had no metrics or stats from the various components within their architecture. We installed Munin to correct this shortfall - it isn’t ideal, but it is a pretty good starting point for looking at historical data for trends and patterns of usage.

What we found was what I had suspected at the start - they could scale out really well, but much of their infrastructure was operating at an average of only around 10-15% utilisation. Some of this was due to normal slack periods outside the application's peak times. The app was aimed at business users, so around 10am-12pm and 2-4pm the usage would peak at hundreds of times the load it faced at 2am (their quietest time). This is fairly typical of modern SaaS apps and wasn’t a massive surprise.
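Munin graphs aside, the underutilisation check itself is simple arithmetic. A minimal sketch, with made-up hostnames and CPU samples standing in for the real monitoring data:

```python
# Hypothetical per-host CPU samples (percent), standing in for munin's graphs.
samples = {
    "web-01": [12, 95, 88, 9, 6, 4],   # busy at the two daily peaks, idle otherwise
    "web-02": [11, 90, 85, 8, 5, 3],
    "batch-01": [2, 3, 2, 2, 1, 2],    # barely doing anything, all day
}

def average_utilisation(values):
    """Mean utilisation across the sampled period."""
    return sum(values) / len(values)

for host, cpu in samples.items():
    avg = average_utilisation(cpu)
    flag = " - candidate for consolidation" if avg < 15 else ""
    print(f"{host}: avg {avg:.0f}%{flag}")
```

The 15% threshold is arbitrary here; the point is that averages hide the peaks, which is why step 5 consolidates the always-idle boxes but keeps peak capacity for the web tier.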

As I said before, the team was pretty savvy and had pretty much nailed their scale-out issue by choosing software (Varnish + PHP + MongoDB in this case) that could scale out horizontally without too much of an issue. I was surprised, especially considering how well they had implemented the rest of the architecture, to find no obvious cache. Memcache is the usual choice, found in probably 90% of large-scale cloud apps.

On discussing it with one of the developers, we found that Memcache had originally been specified, but the team manager at the time shot it down in flames due to some previous bad experience with it. Instead, the developers worked around the issue by using APC (the PHP opcode cache extension) to give each node its own cache. Given how the original app was designed this wasn’t particularly efficient, so the developers also added session affinity to the Varnish configuration to make sure a user's data was only stored on a single node, rather than updating the database for every request (which is what they had ended up doing with their sessions). Storing the sessions in the database also added MySQL to the mix when it really wasn’t needed.
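The per-node APC cache only works if a given user always lands on the same backend. The actual Varnish VCL isn't shown in the post, but the idea behind that kind of session affinity can be sketched in a few lines (node names and the hashing scheme here are illustrative assumptions, not their config):

```python
import hashlib

BACKENDS = ["app-01", "app-02", "app-03"]  # hypothetical node names

def backend_for(session_id):
    """Map a session ID to a stable backend so its per-node cache stays warm."""
    digest = hashlib.sha1(session_id.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

# The same session always hashes to the same node:
assert backend_for("sess-abc123") == backend_for("sess-abc123")
```

Note the fragility: simple modulo hashing reshuffles most sessions whenever the node count changes, which is part of why per-node caching is a workaround rather than a fix.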

Personally, I feel that you should really go through all the steps before making architecture changes, but one of the developers set up two Memcache clusters (one for sessions and one for general caching) and rewrote the caching section of the app in a few hours, so we rolled with it and started again at step 1…

Monitor and Gather Stats / Measure the Problem (again)

The disadvantage of storing sessions in memcache is that if one of the Memcache servers goes offline or starts to run out of memory it will log users out. This is a BadThing™.
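One common mitigation - not what this team did, just a sketch of the pattern - is to treat Memcache as a fast front for sessions while keeping an authoritative copy in a durable store, so a cache node dying only costs speed, not logins. Plain dicts stand in for Memcache and the database here:

```python
class SessionStore:
    """Write-through session store: cache for speed, backing store for durability."""

    def __init__(self):
        self.cache = {}    # stands in for Memcache (fast, may lose data)
        self.backing = {}  # stands in for a durable store (MySQL, MongoDB, ...)

    def save(self, session_id, data):
        self.backing[session_id] = data  # durable copy first
        self.cache[session_id] = data

    def load(self, session_id):
        if session_id in self.cache:
            return self.cache[session_id]
        data = self.backing.get(session_id)  # cache miss: fall back to durable copy
        if data is not None:
            self.cache[session_id] = data    # repopulate the cache
        return data

store = SessionStore()
store.save("sess-1", {"user": "alice"})
store.cache.clear()  # simulate a Memcache node going away
assert store.load("sess-1") == {"user": "alice"}  # the user stays logged in
```

The trade-off is a database write per session update, which is exactly the load the team was trying to remove - hence pure-Memcache sessions being a judgement call rather than an obvious win.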

The architecture now looked much the same, with the two Memcache clusters added alongside the original stack.

Adding Memcache actually made the utilisation figures look even worse - the next day average utilisation was down to 7%… It seems the added load of storing sessions in the database had pretty much halved the overall efficiency of the servers. We added Memcache monitoring into Munin, set up some alerts in Nagios for the new servers, and moved on to the next stage.

Standardise on a reference architecture

Having dozens of separate hand-rolled configurations is a maintenance nightmare. Make all boxes as similar as possible except where they need to differ. Take a standard image and install only the needed software for that function - no more, no less.

Many times I have come to a company and found that while all the servers might be running the same OS (Ubuntu Linux, for example), they are often at various different patch levels and sometimes one or two versions behind. This case, while not quite that bad, wasn’t far from it.

All servers were running Debian Squeeze - nothing wrong with that - and mostly the servers had similar configurations… but not all. Many of the boxes had been built when they started writing the application and had a lot of excess packages and configuration-file backups all over the place. Also, two of the systems had been set up with more memory than the others, left over from when someone was testing something. This is a pretty common situation, and not too bad.

We took a clone of one of each type of system and pared it down to the essentials, then started to document exactly what was needed for each task - which repos were needed, how much memory was needed and which configuration files needed to be edited from the defaults. We added this information to the playbook - a document that tells you exactly what to do, and in what order, for a given task; in this case, provisioning a new machine from the default image.

Add configuration management and version control

In this case the developers didn’t feel it was the right time to add a configuration management system to their architecture. I argued against it, but the client is always right… (supposedly). However, I did get them to agree to using a version control system (in this case git) to keep the changing versions of each machine's config files. This allows the developers to roll back changes if they need to and also helps them see what has changed between now and some point in the past. Not ideal, but you take what victories you can get.

Start to define a playbook of how to do things (like up/downscale or provision new machines and clusters) and start to automate them

The playbook isn't actually a stage; it is a document that goes alongside your day-to-day running of the systems. It is used so everyone knows the correct way to do the various tasks needed in the creation, maintenance and repair of your systems. It can be a wiki (as it was in this case), a fileshare with text documents on it, or even an actual book (a grimoire).

We’d already made notes on how to take a new machine from bare image to functional server, and we continued making notes throughout the day. We ended up with about 20 separate tasks that are required on a frequent basis to keep the systems running properly.

Bring everything to reference architecture/consolidate underutilised servers and eliminate unused infrastructure

One by one, we built new servers to replace the few that had been created with too much memory (the equivalent of going from a Large to a Small on Amazon - not that they were on Amazon) and swapped them out. Due to availability requirements, we couldn’t simply remove them completely: as a rule of thumb you want enough capacity for peak times plus one spare server, and that is what they had at this point. However, even without increasing the average utilisation much, we reduced their monthly costs by around $190/mo per server for these two servers. We also noticed that some servers really were running nothing most of the time - simply eating up money for no real return. The few tasks on these systems were moved to a single small instance and the machines were terminated. In total around 8 machines were removed, which together reduced monthly costs by around $500.

Total monthly savings from all the changes so far: around $700… but as they grow, the savings will increase too…

Consider architecture changes to make it more efficient

They had already added Memcache to their systems to reduce the number of queries hitting the database, but their peak load was still the main problem. Some companies follow best practice and move all their static images to a fast image server, or sometimes onto a CDN. Neither of these had been done with this application.

The other thing you can do with static files is give them a nice long expiry time, so that browsers and intermediate proxies keep them around for a while and don’t have to load them every time they go to the page. The default Apache install they were using at the time does a pretty good job in this regard. While it doesn’t set an Expires header without being told to, it does send an entity tag (ETag), which the browser can then use to ask whether the file has changed since it was last read. This saves bandwidth, but the browser still has to ask the server about the file, which on slow links can take a few hundred milliseconds.

They already had analytics software on their web pages, and along with the server logs it was easy to see that image and script loading accounted for around 30% of their bandwidth costs. Each page imported around 6 or 7 scripts in addition to the static images. Four of these were available on the Google CDN (http://code.google.com/apis/libraries/devguide.html), which saved around 6% of their bandwidth. The nice side-effect was that their pages also loaded faster, as most web browsers limit the number of requests they will make in parallel. With the scripts coming off Google's servers (and being cacheable), they were often already in the browser's cache or could be fetched alongside the scripts and images from the application’s own servers.
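The ETag round trip described above boils down to a small server-side check. A simplified sketch (not Apache's actual implementation; the MD5-of-body ETag is just an illustrative choice):

```python
import hashlib

def respond(body, if_none_match=None):
    """Simplified conditional GET: return 304 if the client's ETag still matches."""
    etag = '"%s"' % hashlib.md5(body).hexdigest()
    if if_none_match == etag:
        return 304, etag, b""   # nothing resent; the browser reuses its cached copy
    return 200, etag, body      # full response, with the ETag for next time

status1, etag, _ = respond(b"site-logo-bytes")          # first request: 200 + body
status2, _, body2 = respond(b"site-logo-bytes", etag)   # revalidation: 304, empty body
```

Note that the 304 still costs a network round trip - which is the latency point made above. A far-future Expires header avoids the request entirely.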

Most other static files were moved onto a separate pair of static servers running nginx. Once the Expires header was set to max, most of the static files were only loaded once across the whole site. With all the static images gone from the PHP servers, much of the load at peak times was removed and the number of PHP servers was brought down from 13 (12 + 1 spare) to 8, which more than made up for the 2 extra servers they had added to the architecture. They could just as easily have pushed all the static files to a service such as S3 or Rackspace Cloud Files, but they believed that was too much of a risk. (It isn’t.)
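The nginx side of this is essentially a one-liner. A sketch of the kind of block the static servers would carry (the file extensions matched are an assumption, not their actual config):

```nginx
# Far-future caching for static assets on the dedicated static servers.
location ~* \.(png|jpe?g|gif|css|js)$ {
    expires max;   # sets a far-future Expires header plus Cache-Control: max-age
}
```

With this in place, a returning browser doesn't even revalidate - the asset is served straight from its local cache until it expires, so renaming the file is the usual way to push out a changed version.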

One of the problems they had at peak times was that a request would often fire off a function that consumed a lot of processing power for a couple of tenths of a second. When many of these requests came in within a short space of time on a single server, the server would slow down appreciably. These functions didn’t need to run at request time - often the data wouldn’t be used for a few minutes (or sometimes never). With this in mind I suggested adding Gearman to the mix. Gearman is a priority-based queuing system that, in this situation, would let the request drop the data onto a queue to be picked up by a worker later. This would have eliminated many of the load spikes and reduced the number of servers required to service peak periods. Alas, this was also vetoed, so we couldn’t reduce the number of servers further.
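The pattern Gearman would have provided - the request handler drops a job on a queue and returns immediately, and a worker does the expensive part later - can be sketched with the standard library (this illustrates the general pattern, not Gearman's actual API):

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the Gearman job queue
results = []

def worker():
    """Background worker: drains jobs whenever there is spare capacity."""
    while True:
        task = jobs.get()            # blocks until a job arrives
        if task is None:             # sentinel: shut down
            break
        results.append(task["n"] * task["n"])  # the "expensive" work, off the request path
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# The request handler just enqueues and returns immediately:
for n in range(5):
    jobs.put({"n": n})

jobs.join()    # wait for the backlog to drain (demo only; a real app wouldn't block)
jobs.put(None)
t.join()
```

Because the work is absorbed by the queue rather than done in the request, a burst of these requests no longer translates into a CPU spike on the web servers - which is exactly what would have let them trim the peak-time server count.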

If they hadn’t rejected the idea of configuration management completely, bringing a new PHP server up would have been pretty simple: add it to the config database; turn it on; the server installs its required packages and pulls its config from the configuration system; Nagios notes that the server is up and runs a second test from the notification script; on success it is added to the Varnish config, Varnish is triggered to reload, and load starts hitting the new server. This means that once the peak load is gone you can simply turn off a server to save money, and conversely, when load approaches a threshold, you start up another one. I personally dislike the idea of autoscaling databases and similar things, since there is more possibility of data loss if something goes wrong.
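The turn-them-on-and-off logic described above reduces to a threshold check. A deliberately crude sketch - the thresholds and the minimum-server floor are made-up numbers, and a real autoscaler would also want hysteresis and cool-down periods:

```python
def scaling_decision(avg_utilisation, active_servers, min_servers=2):
    """Crude threshold autoscaler for a stateless web tier (never for databases)."""
    if avg_utilisation > 75:                    # approaching peak: add capacity
        return active_servers + 1
    if avg_utilisation < 25 and active_servers > min_servers:
        return active_servers - 1               # quiet period: turn one off, save money
    return active_servers                       # comfortable middle ground: do nothing

assert scaling_decision(80, 5) == 6
assert scaling_decision(10, 5) == 4
assert scaling_decision(10, 2) == 2   # never drop below the floor (capacity + spare)
assert scaling_decision(50, 5) == 5
```

This only makes sense for the shared-nothing tier; as noted above, applying the same trick to stateful systems risks data loss when a node is torn down mid-write.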