
Union Station beta is back online, and here’s what we have been up to

By Hongli Lai on March 4th, 2011

Union Station is our web application performance monitoring and behavior analysis service. Its public beta was launched on March 2.

Within just 12 hours of Union Station's public beta launch we were experiencing performance issues: users noticed that their data wasn't showing up and that the site was slow. We tried to expand our capacity on March 3 without taking the service offline, but eventually we were left with no choice but to do so anyway. Union Station is now back online and we've expanded our capacity threefold. Being forced to take it offline for scaling so soon is both a curse and a blessing. On the one hand we deeply regret the interruption of service and the inconvenience it has caused. On the other hand we are glad that so many people are interested in Union Station that we ran into scaling issues on day 1.

In this blog post we will explain:

  • The events that led to the scaling issues.
  • What capacity we had before scaling, what our infrastructure looked like and what kind of traffic we were expecting for the initial launch.
  • How we've scaled Union Station and what our infrastructure looks like now.
  • What we’ve learned from it and what we will do to prevent similar problems in the future.

Preparation work before day 1

Even before the launch, our software architecture was designed to be redundant (no single point of failure) and to scale horizontally across multiple machines if necessary. The most important components are the web app, MySQL, MongoDB (NoSQL database), RabbitMQ (queuing server) and our background workers. The web app receives incoming data, validates it and puts it in RabbitMQ queues. The background workers listen on the RabbitMQ queues, transform the data into more usable forms and index it into MongoDB. The bulk of the data is stored in MongoDB while MySQL is used for storing small data sets.
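As a rough sketch of this flow (not our production code: the field names and validation rule are made up, and an in-memory queue and array stand in for RabbitMQ and MongoDB):

```ruby
require "json"

QUEUE = Queue.new  # stands in for a RabbitMQ queue
MONGO = []         # stands in for a MongoDB collection

# Web app side: validate an incoming request log, then enqueue it.
def receive(raw_log)
  data = JSON.parse(raw_log)
  raise ArgumentError, "missing app_id" unless data["app_id"]
  QUEUE << data
end

# Background worker side: pop a log, transform it, "index" it into MongoDB.
def index_one
  data = QUEUE.pop
  data["indexed_at"] = Time.now.utc
  MONGO << data
end

receive('{"app_id": 42, "path": "/home", "duration_ms": 12}')
index_one
puts MONGO.length
```

Because the worker side is stateless, more workers can simply be attached to the same queues when throughput needs to grow.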

Should scaling ever become necessary, every component can be spread across multiple machines. The web app is already trivially scalable: it is written in Ruby so we can just add more Phusion Passenger instances and hook them behind the HTTP load balancer. The background workers and RabbitMQ are also very easy to scale: each worker is stateless and can listen on multiple RabbitMQ queues, so we can just keep adding workers and RabbitMQ instances. Traditional RDBMSes are very hard to write-scale across multiple servers and typically require sharding at the application level. This is the primary reason why we chose MongoDB for storing the bulk of the data: it allows easy sharding with virtually no application-level changes. It also allows easy replication, and its schemaless nature fits our data very well.

In extreme scenarios our cluster would end up looking like this:


Abstract overview of the Union Station software architecture (updated)

We started with a single dedicated server with an 8-core Intel i7 CPU, 8 GB of RAM, 2×750 GB hard disks in RAID-1 configuration and a 100 Mbit network connection. This server hosted all components in the above picture on the same machine. We explicitly chose not to use several smaller, virtualized machines at the beginning of the beta period for efficiency reasons: our experience with virtualization is that it imposes significant overhead, especially in the area of disk I/O. Running all services on the bare metal allows us to get the most out of the hardware.

Update: we didn’t plan on running on a single server forever. The plan was to run on a single server for a week or two, see whether people were interested in Union Station, and if so add more servers for high availability etc. That’s why we launched it as a beta and not as a final release.

During day 1 and day 2

Equipped with a pretty beefy machine like this, we thought it would hold out for a while, allowing us to gauge whether Union Station would be a success before investing in more hardware. It turned out that we had underestimated the number of people who would join the beta as well as the amount of data they have.

Within 12 hours of launching the beta, we had already received over 100 GB of data. Some of our users sent hundreds of request logs per second to our service, apparently running very large websites. The server had CPU power in abundance, but with this much data the hard disks became the bottleneck: the background workers were unable to write the indexed data to MongoDB quickly enough. Indeed, ‘top’ showed that the CPUs were almost idle while ‘iotop’ showed that the hard disks were running at full speed. As a result the RabbitMQ queues grew longer and longer. This is why many users who registered after the first 8 hours didn’t see their data for several minutes. By the next morning the queue had become even longer, and the background workers were still busy indexing data from 6 hours earlier.
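To make the backlog concrete, here is a back-of-the-envelope calculation. Only the 100 GB / 12 hours figure comes from our logs; the indexing rate is an illustrative assumption, not a measured number:

```ruby
# Rough backlog arithmetic: when ingest outpaces indexing, the queue
# grows by the difference, hour after hour.
ingest_rate = 100.0 / 12  # ~8.3 GB of incoming data per hour (measured)
index_rate  = 5.0         # assume the disks can index 5 GB per hour
backlog     = (ingest_rate - index_rate) * 12
puts format("backlog after 12h: %.1f GB", backlog)
```

With numbers like these the workers never catch up: the queue keeps growing until either capacity is added or the inflow stops.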

During day 2 we performed some emergency filesystem tweaks in an attempt to make things faster. This did not have a significant effect, and after a short while it was clear: the server could no longer handle the incoming data quickly enough and we needed more hardware. The short-term plan was to order additional servers and shard MongoDB across all 3 servers, thereby cutting the write load on each shard to one third. During the afternoon we ordered 2 additional servers with 24 GB RAM and 2×1500 GB hard disks each, which were provisioned within several hours. We set up the new hard disks in RAID-0 instead of RAID-1 this time for better write performance, and we formatted them with the XFS filesystem because that tends to perform best with large database files like MongoDB’s. By the beginning of the evening, the new servers were ready to go.

Update: RAID-0 does mean that if one disk fails we lose pretty much all data. We take care of this by making separate backups and setting up replicas on other servers. We’ve never considered RAID-1 to be a valid backup strategy. And of course RAID-0 is not a silver bullet for increasing disk speed but it does help a little, and all tweaks and optimizations add up.

Update 2: Some people have pointed out that faster setups exist, e.g. using SSD drives or servers with enough RAM to keep the data set in memory. We are well aware of these alternatives, but they are either not cost-effective or couldn’t be provisioned by our hosting provider within 6 hours. We are also well aware of the limitations of our current setup; should demand ever rise to the point where it cannot handle the load, we will do whatever is necessary to scale, including considering SSDs and more RAM.

MongoDB shard rebalancing caveats

Unfortunately MongoDB’s shard rebalancing system proved to be slower than we had hoped. During rebalancing there was so much disk I/O that we could process neither read nor write requests in a reasonable time. We were left with no choice but to take the website and the background workers offline during the rebalancing.

After 12 hours the rebalancing still wasn’t done. The original server still had 60% of the data, while the second server had 30% and the third server 10%. Apparently MongoDB performs a lot of random disk seeks and inefficient operations during rebalancing. Further downtime was deemed unacceptable, so after an emergency meeting we decided to take the following actions:

  1. Disable the MongoDB rebalancer.
  2. Take MongoDB offline on all servers.
  3. Manually swap the physical MongoDB database files between servers 1 and 3, because the latter has both a faster disk array and more RAM.
  4. Change the sharding configuration and swap server 1 and 3.
  5. Put everything back online.

These operations took approximately 1.5 hours. After a 1-hour testing and maintenance session we put Union Station back online again.

Even though MongoDB’s shard rebalancing system did not perform to our expectations, we do plan on continuing to use MongoDB. Its sharding system – just not the rebalancing part – works very well, and the fact that it’s schemaless is a huge plus for the kind of data we have. During the early stages of the development of Union Station we used a custom database that we wrote ourselves, but maintaining this database system soon proved too tedious. We have no regrets about switching to MongoDB, but we have now learned more about its current limits. Our system administration team will keep these limits in mind the next time we experience scaling issues.

Our cluster now consists of 3 physical servers:

  • All servers are equipped with 8-core Intel i7 CPUs.
  • The original server has 8 GB of RAM and 2×750 GB hard disks in RAID-1.
  • The 2 new servers have 24 GB of RAM each and 2×1500 GB hard disks in RAID-0, and are dedicated database servers.

Application-level optimizations

We did not spend those hours idly waiting: we had been devising plans to optimize I/O at the application level. During the downtime we made the following changes:

  • The original database schema had a lot of redundant data. This redundant data was for internal bookkeeping and was mainly useful during development and testing. However, it turned out to be infeasible to do anything with this redundant data in production because of the sheer amount of it. We’ve removed it, which in turn also forced us to remove some development features.
  • One of the main points that differentiates Union Station is that it logs all requests – all of them, not just the slow ones. It logs a lot of details, such as the events that have occurred (as you can see in the request timeline) and cache accesses. As a result the request logs we receive can be huge.

    We’ve implemented application-level compression for some of the fields. We cannot compress all fields because we need to run queries on them, but the largest fields are now stored in gzip-compressed form in the database and are decompressed at the application level.
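As a sketch of the idea (the helper names are hypothetical, not our actual schema code), a large log field can be gzipped before storage and gunzipped again on read, using Ruby’s standard Zlib library:

```ruby
require "zlib"
require "stringio"

# Hypothetical helpers: gzip a large field before storing it in the
# database, and gunzip it when the application reads it back.
def gzip_field(text)
  io = StringIO.new
  gz = Zlib::GzipWriter.new(io)
  gz.write(text)
  gz.close
  io.string
end

def gunzip_field(blob)
  Zlib::GzipReader.new(StringIO.new(blob)).read
end

request_log = "GET /home 200 12ms\n" * 1_000  # large, repetitive log body
stored      = gzip_field(request_log)
puts "#{request_log.bytesize} bytes raw, #{stored.bytesize} bytes stored"
```

The trade-off is that a compressed field can no longer be queried directly, which is why only the large fields that we never query are stored this way.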

How we want to proceed from now

We’ve learned not to underestimate the amount of activity our users generate. We did expect to have to scale eventually, but not after 1 day. To prevent a meltdown like this from happening again, here’s our plan for the near future:

  • Registration is still closed for now. We will monitor our servers for a while, see how they hold up, and if they perform well we might open registration again.
  • During this early stage of the open beta, we’ve made the data retention time 7 days regardless of which plan was selected. As time progresses we will slowly increase the data retention times until they match those of the paid plans. When the open beta is over we will honor the paid plans’ data retention times.
Comments

  • http://www.bluebox.net Jesse Proudman

    Thanks for such a great writeup on the pitfalls of your launch. It’s always exciting to hear about the strategies employed in a high-stress environment.

    I am curious – With a RAID 0 disk configuration you’re now using, you’re exposed to the failure of either drive (which happens quite a bit more than you’d imagine). What’s your data recovery plan? The diagrams above show sharded Mongo, but not replicated data, so at present, the loss of one of those drives would result in the destruction of 1/4 of your data and potentially the corruption of 1/2.

    Have you thought about using a configuration with more physical drives? Most big data applications Blue Box works with are using at a bare minimum 4-6 drives in a RAID 10. This type of configuration would give you a *significant* boost in IO capacity.

    Beyond more drives, I’d recommend looking at using 15k RPM SAS drives.

    Your hosting company should be able to help guide you in these decisions but we’re happy to provide any guidance if you’d like our input.

    – Jesse

  • http://www.phusion.nl/ Hongli Lai

    Thanks for the input Jesse. We never relied on RAID as a backup strategy in the first place. In RAID 1 both disks are performing almost the same I/O operations so if one disk fails then the other will likely follow soon. Instead we’re planning on replicating data to another server and/or dumping the database contents into remote files from time to time.

  • http://www.bluebox.net Jesse Proudman

    Hongli,

    Excellent. I just wanted to make sure you were aware of the risks.

    In 8 years of our operational history, we’ve seen the second disk in a RAID 1 fail during the rebuild of the primary only once, and in that scenario I believe we were still able to recover the data. The main attraction of redundant RAID configurations is avoiding the recovery scramble of a downed box after a disk failure. We proactively monitor every disk in our data center and have them swapped out on failure very quickly. This avoids application outages, having to futz with restoring from backups, the inevitable data loss since the date of your last backup, etc.

    Full replica sets would also solve this issue for you and make the environment truly HA, but since redundant disk setups add such a marginal cost to the total environment config, we’ve often found the reduced headache to be very much worth it.

    We look forward to seeing the continued documentation on the platform! Great work.

    – Jesse

  • Tom M

    Thanks for the update guys.

    I have to say I am quite surprised that you guys thought that this would run on one server of that spec. Like Jesse said, I would be throwing RAID-5/6/10 at it, and I would certainly look at moving away from RAID-0 as soon as possible; it’s simply not a good idea in a production environment.

    Keep your eye on ext4 as a file system too, it looks like it should be good for the future as an alternative to XFS.

    On the bright side, I guess you are lucky to have a lot of interest, rather than very little interest!

  • http://www.phusion.nl/ Hongli Lai

    @Tom: we didn’t plan on running on a single server forever. The plan was to run on a single server for a week or two, see whether people are interested in Union Station, and if so add more servers for high availability etc. That’s why we launched it as a beta and not as a final. We are well aware of the risks of running on a single server. Rest assured that the final paid version will have multiple backups and redundant systems.

  • http://www.bluebox.net Jesse Proudman

    FWIW – I think XFS was the proper FS choice. EXT4 has some nasty data corruption issues that we’re not comfortable with, and XFS for database boxes outperforms EXT3.


  • linuxy guy

    RAID10,f2 may give better performance than RAID1 on your two-disk array. What were your considerations when choosing only a two-disk array, and when choosing RAID1?

  • http://www.phusion.nl/ Hongli Lai

    Our hosting provider has limited options, especially because we needed new servers within several hours. If we had weeks, then custom hardware combinations would be possible.

  • Darryl Stoflet

    Great article. But where can I get me one of them 8-core i7s? And longer term you may want to look into SATA III 6 Gb/s SSD drives in RAID 5/6/10. Costly, yes, but if your bottleneck is exclusively disk I/O then SATA III SSDs have the best bang for the buck.

  • Theo

    What drivers are you using for RabbitMQ and Mongo? It seems to me that the best RabbitMQ driver is the amqp gem, which is EventMachine-based, but em-mongo, the EventMachine-based Mongo driver isn’t fantastic, and has a very limited feature set. At the same time, the available non-EventMachine-based RabbitMQ drivers are really old. Do you use amqp + em-mongo, or carrot (or bunny?) + mongo?

    I’ve been working on a system whose infrastructure looks very much like yours (completely different application though), and my main problem is selecting drivers that work well together.

    The RabbitMQ people are working on a new Ruby driver, thankfully, but it’s months off, so I’m really interested in your experiences.

  • http://www.phusion.nl/ Hongli Lai

    For RabbitMQ we’re using a forked version of Bunny that allows cancellation: https://github.com/FooBarWidget/bunny. It’s old and doesn’t appear to be maintained anymore but it works and gets the job done.
    For MongoDB we just use the mongo gem. Remember that a few months ago we contributed to the mongo gem, making it between 2x and 150x faster? That was for Union Station.
    No EventMachine, we don’t need it.

  • Theo

    I had completely disregarded Bunny since it hadn’t been updated in years, but I just looked at the repo’s network page and the RabbitMQ guys took over the project just the other day. Looks like it’s the one to go for.

  • http://blog.101ideas.cz Botanicus

    Yes, we plan to re-implement bunny on top of the current low-level amq-client gem. We got the blessing of Chris Duncan, the original author, so there shouldn’t be any problems. However, it has lower priority than the AMQP gem, so we haven’t started any serious work on bunny yet.

  • Han

    Hongli, thanks for the post, very informative.

    Are you using mongodb 1.6 or 1.8?

    Thanks

    Han Qiu
