In Technology Preview 1 we’ve shown that Phusion Passenger 3 can be up to 55% faster on OS X. Performance is good and all, but it won’t do you any good unless the software keeps running. When Phusion Passenger goes down it’s an annoyance at best, but in the worst case any amount of down time can cost your organization real money. Any HTTP request that’s dropped can mean a lost transaction.
Although stability, robustness and availability aren’t as hot and fashionable as performance, we have not neglected these areas for Phusion Passenger 3. In fact we’ve been working hard on implementing additional safeguards, as well as refactoring our designs to make things more stable, robust and available.
In Phusion Passenger 2.2’s architecture, there are a number of processes that work together. At the very front there is the web server, which could consist of multiple processes. If you’ve ever typed passenger-memory-stats then you’ve seen the web server processes at work. Apache typically has a dozen processes (prefork MPM) and Nginx typically has 3.
For Phusion Passenger to work, there must be some kind of global state, shared by all web server processes. This global state stores information such as which Ruby app processes exist, which ones are currently handling requests, etc. This allows Phusion Passenger to make decisions such as which Ruby app process to route a request to, whether there is any need to spawn another process, etc. This global state exists in a separate process which all web server processes communicate with. On Apache this is the ApplicationPoolServerExecutable, on Nginx this is the HelperServer. For simplicity’s sake, let’s call both of them HelperServer (in Passenger 3 they’ve both been renamed to PassengerHelperAgent). The HelperServer is written in C++ and is extremely fast and lightweight, consuming only about 500 KB of real memory.
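To make the idea of shared routing state a bit more concrete, here is a minimal sketch in Python. It is not the HelperServer's actual code or policy: the names Pool, AppProcess, checkout and checkin are made up for illustration, the PIDs are fake, and the routing rule (prefer an idle process, spawn if there is room, otherwise pick the least busy one) is only an assumption about what such a decision could look like.

```python
from dataclasses import dataclass, field


@dataclass
class AppProcess:
    pid: int
    sessions: int = 0  # number of requests this process is handling right now


@dataclass
class Pool:
    """Hypothetical stand-in for the global state kept by the HelperServer."""
    max_size: int
    processes: list = field(default_factory=list)
    next_pid: int = 1000  # fake PIDs; a real pool would start real processes

    def checkout(self) -> AppProcess:
        """Pick the process that should handle the next request."""
        idle = [p for p in self.processes if p.sessions == 0]
        if idle:
            chosen = idle[0]                    # an idle process exists: use it
        elif len(self.processes) < self.max_size:
            chosen = self._spawn()              # room left: spawn another process
        else:
            # Pool is full and everybody is busy: take the least busy process.
            chosen = min(self.processes, key=lambda p: p.sessions)
        chosen.sessions += 1
        return chosen

    def checkin(self, process: AppProcess) -> None:
        """Record that a process finished handling a request."""
        process.sessions -= 1

    def _spawn(self) -> AppProcess:
        process = AppProcess(pid=self.next_pid)
        self.next_pid += 1
        self.processes.append(process)
        return process
```

Because every web server process would ask the same Pool, they all see a consistent picture of which application processes are busy. That is also why losing this one process is so painful.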
As you can see the HelperServer is essentially the core of Phusion Passenger. The problem with 2.2 is that if the HelperServer goes down, Phusion Passenger goes down with it entirely. Phusion Passenger will stay down until the web server is restarted. For various architectural reasons in Apache and Nginx, it is not easily possible to restart the HelperServer upon a crash in a reliable way.
Now, why would the HelperServer ever crash?
- Bugs. We are humans too and we can make mistakes, so it’s possible that there are crasher bugs in the HelperServer. In the past 2 years we’ve put a lot of effort into making the HelperServer stable. For example we check all system calls for error results, and we’ve put a lot of effort into making sure that uncaught exceptions are properly logged and handled. However one can never prove that a system is entirely bug free. We aren’t aware of any crasher bugs at this time but they might still exist.
- Limited system resources. For example if the system is very low on memory, the kernel will invoke the Out-Of-Memory Killer (OOM Killer). Properly selecting a process to kill in low-memory conditions is actually a pretty hard problem, and more often than not the OOM Killer selects the wrong process, e.g. our HelperServer.
- System administrator mistakes. Passing the wrong PID to the kill command and things like that.
- System configuration problems, hardware problems (faulty RAM and stuff) and operating system bugs.
There are some people who have reported problems with the HelperServer. Their HelperServer crashes tend to happen sporadically and they usually cannot reproduce the problem reliably themselves. For many people, limited system resources (the second cause listed above) turn out to be the culprit, and increasing the amount of swap is reported to help, but for other people the problems lie elsewhere.
The crashes aren’t always our fault (i.e. not bugs), but they are always our problem. It saddens us to say that we’ve been unable to help these people so far because we simply cannot reproduce their problems even when we mimic their system configuration.
But this is going to change.
Enter Phusion Passenger 3 with self-healing architecture
Phusion Passenger 3 now introduces a lightweight watchdog process into the architecture. It monitors both the web server and the HelperServer. If the HelperServer crashes, then the watchdog restarts the HelperServer immediately.
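The core of such a watchdog is a small supervise-and-restart loop. The sketch below shows the general technique in Python. It is not the actual watchdog, which is written in C++. The function name supervise, the max_restarts limit and the restart_delay parameter are assumptions made for this illustration.

```python
import subprocess
import time


def supervise(argv, max_restarts, restart_delay=0.1):
    """Run argv and restart it every time it exits abnormally.

    Returns the list of exit codes that were observed. The loop stops when
    the child exits cleanly (code 0) or after max_restarts restarts.
    """
    history = []
    while True:
        child = subprocess.Popen(argv)
        returncode = child.wait()      # blocks until the child is gone
        history.append(returncode)
        if returncode == 0:
            return history             # clean exit: nothing left to do
        if len(history) > max_restarts:
            return history             # give up instead of looping forever
        time.sleep(restart_delay)      # brief pause before restarting
```

Note how little the loop does: it sleeps inside wait() almost all the time, which is one reason a watchdog like this can stay tiny.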
Of course if the watchdog is killed then it’s still game over, but we’ve taken extra care in the code to make this extremely unlikely. For starters, the watchdog is extremely lightweight, even more so than the HelperServer. It is written in C++ and uses about 150-200 KB of memory. Its only job is to start the HelperServer and other Phusion Passenger helper processes and to monitor them. The codebase extensively uses C++ idioms that promote code stability, such as smart pointers and RAII. Combined with heavy testing, we expect this to have reduced the chance that the watchdog contains crasher bugs to a minimum. The small footprint and the fact that it does nothing most of the time minimize the chances that it’s killed by the OOM Killer. In fact, if the watchdog is running on Linux and has root access, it will register itself as not OOM-killable.
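For readers wondering what "not OOM-killable" can look like in practice: on current Linux kernels a process can write -1000 to /proc/self/oom_score_adj, which exempts it from the OOM Killer. Lowering the value needs sufficient privileges, which fits the root requirement above. The sketch below shows that mechanism in Python. Whether the watchdog uses exactly this file is an assumption on our part here, and the function name disable_oom_killer is made up for illustration.

```python
def disable_oom_killer(path="/proc/self/oom_score_adj"):
    """Try to exempt the current process from the Linux OOM Killer.

    Writes -1000 (the lowest possible score adjustment) to the given file.
    Returns True on success and False if the file is missing (non-Linux
    system) or we lack the privileges to lower the value.
    """
    try:
        with open(path, "w") as f:
            f.write("-1000")
        return True
    except OSError:
        # Covers FileNotFoundError and PermissionError alike.
        return False
```

The path is a parameter only so that the function can be tried out against an ordinary file without root access.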
No longer will HelperServer crashes take down Phusion Passenger, even if the crash isn’t our fault.
Restarts are fast
It only takes a few hundred milliseconds to restart the HelperServer.
Crashing signals are logged
If the HelperServer crashes then the watchdog will tell you whether it crashed because of a signal, e.g. SIGSEGV. This makes it much easier for system administrators to see why a component crashed so that they might fix the underlying cause. In Phusion Passenger 2.2 this was not possible.
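A parent process can tell how its child died from the child's exit status. The sketch below shows the idea in Python, where subprocess reports a negative return code when the child was terminated by a signal. The function name describe_exit and the message wording are made up for illustration and do not reflect what the watchdog actually logs.

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description.

    On POSIX, Python's subprocess module reports -N when the child was
    terminated by signal number N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name   # e.g. "SIGSEGV"
        except ValueError:
            name = f"signal {signum}"            # number without a known name
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"
```

For example, describe_exit(-signal.SIGSEGV) yields "killed by SIGSEGV", which points a system administrator towards a memory access bug rather than, say, the OOM Killer (which uses SIGKILL).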
Upon shutting down or restarting the web server, Phusion Passenger 2.2 gracefully notifies application processes to shut down. It does not force them to. This would pose a problem for broken web applications that don’t shut down properly, e.g. web applications that are stuck in an infinite loop, stuck in a database call, etc.
Phusion Passenger 3 guarantees that all application processes are properly shut down when you shut down or restart the web server. It gives application processes a deadline of 30 seconds to shut down gracefully; if any of them fail to do so, they’ll be terminated with SIGKILL.
This mechanism works so well that it even extends to background processes that have been spawned off by the web application processes. All of those processes belong to the same process group. Phusion Passenger sends SIGKILL to the entire process group and terminates everything. No longer will you have to manually clean up processes; you can be confident that everything is gone if you shut down or restart the web server.
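The "ask nicely, wait for a deadline, then SIGKILL the whole process group" pattern can be sketched in a few lines of Python. This is an illustration of the technique, not Phusion Passenger's code. The function name shutdown_group is made up, and using SIGTERM as the graceful notification is an assumption. The child must have been started in its own session (start_new_session=True), so that its PID doubles as the process group ID.

```python
import os
import signal
import subprocess


def shutdown_group(child, deadline=30.0):
    """Shut down a child process and everything else in its process group.

    `child` is a subprocess.Popen started with start_new_session=True.
    Returns True if the child exited within the deadline, False if it had
    to be killed.
    """
    try:
        os.killpg(child.pid, signal.SIGTERM)   # graceful request (assumed signal)
    except ProcessLookupError:
        pass                                   # the group is already gone
    try:
        child.wait(timeout=deadline)
        graceful = True
    except subprocess.TimeoutExpired:
        graceful = False
    # Kill whatever is left in the group, including background processes
    # that the application spawned and that ignored the graceful request.
    try:
        os.killpg(child.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    child.wait()                               # reap the child, no zombies
    return graceful
```

SIGKILL cannot be caught or ignored, which is what turns the deadline into a guarantee rather than a polite request.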
Zero-downtime web server restart
In Phusion Passenger 2.2, whenever you restart the web server, HTTP requests that are currently in progress are dropped and the clients receive ugly “Connection reset by server” or similar error messages. This can be a major problem for large websites, because during the 1 second that Phusion Passenger is restarting hundreds of people could be getting errors. If your visitor happens to be clicking on that “Buy” button, well, tough luck.
In Phusion Passenger 3 we’ve implemented zero-downtime web server restart. Phusion Passenger and the web server are restarted in the background, and while this is happening, the old web server instance (with the old Phusion Passenger instance) will continue to process requests.
The architecture is actually a little bit more complicated than what’s shown in the diagram because behind the web server there are a bunch of Phusion Passenger processes, but you get the gist of it.
When the new web server (along with the new Phusion Passenger) has been started, it will immediately begin accepting new requests. Old requests that aren’t finished yet will continue to be processed by the old web server and Phusion Passenger instance. The old instance will shut down 5 seconds after all requests have been finished, to counter the possibility that the kernel still has leftover requests in the socket backlog that hit the old instance after it’s done processing everything already in its queue.
This works so well that we can restart the web server while running an ‘ab’ benchmark with 100 concurrent users without dropping a single request!
Zero-downtime application shutdown
Suppose that a Ruby application process has gone rogue and you want to shut it down. The most obvious way to do that is by sending it a SIGTERM or SIGKILL signal. However this would also abort whatever request it is currently processing.
In Phusion Passenger 2.2, you could also send SIGUSR1 to the process, causing it to shut down gracefully after it has processed all requests in its socket backlog. However this introduces two problems:
- If the website is very busy then the process’s socket backlog will never be empty, and so the process will never exit.
- Exiting after the process has detected an empty backlog can introduce a race condition. Suppose that, right after the process has determined that its backlog is empty but before it has shut down completely, Phusion Passenger tries to send another request to the process. This request would be lost.
In Phusion Passenger 3, SIGUSR1 will now cause the application process to first unregister itself so that Phusion Passenger won’t route any new requests to it anymore. It will then proceed with exiting 5 seconds after its socket backlog has become empty. This way you can gracefully shut down a process without losing a single request.
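The order of operations is what fixes both problems: unregister first, then drain, then wait out a grace period. The toy model below illustrates that order in Python. It is not how application processes are actually implemented. The Worker class, the registry set, the in-memory backlog and the idle_grace parameter are all hypothetical stand-ins.

```python
import time
from collections import deque


class Worker:
    """Toy model of an application process with a request backlog."""

    def __init__(self, registry, idle_grace=5.0):
        self.registry = registry      # set of workers the router may pick from
        self.backlog = deque()        # stand-in for the socket backlog
        self.idle_grace = idle_grace  # seconds the backlog must stay empty
        self.handled = []
        registry.add(self)

    def drain_and_exit(self):
        # Step 1: unregister, so the router stops sending us new requests.
        # On a busy site this is what allows the backlog to become empty.
        self.registry.discard(self)
        # Step 2: keep serving until the backlog has stayed empty for
        # idle_grace seconds, so requests that were already in flight when
        # we unregistered are still handled instead of being lost.
        idle_since = time.monotonic()
        while time.monotonic() - idle_since < self.idle_grace:
            if self.backlog:
                self.handled.append(self.backlog.popleft())
                idle_since = time.monotonic()   # activity resets the timer
            else:
                time.sleep(0.01)
```

In this model, a request that arrives just after the backlog looked empty still gets picked up, because the worker only leaves once the grace period has passed without any activity.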
Although Phusion Passenger has been powering many high-traffic Ruby websites for a while now, some people still have doubts about whether Phusion Passenger is fit for production. Instead of using words to convince them, we would rather convince them with real results. Phusion Passenger 3 raises the bar in the areas of performance, stability, robustness and availability yet higher, but it doesn’t stop here. Please stay tuned for the next Technology Preview, in which we will unveil even more of Phusion Passenger 3.