The Road to Passenger 3: Technology Preview 2 – Stability, robustness, availability, self-healing

By Hongli Lai on June 18th, 2010

In Technology Preview 1 we showed that Phusion Passenger 3 can be up to 55% faster on OS X. Performance is good and all, but it won’t do you any good unless the software keeps running. When Phusion Passenger goes down it’s an annoyance at best, but in the worst case any amount of downtime can cost your organization real money. Any HTTP request that’s dropped can mean a lost transaction.

Although stability, robustness and availability aren’t as hot and fashionable as performance, for Phusion Passenger 3 we have not neglected these areas. In fact we’ve been working hard on implementing additional safeguards, as well as refactoring our designs to make things more stable, robust and available.

Self-healing

In Phusion Passenger 2.2’s architecture, there are a number of processes that work together. At the very front there is the web server, which could consist of multiple processes. If you’ve ever typed passenger-memory-stats then you’ve seen the web server processes at work. Apache typically has a dozen processes (prefork MPM) and Nginx typically has 3.

For Phusion Passenger to work, there must be some kind of global state, shared by all web server processes. This global state stores information such as which Ruby app processes exist, which ones are currently handling requests, etc. This allows Phusion Passenger to make decisions such as which Ruby app process to route a request to, whether there is any need to spawn another process, etc. This global state exists in a separate process which all web server processes communicate with. On Apache this is the ApplicationPoolServerExecutable, on Nginx this is the HelperServer. For simplicity’s sake, let’s call both of them HelperServer (in Passenger 3 they’ve both been renamed to PassengerHelperAgent). The HelperServer is written in C++ and is extremely fast and lightweight, consuming only about 500 KB of real memory.
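
To make the role of that shared state more concrete, here is a minimal Python sketch of the kind of bookkeeping described above. The names `AppProcess`, `Pool`, `max_size`, `checkout` and `checkin`, as well as the routing policy (prefer an idle process, otherwise spawn one if there is room, otherwise pick the least busy one), are assumptions made for illustration. They are not taken from the HelperServer source.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class AppProcess:
    pid: int
    sessions: int = 0  # requests this process is currently handling


@dataclass
class Pool:
    max_size: int = 6
    processes: list = field(default_factory=list)
    # Fake PID generator; a real pool would start actual app processes.
    _fake_pids: count = field(default_factory=lambda: count(1000))

    def _spawn(self):
        proc = AppProcess(pid=next(self._fake_pids))
        self.processes.append(proc)
        return proc

    def checkout(self):
        """Pick the process that should handle the next request."""
        idle = [p for p in self.processes if p.sessions == 0]
        if idle:
            chosen = idle[0]
        elif len(self.processes) < self.max_size:
            chosen = self._spawn()
        else:
            # Pool is full and everyone is busy: take the least loaded one.
            chosen = min(self.processes, key=lambda p: p.sessions)
        chosen.sessions += 1
        return chosen

    def checkin(self, proc):
        """Record that one request on `proc` has finished."""
        proc.sessions -= 1
```

Because every web server process must see the same `Pool`, it makes sense for this state to live in one separate process, which is exactly why that process becomes a single point of failure.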

As you can see the HelperServer is essentially the core of Phusion Passenger. The problem with 2.2 is that if the HelperServer goes down, Phusion Passenger goes down with it entirely. Phusion Passenger will stay down until the web server is restarted. For various architectural reasons in Apache and Nginx, it is not easily possible to restart the HelperServer upon a crash in a reliable way.

Now, why would the HelperServer ever crash?

  1. Bugs. We are humans too and we can make mistakes, so it’s possible that there are crasher bugs in the HelperServer. In the past 2 years we’ve put a lot of effort into making the HelperServer stable. For example we check all system calls for error results, and we’ve put a lot of effort into making sure that uncaught exceptions are properly logged and handled. However one can never prove that a system is entirely bug free. We aren’t aware of any crasher bugs at this time but they might still exist.
  2. Limited system resources. For example if the system is very low on memory, the kernel will invoke the Out-Of-Memory Killer (OOM Killer). Properly selecting a process to kill in low-memory conditions is actually a pretty hard problem, and more often than not the OOM Killer selects the wrong process, e.g. our HelperServer.
  3. System administrator mistakes. Passing the wrong PID to the kill command and things like that.
  4. System configuration problems, hardware problems (faulty RAM and stuff) and operating system bugs.

There are some people who have reported problems with the HelperServer. Their HelperServer crashes tend to happen sporadically and they usually cannot reproduce the problem reliably themselves. For many people (2) is often the cause of HelperServer crashes, and increasing the amount of swap is reported to help, but for other people the problems lie elsewhere.

The crashes aren’t always our fault (i.e. not bugs), but they are always our problem. It saddens us to say that we’ve been unable to help these people so far because we simply cannot reproduce their problems even when we mimic their system configuration.

But this is going to change.

Enter Phusion Passenger 3 with self-healing architecture

Phusion Passenger 3 now introduces a lightweight watchdog process into the architecture. It monitors both the web server and the HelperServer. If the HelperServer crashes, then the watchdog restarts the HelperServer immediately.
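
The core idea of a watchdog can be sketched in a few lines of Python. This is only an illustration of the restart loop, not the actual watchdog, which is written in C++. The function name `supervise` and the `max_restarts` and `backoff` parameters are assumptions added so the sketch cannot loop forever; the text above only says that the HelperServer is restarted immediately.

```python
import subprocess
import time


def supervise(command, max_restarts=3, backoff=0.1):
    """Run `command`, restarting it whenever it exits abnormally.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    while True:
        child = subprocess.Popen(command)
        code = child.wait()           # blocks until the child exits
        exit_codes.append(code)
        if code == 0:
            return exit_codes         # clean shutdown: nothing to heal
        if len(exit_codes) > max_restarts:
            return exit_codes         # give up instead of looping forever
        time.sleep(backoff)           # brief pause before restarting
```

Note how little the loop does: it starts a child, sleeps inside `wait()`, and reacts only when the child dies. That inactivity is a large part of why such a supervisor can stay small.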

Of course if the watchdog is killed then it’s still game over, but we’ve taken extra care in terms of code to try to make this extremely unlikely to happen. The watchdog for starters is extremely lightweight, even more so than the HelperServer. It is written in C++ and uses about 150-200 KB of memory. Its only job is to start the HelperServer and other Phusion Passenger helper processes and to monitor them. The codebase extensively uses C++ idioms that promote code stability, such as smart pointers and RAII. By employing heavy testing as well, we’re expecting to have brought the possibility that the watchdog contains crashing bugs to a minimum. The small footprint and the fact that it does nothing most of the time minimizes the chances that it’s killed by the OOM Killer. In fact, if the watchdog is running on Linux and has root access, it will register itself as not OOM-killable.
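
For readers curious what "not OOM-killable" can look like in practice: on current Linux kernels a process can write `-1000` to `/proc/self/oom_score_adj` to exempt itself from the OOM Killer, and lowering the value requires elevated privileges. (Older kernels exposed `/proc/self/oom_adj` with `-17` for the same purpose.) The Python sketch below shows that mechanism. The function name `disable_oom_killer` is hypothetical, and the text above does not say which of these interfaces the watchdog uses.

```python
def disable_oom_killer(path="/proc/self/oom_score_adj", value="-1000"):
    """Try to exempt the current process from the Linux OOM Killer.

    Returns True on success, False if the file cannot be written,
    for example on a non-Linux system or without sufficient privileges.
    """
    try:
        with open(path, "w") as f:
            f.write(value)
        return True
    except OSError:
        # Covers both "no such file" and "permission denied".
        return False
```

Treating failure as non-fatal matches the behaviour described above: the protection is applied only when the platform and privileges allow it.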

No longer will HelperServer crashes take down Phusion Passenger, even if the crash isn’t our fault.

Restarts are fast

It only takes a few hundred milliseconds to restart the HelperServer.

Crashing signals are logged

If the HelperServer crashes then the watchdog will tell you whether it crashed because of a signal, e.g. SIGSEGV. This makes it much easier for system administrators to see why a component crashed so that they might fix the underlying cause. In Phusion Passenger 2.2 this was not possible.
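
A parent process can tell whether a child died from a signal by inspecting its exit status. The Python sketch below shows the idea using `subprocess`, which reports death by signal as a negative return code. The helper name `describe_exit` and the message wording are made up for this example and do not reflect the watchdog's real log format.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a log-friendly message."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # A child that terminates itself with a signal, standing in for a crash.
    crasher = "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"
    result = subprocess.run([sys.executable, "-c", crasher])
    print("helper", describe_exit(result.returncode))
```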

Guaranteed cleanup

Upon shutting down or restarting the web server, Phusion Passenger 2.2 gracefully notifies application processes to shut down. It does not force them to. This would pose a problem for broken web applications that don’t shut down properly, e.g. web applications that are stuck in an infinite loop, stuck in a database call, etc.

Phusion Passenger 3 guarantees that all application processes are properly shut down when you shut down or restart the web server. It gives application processes a deadline of 30 seconds to shut down gracefully; if any of them fail to do that, they’ll be terminated with SIGKILL.

This mechanism works so well that it even extends to background processes that have been spawned off by the web application processes. All of those processes belong to the same process group. Phusion Passenger sends SIGKILL to the entire process group and terminates everything. No longer will you have to manually clean up processes; you can be confident that everything is gone if you shut down or restart the web server.
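
The deadline-then-SIGKILL pattern, applied to a whole process group, can be sketched in Python as follows. The function name `shutdown_group` is hypothetical, and the sketch assumes the child was started as the leader of its own process group (for example with `start_new_session=True`). It also sweeps the group with SIGKILL even after a graceful exit, which is a design choice of this example and may differ from what Phusion Passenger does internally.

```python
import os
import signal
import subprocess


def shutdown_group(proc, deadline=30.0):
    """Ask `proc` to exit, then SIGKILL its whole process group.

    Returns "graceful" if the process exited before the deadline,
    "killed" if it had to be terminated forcibly.
    """
    pgid = proc.pid
    if os.getpgid(proc.pid) != pgid:
        # Refuse to continue: killing a shared group could hit the caller.
        raise ValueError("child must lead its own process group")

    proc.send_signal(signal.SIGTERM)          # polite request first
    try:
        proc.wait(timeout=deadline)
        outcome = "graceful"
    except subprocess.TimeoutExpired:
        outcome = "killed"

    try:
        # Also catches background processes the child left in the group.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass                                  # group is already empty
    proc.wait()                               # reap the child
    return outcome
```

A caller would start the application with `subprocess.Popen(cmd, start_new_session=True)` and later call `shutdown_group(proc)`; anything the application forked without leaving the group is removed along with it.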

Zero-downtime web server restart

In Phusion Passenger 2.2, whenever you restart the web server, HTTP requests that are currently in progress are dropped and the clients receive ugly “Connection reset by server” or similar error messages. This can be a major problem for large websites, because during the 1 second that Phusion Passenger is restarting hundreds of people could be getting errors. If your visitor happens to be clicking on that “Buy” button, well, tough luck.

In Phusion Passenger 3 we’ve implemented zero-downtime web server restart. Phusion Passenger and the web server are restarted in the background, and while this is happening, the old web server instance (with the old Phusion Passenger instance) will continue to process requests.

The architecture is actually a little bit more complicated than what’s shown in the diagram because behind the web server there are a bunch of Phusion Passenger processes, but you get the gist of it.

When the new web server (along with the new Phusion Passenger) has been started, it will immediately begin accepting new requests. Old requests that aren’t finished yet will continue to be processed by the old web server and Phusion Passenger instance. The old instance will shut down 5 seconds after all requests have been finished, to counter the possibility that the kernel still has leftover requests in the socket backlog that hit the old instance after it’s done processing everything already in its queue.
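
The "linger for 5 seconds after the last request" rule is easy to get subtly wrong, so here is a small Python sketch of the timing logic alone. The class name `DrainingInstance` and its methods are invented for this illustration, and the choice to restart the countdown when a late request arrives is an assumption; the text above only states that the old instance exits 5 seconds after all requests have finished.

```python
class DrainingInstance:
    """Tracks when an old server instance may finally exit."""

    def __init__(self, now, grace=5.0):
        self.grace = grace        # seconds to linger once idle
        self.in_flight = 0        # requests currently being processed
        self.idle_since = now     # when the instance last became idle

    def request_started(self):
        self.in_flight += 1
        self.idle_since = None    # a late arrival cancels the countdown

    def request_finished(self, now):
        self.in_flight -= 1
        if self.in_flight == 0:
            self.idle_since = now  # start (or restart) the countdown

    def may_exit(self, now):
        return (
            self.idle_since is not None
            and now - self.idle_since >= self.grace
        )
```

Timestamps are passed in explicitly so the logic can be exercised without real sleeping; in a server they would come from a monotonic clock.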

This works so well that we can restart the web server while running an ‘ab’ benchmark with 100 concurrent users without dropping a single request!

Zero-downtime application shutdown

Suppose that a Ruby application process has gone rogue and you want to shut it down. The most obvious way to do that is by sending it a SIGTERM or SIGKILL signal. However this would also abort whatever request it is currently processing.

In Phusion Passenger 2.2, you could also send SIGUSR1 to the process, causing it to shut down gracefully after it has processed all requests in its socket backlog. However this introduces two problems:

  • If the website is very busy then the process’s socket backlog will never be empty, and so the process will never exit.
  • Exiting after the process has detected an empty backlog can introduce a race condition. Suppose that, right after the process has determined that its backlog is empty but before it has shut down completely, Phusion Passenger tries to send another request to the process. This request would be lost.

In Phusion Passenger 3, SIGUSR1 will now cause the application process to first unregister itself so that Phusion Passenger won’t route any new requests to it anymore. It will then proceed with exiting 5 seconds after its socket backlog has become empty. This way you can gracefully shut down a process without losing a single request.
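
The ordering is what removes the race: unregister first, drain second. The Python sketch below models that ordering with an in-memory routing table. `Router`, `Worker` and `begin_graceful_shutdown` are hypothetical names, the backlog is a plain list rather than a socket backlog, and the 5-second wait is left out to keep the example short.

```python
import threading


class Router:
    """Hypothetical routing table shared by request dispatchers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._workers = []

    def register(self, worker):
        with self._lock:
            self._workers.append(worker)

    def unregister(self, worker):
        with self._lock:
            self._workers.remove(worker)

    def dispatch(self, request):
        # Selection and enqueueing share one lock with unregister(), so a
        # worker cannot receive a request after it has unregistered.
        with self._lock:
            if not self._workers:
                raise RuntimeError("no workers available")
            worker = min(self._workers, key=lambda w: len(w.backlog))
            worker.backlog.append(request)
            return worker


class Worker:
    def __init__(self, name, router):
        self.name = name
        self.backlog = []
        self.router = router
        router.register(self)

    def begin_graceful_shutdown(self):
        """What a SIGUSR1 handler would trigger in this sketch."""
        self.router.unregister(self)   # step 1: stop receiving new requests
        drained = list(self.backlog)   # step 2: finish what is already queued
        self.backlog.clear()
        return drained
```

Once `unregister` has returned, the backlog can only shrink, so "the backlog is empty" becomes a stable condition on which it is safe to exit, even on a very busy site.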

Conclusion

Although Phusion Passenger has been powering many high-traffic Ruby websites for a while now, some people still have some doubts about whether Phusion Passenger is fit for production. Instead of using words to convince them, we would rather convince them with real results. Phusion Passenger 3 raises the bar in the areas of performance, stability, robustness and availability yet higher, but it doesn’t stop here. Please stay tuned for the next Technology Preview in which we will unveil even more of Phusion Passenger 3.

  • Carl

    When do you plan to release any beta or RC?

  • http://www.phusion.nl/ Hongli Lai

    Soon. That’s why we’re blogging about it.

  • http://www.phusion.nl/ Ninh Bui

    @Carl:
    Probably not the answer you were looking for, but “When it’s done”. We’re currently performing beta tests with some of the most high performing Ruby environments out there to make sure Phusion Passenger will work as expected out of the box when we put it in a release.

    Please bear with us a little longer, we’re working as hard as we can, for what it’s worth, it shouldn’t take TOO long from now anymore.

  • http://www.akitaonrails.com AkitaOnRails

    Awesome work guys! This new architecture is really juicy and I like it a lot. Can’t think of a better deployment solution. Very excited to give it a run asap :-)

  • http://michaelvanrooijen.com/ Michael van Rooijen

    I never really had any issues with Phusion Passenger 2, and Phusion Passenger 3 as mentioned “raises the bar” by far. Actually every topic covered in this article is really interesting and is a great addition to Phusion Passenger. Really like this “self-healing” concept. It really makes it easy on the user that’s actually “using” Phusion Passenger.

    Thanks for the updates, great material!

  • http://matiaskorhonen.fi Matias Korhonen

    How does the performance match up to an nginx+unicorn setup?

  • http://www.phusion.nl/ Ninh Bui

    @Matias Korhonen:

    Passenger 2.x is on par in terms of performance with other deployment solutions. Passenger 3.x is up to 55% faster than 2.x, you do the math ;-)

  • Jensen

    Your ideas are so awesome. Thanks for the good work!

  • http://ryansobol.com Ryan Sobol

    I’m impressed with the detailed and pragmatic thought we’re seeing in these technology previews. As a system administrator and application developer, I’m looking forward to putting the next major release of Passenger to the test in my production environments. Keep up the great work Phusion team!

  • Cary

    Looks great. Keep up the awesome work

  • http://github.com/mitchellh Mitchell Hashimoto

    Looks great! Question: Why aren’t you guys using two ‘watchdog’ processes with one being the master and one being the failover backup? That way the two watchdog processes watch each other and the master watches the HelperServer. This gets rid of the “if the watchdog process is killed, you’re still out of luck” issue.

  • http://www.phusion.nl/ Ninh Bui

    @Mitchell Hashimoto:

    So what should happen if the watchdog of the watchdog crashes? Should we have one for that one too? ;-) All kidding aside, the chances that the watchdog dies are so unlikely that it should suffice as is. The watchdog process has been reduced in complexity as far as we can get it, so adding another watchdog would make it at least as complex without any gains.

  • http://samsoff.es Sam Soffes

    This sounds awesome! Can’t wait to try it out!

  • Justin

    Does this mean that we could restart the webserver somehow as part of the deploy script to have zero downtime deployments?

  • Tom Robinson

    Presumably the 2 watchdogs would watch each other. It’s very unlikely both would be killed simultaneously.

  • http://thoughtbot.com Jason Morrison

    It’s great to see continued improvement on these areas! Passenger is excellent.

    I’m curious, though – why did you choose to write Watchdog instead of recommending a tool like monit/god/bluepill?

  • http://www.phusion.nl/ Hongli Lai

    Monit/God/Bluepill are designed for long-running daemons. In contrast, the Phusion Passenger helper processes are designed to be started and shut down along with the web server. Furthermore the watchdog is responsible for setting up the process group which is what allows us to clean up everything, including runaway background processes spawned by the application. This cannot be done with Monit/God/Bluepill.

  • http://kablingy.ie Steve Quinlan

    As one of the people affected by the HelperServer crashes, I’m looking forward to the new release. For now I’ve switched to nginx+thin as I still get one crash a day on passenger 2, but look forward to switching back to passenger 3.
    Thanks for all of your efforts.

  • adam

    Looks awesome !!

    I’m with Mitchell Hashimoto and Tom Robinson … two watchdogs may enhance more the robustness of PP3 :-)

    Great job Phusion … can’t wait for the release.

  • http://gaveen.owain.org Gaveen

    Great work guys. One request though on behalf of all the SysAdmin/Ops crowd. Can you please provide an alternative installation means other than the script/installer approach? At least a proper building from source guide. Those two things can make things a lot easier to maintain within a configuration managed environment and packaging with Linux distros.

  • http://www.phusion.nl/ Hongli Lai

    Gaveen, Brightbox provides Ubuntu packages.

    Passenger 3 will also provide ways to allow you to create your own packages more easily.

  • http://www.phusion.nl/ Ninh Bui

    @adam @Tom Robinson
    The watchdog process is already very minimal in footprint; it’s therefore very unlikely for it to go down, and if it goes down, we’d be interested in finding out what caused it. In other words, in this situation, if code fails, it should fail hard so that we can fix it. The watchdog process just shouldn’t go down; if it does, it’s a good indication that something is probably seriously wrong with the system.

    Now, if we were to introduce another watchdog for the watchdog, they’d both not be very different in terms of code and so, if there is an error in the watchdog code that allows for it to go down, it’d not have any gains to have another watchdog which in terms of codes is identical… both watchdogs would still go down in that situation. I hope this clarifies our rationale of why it won’t make a lot of sense to have a watchdog for a watchdog.

  • adam

    Thanks Ninh for the answer … now I see and understand better why this choice.

  • Jeff

    It seems like an odd decision to use C++ for the watchdog process if that is really the only piece of the system that should never fail. No matter how much you try to protect yourself with smart pointers or whatever else, I find it hard to believe that this is a more safe and stable way to go than using some kind of garbage collected language with more safeguards. Requiring the JVM would probably be too heavyweight, but why not use a little ruby 1.9 process, or even a Lua process. Surely the watchdog isn’t under tight timing requirements since it’s basically just a heartbeat and restart service.

  • http://www.phusion.nl/ Hongli Lai

    @Jeff: Nginx is written in C. Has it ever crashed for you? Your OS kernel is written in C or C++, how often does that crash? Our watchdog’s line count doesn’t come anywhere near Nginx’s or your OS kernel’s.

    Furthermore, avoiding the OOM killer requires a very small footprint. Only C and C++ (and assembly) allow you to do that.

  • http://gaveen.owain.org Gaveen

    @Hongli Lai: I was talking about packaging in general and pushing into official repos (as opposed to 3rd party repos) where it’s maintained as a part of the distro.

    While BrightBox Debs are great, not everyone runs Ubuntu on production servers. So to have Passenger in official distro repos will be a good thing for Passenger adoption, as well as Ruby based webapp adoption within distros.

    Anyway it’s good news about Passenger 3. :)

  • http://zyphmartin.com Brandon Martin

    This is awesome. Great work and thank you. Passenger 2 has given me no problems at all other than restarts resetting connections so I am really excited to get Passenger 3 in production.

    Thanks again!

  • http://www.supersaas.com Jan M

    Looks awesome guys, can’t wait to start using this.

  • http://brynary.com Bryan Helmkamp

    Looks great. Thanks for all the hard work!

    Will it be possible to do a graceful app restart where the old processes continue servicing requests until the new processes are ready?

    This has been important for me before when the app booting process itself takes long enough that a request queued for that time is effectively lost. Unicorn supports this, I believe.

    -Bryan

  • http://www.felixogg.nl Felix Ogg

    This discussion started on Twitter, where I called your attitude “pompous”, to my close friends, but open to see for people on Twitter. That is un-chique, so NinH rightfully claimed he was defenseless – be it a bit sensitive, but still. Furthermore, I don’t have time to make this into a lengthy discussion, it serves neither of us, so don’t count on us getting to an agreement.

    I will explain my wording.

    But first, you need to know that I think Passenger is a good product, a valuable addition to the open source world. It is by far the best way to serve up a rails app. I use it myself.

    I’ve seen you guys presenting, I read some blog posts, so I tend to generalize. Again, that’s unfair, so I’ll stick to just this one post, technology preview 2.
    Pompous means making yourself look ‘annoyingly self-important’, but it’s of course highly opinionated. Henceforth, I can only give two key pieces of constructive feedback that would make your posts/presentations seem LESS POMPOUS to me (I.E. less annoying).

    1. Point readers to the good ideas you borrowed, give credit.
    (You did not invent the concept Watchdog, nor the SIGTERM/SIGKILL signalling, nor the rolling updates for multiple server instances. You just re-implemented these. )

    BECAUSE:
    – If you provide the reference you can explain it in the “for dummies” style like you do so well, but the references earn trust from serious (“enterprisy”?) people. You need the serious people to grow the community.
    -If you provide the reference, maybe you will inspire someone to read them, who can actually apply the same concepts, AS WELL AS OTHER CONCEPTS FOUND IN THE SAME BOOK/SOURCE. While looking up the sources of the stuff you borrowed, they will find solutions to their OTHER PROBLEMS and – again, implement them for the greater good.

    By leaving out references you imply that you ‘invented’ stuff, which you clearly haven’t, which looks self-important (and ignorant) to people who know this stuff.
    BTW: If you are now saying: “But really, I never read about this, I just built it”, then – indeed you are re-inventing the wheel and you really should read more books.

    2. Separate your marketing fluff from your educational writing
    (If you are making time to educate people, don’t get caught inserting unsupported marketing claims in between. For example this is pure nonsense, and very much unneeded:

    “The codebase extensively uses C++ idioms that promote code stability, such as smart pointers and RAII. Because of this, the possibility that the watchdog contains crashing bugs is extremely low.”

    How extremely low? Once in a million years? How do you know that? Low compared to what? Aren’t you essentially saying “well we use a programming language, and some libraries, we followed other smart people’s guidelines and since it’s only a small codebase, and because we tested it quite a lot, we HOPE there are fewer bugs in that component, than in the average work WE deliver.” That’s about what you have, anything more, is plain self-important.

    BECAUSE
    – You write to educate others, to show by example how Phusion strives to take Rails to higher levels. And probably to inspire others to do the same, with similar openness. If you start ‘selling’ yourself or your product, inspiration is lost. Whom did you learn most from at University; the all-knowing arrogant professor, or the modest and forgiving Ph.D. assistant?
    – You lose your credibility whenever people find out you are presenting guesses as facts.

    HOW
    – We know you’re talking about Phusion, I like your honesty, please leave out the self-indulgent “we’re sooooo much better than everyone else out there” statements. I would not be reading your post, if I was not contemplating testing Phusion 3 in beta, or researching its sources to find a solution to my problem.

    For example, don’t get caught raising bars that are over your head. Let me LOWER yours a little for you:

    ” Instead of using words convince them, we would rather convince them with real results. Phusion Passenger 3 raises the bar in the areas of performance, stability, robustness and availability yet higher, but it doesn’t stop here.”

    Phusion Passenger 3 is here implicitly compared to other Rails servers, of which – frankly – there aren’t too many. You are not raising any bars in the minds of the people doubting “Ruby in Production”. And it’s counter-productive to blabber like this to them. Their bars are actually a lot higher, they consider your technology ancient, and – like me – wonder why you didn’t put it in in the first place. How can you do without it? What else are you missing, that is required to make seriously heavy production systems stable and robust?

    Finally, I find it striking that you even bring up raising bars, instead of just being honest and saying something like:

    “Instead of using words to convince people, we prefer to be judged by our results. We challenge any Phusion 2 site, as heavy as can be, to try Phusion 3, and try out the test scenarios we featured in this article. If you get Phusion 3 to drop a single request, we’ll buy you a pizza. Success or failure, in either case, we expect you to blog it. :-)”

    To catch all this in two sentences:
    Microsoft tried to repeat that their product “Bob” was so awesome it was beyond awesomeness. This MS Bob technology inside! Apple’s iPhone just makes you quiet the first time you use it, while Jobs says “This is nice huh? It’s running BSD Unix, like our computers. That’s as stable as we can make it for you.” References and technological modesty. There you have it.

  • http://iain.nl/ iain

    You talk about modesty and cite Apple Inc.? The guys who “reinvented the phone”?

  • http://www.phusion.nl/ Ninh Bui

    @Felix Ogg:
    First off, I’m kind of honored that someone who claims to not have a lot of time invests a massive amount of bytes to form a comment ;-) Anyway, thanks for sharing your thoughts, it’s much appreciated.

    Having said that, below you’ll find my reply to your “points”:

    1. First off, as repeatedly stated to you on twitter and something I’d like to underline in this comment, we never remotely claimed we invented these concepts, nor would we dare to claim such a thing as having invented signaling. Not only because that would be a gross falsity but just also for the fact that we respect our audience in having the knowledge in something like signal handling.

    Furthermore, we’ve named the concepts by their proper name so anyone who would have the scrutiny to be interested in these concepts would easily be able to perform a google search to find scientific papers on these matters in all sorts of forms, should they desire to do so. In fact, they are so numerous, I wouldn’t even know where to begin with regards to referencing. Call me a fool, but I believe our audience is scrutinous and smart enough to do their own part of research as well ;-).

    Now, let’s flip the burger, should one who refers to the observer pattern also make reference to Gamma et al’s Gang of Four design pattern book as well as their first scientific paper on these things in an effort to not come across as being pompous? Am I pompous because I don’t do:

    // Oh this is a derivative of the observer pattern, which was originally created
    // by Erich Gamma et al. or at least formally described by them  in their excellent
    // Gang of Four design pattern book. I didn't invent this!
    // ISBN 0-201-63361-2
    class Foo : Observer {
    
    };
    

    Should “man kill” also be called pompous because it does not reference to the theory behind signal handling in processes? If you think so too, then I guess there is no point in discussing this any further as we’ll fundamentally differ in opinions on these matters. By that logic, you’d probably find nearly all developers pompous then I guess ;-)

    Furthermore, you make mention of us not reading enough books, yet you claim we are the ones being pompous… I find that kind of interesting especially when you referenced a paper by Philips from 2006 as proof of fault tolerant systems predating our blog post, to which I replied that Joe Armstrong implemented these concepts as early as in 1986 in Erlang. Undoubtedly, the latter was preceded by other efforts as well as I’ve been able to find papers going back to the 70s on this topic on ACM with ease.

    2. Let me just reply to the sentence you quoted and try to elaborate on that. In fact, our blog has a commenting system just for that purpose. There are a few factors why we make this claim:

    The watchdog is extremely lightweight, i.e. less than 1000 LOC, most of it being C++ boilerplate and error checking. With code that is this lightweight it’s very feasible to test this extremely well, which we have done indeed. In fact, most of the code of this watchdog is actually to test that conditions pass as expected and get handled correctly as well. There are of course a myriad of anomalies that could lead applications to crash, but we’ve made an extremely big effort in getting this minimized when it comes to code. The use of idioms and libraries that have been integrated in far larger projects than Passenger and have had a significant effect with regards to stability (e.g. reducing the possibility of segfaults by using smart pointers and preventing memory leaks by using RAII) is something we wish to underline. To say it is only speculation however would be doing it a disservice, as I’ve mentioned in one of the earlier comments, we’re currently beta testing Passenger not only in our environments but also at some of the most demanding Ruby environments out there to make sure that it is holding its own in those environments as well.

    It is for this reason that I don’t entirely agree with your paraphrasing as it seems to be overseeing the fact that we’re actively battle testing this in various live production environments (some of the most demanding in fact) right now as we speak and have done so for quite some time now. In fact, we’ve been testing well over half a year now, and we firmly believe that by using these techniques and doing intensive testing we’ve reduced the possibilities significantly of allowing the watchdog to contain crash bugs especially in contrast to the alternate scenario that we would NOT have used any of these techniques.

    Beta test results seem to back this up for the moment, but I understand what you’re trying to get at: indeed, I suppose I can’t give you a probability such that P(X=”watchdog crashes”)<0.0001 as a definition of “extremely low”, so I will edit this in the blog post to reflect that: instead of “is” I’ll use “expect” in the sentence with regards to the watchdog. However, I find it interesting that you seem to take such extreme offense of this as you’re quoting Apple who are well known for making far bolder claims like “reinventing the phone”. I’d only think it would be fair if you’d email Steve Jobs right now too to get a scientific backing on that claim. ;-)

    3. It’s funny that you first mention that you find Phusion Passenger a good product, but at the end of your comment call it ancient and are apparently capable of speaking on behalf of all our users/clients. That’s an indication of a troll I guess, and shame on me for indulging you with this comment ;-).

    Now, in particular, you mention “Their bars are actually a lot higher, they consider your technology ancient, and – like me – wonder why you didn’t put it in in the first place”. Would you like me to find a cure for cancer too while I’m at it? (Trust me, if I could, I would but the fact remains that we’re mere mortals, not wizards).

    All kidding aside, I could go on for a while, but you know what I’m thinking instead? Phusion Passenger has been an open source project from the very beginning, for over two years now in fact. If there were particular features that you needed in there, there would’ve been plenty of opportunity for you to implement these yourself. So what’s your excuse for not doing so? ;-)

  • http://tompurl.com Tom Purl

    Wow, all of this looks really great! I have a few questions about the “Zero-downtime web server restart” feature though.

    First, what failure mode does this new feature prevent? And in what scenario would I use it? For example, when I restart a web server, I usually do it for one of two reasons:

    1) The web server has crashed or is not responsive
    2) I have deployed new code or changed a properties file

    Obviously, the first scenario would be handled by the watchdog process. So is scenario #2 handled by this “Zero-downtime web server restart” feature? Can I make an arbitrary change to my application, and have the old and new versions hosted simultaneously?

  • http://www.phusion.nl/ Hongli Lai

    Tom:
    1) Actually our watchdog does not restart the web server, it just restarts Phusion Passenger components. If you want to monitor the web server you need to use stuff like Monit or Daemontools.
    2) What you’re describing is rolling application restarts. That’s a whole other thing, not related to this.

    Instead zero-downtime web server restart is for when you’ve modified the web server configuration file and need the web server to reload it.

  • http://www.maach.eu Rachid Al Maach

    This sounds great! can’t wait to try and use it! great job

  • http://www.dburry.com/ David Burry

    You guys’ twama llama turn threadnaught is classic, I haven’t eye-rolled-loled so hard in ages. Thanks for making my day! :)

    And keep up the good work improving passenger, I like where things are going in general… It shows you do listen to what people want and need, once you are convinced. Which is exactly the way you should be doing it as keeper of your own open codebase.

    And if anyone thinks the work should be done very differently or at a very different pace, they are free to do better. And while that is an appropriate response to a troller, thanks for not using that as a pat answer to every helpful suggestion and discourse about possible ways to improve the product! You guys rock!

    In regards to the whole “watchdog of the watchdog” thing… I would expect a reasonable two-watchdog implementation to be like a master-slave thing, where the slave watchdog turns into a new master when it notices its master died (i.e. launching a new slave, etc).

    However, I can also see it as a logical opinion that many people underestimate the increased complexity of such a system, and the negative impact that can have on the stability of the watchdogs in the first place. Of course the only way to truly tell is to build both in a modular fashion and run the same site with half the server farm one way and half the other way… for.. um.. probably years? :) Anyone interested in building this and running it to see? … I thought so :)

  • facepalm

    > This works so well that we can restart the web server while running an ‘ab’ benchmark with 100 concurrent users without dropping a single request!

    That’s because `apachectl restart` doesn’t drop connection.

  • http://www.phusion.nl/ Hongli Lai

    Actually, `apachectl restart` *does* drop connections, while `apachectl graceful` doesn’t.