
SHA-3 extensions for Ruby and Node.js

By Hongli Lai on October 6th, 2012

A few days ago, NIST announced the winner of the SHA-3 competition: Keccak (pronounced [kɛtʃak], ketchak). The researchers who authored Keccak released a reference implementation in C.

We’ve created Ruby and Node.js extensions for Keccak. Our extensions utilize the code from the official reference implementation and come with an extensive suite of unit tests. Note, however, that I do not claim to be a security expert; feel free to review the code for any flaws.

Install with:

gem install digest-sha3
npm install sha3

We’ve strived to emulate each language’s standard hash library interface, so using these extensions is straightforward:

require 'digest/sha3'
Digest::SHA3.hexdigest("foo")

and

var SHA3 = require('sha3');
var hash = SHA3.SHA3Hash();
hash.update('foo');
hash.digest();

Both libraries are MIT licensed. Enjoy!

Why does the world need SHA-3?

If you’re not a security researcher then you’ve undoubtedly asked the same questions as I did. What’s wrong with SHA-1, SHA-256 and SHA-512? Why does the world need SHA-3? Why was Keccak the winner?

According to well-known security researcher Bruce Schneier, there’s nothing wrong with the SHA-2 family of hashing functions. However, he likes SHA-3 because it’s completely different. SHA-1 and SHA-2 are both based on the Merkle–Damgård construction, so it may be feasible to find a flaw in SHA-1 that also affects SHA-2. In contrast, Keccak is based on the sponge construction, so any attacks on SHA-1 and SHA-2 will probably not affect Keccak. Indeed, it appears that NIST chose Keccak because it was looking for some kind of insurance in case SHA-2 is ever broken. Many people commented that they expected Skein to win.

Do not hash your passwords

In any case, the following cannot be repeated enough. Do not use SHA-3 (or SHA-256, SHA-512, RIPEMD-160 or whatever fast hash) to hash passwords! Do not even use SHA-3 + salt to hash passwords. Instead, use a slow hash like bcrypt. As Coda Hale explained in his article, you’ll want the hash to be slow so you can defend against attackers effectively.
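
A minimal example with the bcrypt-ruby gem:

require 'bcrypt'

# Hashing: a random salt is generated and embedded in the result, and
# the underlying algorithm is deliberately expensive to compute.
hashed = BCrypt::Password.create("my secret")

# Verification: the salt and cost are read back from the stored hash.
BCrypt::Password.new(hashed) == "my secret"   # => true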

The right way to deal with frozen processes on Unix

By Hongli Lai on September 21st, 2012


Those who administer production Unix systems have undoubtedly encountered the problem of frozen processes before. They just sit around, consuming CPU and/or memory indefinitely until you forcefully shut them down.

Phusion Passenger 3 – our high-performance and advanced web application server – was not completely immune to this problem either. That is until today, because we have just implemented a fix which will be part of Phusion Passenger 4.0 (4.0 hasn’t been released yet, but a first beta will appear soon). It’s going into the open source version, not just Phusion Passenger Enterprise, because we believe this is an important change that should benefit the entire community.

Behind our solution lies an array of knowledge about operating systems, process management and Unix. Today, I’d like to take this opportunity to share some of our knowledge with our readers. In this article we’re going to dive into the operating system-level details of frozen processes. Some of the questions answered by this article include:

  • What causes a process to become frozen and how can you debug them?
  • What facilities does Unix provide to control and to shut down processes?
  • How can a process manager (e.g. Phusion Passenger) automatically clean up frozen processes?
  • Why was Phusion Passenger not immune to the problem of frozen processes?
  • How is Phusion Passenger’s frozen process killing fix implemented, and under what constraints does it work?

This article attempts to be generic, but will also provide Ruby-specific tips.

Not all frozen processes are made equal

Let me first clear some confusion. Frozen processes are sometimes also called zombie processes, but this is not formally correct. Formally, a zombie process as defined by Unix operating systems is a process that has already exited, but its parent process has not waited for its exit yet. Zombie processes show up in ps and on Linux they have “<defunct>” appended to their names. Zombie processes do not consume memory or CPU. Killing them – even with SIGKILL – has no effect. The only way to get rid of them is to either make the parent process call waitpid() on them, or by terminating the parent process. The latter causes the zombie process to be “adopted” by the init process (PID 1), which immediately calls waitpid() on any adopted processes. In any case, zombie processes are harmless.
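
You can see this for yourself with a few lines of Ruby on any Unix system:

# The child exits immediately, but we don't reap it yet, so it becomes
# a zombie.
pid = fork { exit }
sleep 1
# ps now shows the child in state 'Z' (defunct).
system("ps -p #{pid} -o pid,stat,comm")
# Reaping the child makes the zombie disappear.
Process.waitpid(pid)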

In this article we’re only covering frozen processes, not zombie processes. A process is often considered frozen if it has stopped responding normally: it appears stuck inside something and is not throwing errors. Some of the general causes are:

  1. The process is stuck in an infinite loop. In practice we rarely see this kind of freeze.
  2. The process is very slow, even during shutdown, causing it to appear frozen. Some of the causes include:
    2.1. It is using too much memory, causing it to hit the swap. In this case you should notice the entire system becoming slow.
    2.2. It is performing an unoptimized operation that takes a long time. For example you may have code that iterates through all database table rows to perform a calculation. In development you only had a handful of rows so you never noticed this, but you forgot that in production you have millions of rows.
  3. The process is stuck in an I/O operation. It’s waiting for an external process or a server, be it the database, an HTTP API server, or whatever.

Debugging frozen processes

Obviously fixing a frozen process involves more than figuring out how to automatically kill it. Killing it just fixes the symptoms, and should be considered a final defense in the war against frozen processes. It’s better to figure out the root cause. We have a number of tools in our arsenal to find out what a frozen process is doing.

crash-watch

crash-watch is a tool that we’ve written to easily obtain process backtraces. Crash-watch can be instructed to dump backtraces when a given process crashes, or to dump a process’s current backtrace on demand.

Crash-watch is actually a convenience wrapper around gdb, so you must install gdb first. It dumps C-level backtraces, not language-specific backtraces. If you run crash-watch on a Ruby program it will dump the Ruby interpreter’s C backtraces and not the Ruby code’s backtraces. It also dumps the backtraces of all threads, not just the active thread.

Invoke crash-watch as follows:

crash-watch --dump <PID>

Here is a sample output. This output is the result of invoking crash-watch on a simple “hello world” Rack program that simulates being frozen.

Crash-watch is especially useful for analyzing C-level problems. As you can see in the sample output, the program is frozen in a freeze_process call.

Crash-watch can also assist in analyzing problems caused by Ruby C extensions. Ruby’s mysql gem is quite notorious in this regard because it blocks all Ruby threads while it’s doing its work. If the MySQL server doesn’t respond (e.g. because of network problems, because the server is dead, or because the query is too heavy) then the mysql gem will freeze the entire process, making it even unable to respond to signals. With crash-watch you are able to clearly see that a process is frozen in a mysql gem call.

Phusion Passenger’s SIGQUIT handler

Ruby processes managed by Phusion Passenger respond to SIGQUIT by printing their backtraces to stderr. On Phusion Passenger, stderr is always redirected to the global web server error log (e.g. /var/log/apache2/error.log or /var/log/nginx/error.log), not the vhost-specific error log. If your Ruby interpreter supports it, Phusion Passenger will even print the backtraces of all Ruby threads, not just the active one. This latter feature requires either Ruby Enterprise Edition or Ruby 1.9.
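
You can send the signal from the command line:

kill -QUIT <PID>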

Note that for this to work, the Ruby interpreter must be responding to signals. If the Ruby interpreter is frozen inside a C extension call (such as is the case in the sample program) then nothing will happen. In that case you should use crash-watch or the rb_backtrace() trick below.

rb_backtrace()

If you want to debug a Ruby program that’s not managed by Phusion Passenger then there’s another trick to obtain the Ruby backtrace. The Ruby interpreter has a nice function called rb_backtrace() which causes it to print its current Ruby-level backtrace to stdout. You can use gdb to force a process to call that function. This works even when the Ruby interpreter is stuck inside a C call. This method has two downsides:

  1. Its reliability depends on the state of the Ruby interpreter. You are forcing a call from arbitrary places in the code, so you risk corrupting the process state. Use with caution.
  2. It only prints the backtrace of the active Ruby thread. It’s not possible to print the backtraces of any other Ruby threads.

First, start gdb:

$ gdb

Then attach gdb to the process you want:

attach <PID>

This will probably print a whole bunch of messages. Ignore them; if gdb prints a lot of library names and then asks you whether to continue, answer Yes.

Now we get to the cream. Use the following command to force a call to rb_backtrace():

p (void) rb_backtrace()

You should now see a backtrace appearing in the process’s stdout:

from config.ru:5:in `freeze_process'
from config.ru:5
from /Users/hongli/Projects/passenger/lib/phusion_passenger/rack/thread_handler_extension.rb:67:in `call'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/rack/thread_handler_extension.rb:67:in `process_request'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler/thread_handler.rb:126:in `accept_and_process_next_request'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler/thread_handler.rb:100:in `main_loop'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/utils/robust_interruption.rb:82:in `disable_interruptions'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler/thread_handler.rb:98:in `main_loop'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler.rb:432:in `start_threads'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler.rb:426:in `initialize'
from /Users/hongli/Projects/passenger/lib/phusion_passenger/request_handler.rb:426

strace and dtruss

The strace tool (Linux) and the dtruss tool (OS X) can be used to see which system calls a process is calling. This is especially useful for detecting problems belonging to categories (1) and (2.2).

Invoke strace and dtruss as follows:

sudo strace -p <PID>
sudo dtruss -p <PID>

Phusion Passenger’s role

Phusion Passenger was traditionally architected to trust application processes. That is, it assumed that if we tell an application process to start, it will start, and if we tell it to stop, it will stop. In practice this is not always true. Applications and libraries contain bugs that can cause freezes, or interaction with an external buggy component causes freezes (network problems, database server problems, etc.). Our point of view was that the developer and system administrator should be responsible for these kinds of problems. If the developer/administrator does not manually intervene, the system may remain unusable.

Starting from Phusion Passenger 3, we began turning away from this philosophy. Our core philosophy has always been that software should Just Work™ with the least amount of hassle, and that software should strive to be zero-maintenance. As a result, Phusion Passenger 3 introduced the Watchdog, Phusion Passenger Enterprise introduced request time limiting, and Phusion Passenger 4 will introduce application spawning time limiting. These things attack different aspects of the freezing process problem:

  1. Application spawning time limiting solves the problem of application processes not starting up quickly enough, or application processes freezing during startup. This feature will be included in the open source version of Phusion Passenger.
  2. Request time limiting solves the problem of application processes freezing during a web request. This feature is not available in the open source version of Phusion Passenger, and only in Phusion Passenger Enterprise.
  3. The Watchdog traditionally only solves the problem of Phusion Passenger helper processes (namely the HelperAgent and LoggingAgent) freezing during web server shutdown. Now, it also solves the problem of application processes freezing during web server shutdown. These features, too, will be included in the open source version of Phusion Passenger.

The shutdown procedure and the fix

When you shut down your web server, the Watchdog will be the one to notice this event. It will send a message to the LoggingAgent and the HelperAgent to tell them to gracefully shut down. In turn, the HelperAgent will tell all application processes to gracefully shut down. The HelperAgent does not wait until they’ve actually shut down. Instead it will just assume that they will eventually shut down. It is for this reason that even if you shut down your web server, application processes may stay behind.

The Watchdog was already designed to assume that agent processes could freeze during shutdown, so it gives agent processes a maximum amount of time to shut down gracefully. If they don’t, the Watchdog forcefully kills them with SIGKILL. And it doesn’t just kill the agent processes, but also all application processes.

The fix was therefore pretty straightforward: always have the Watchdog kill application processes, even if the HelperAgent terminates normally and in time. The final fix was effectively 3 lines of code.

Utilizing Unix process groups

The most straightforward way to shut down application processes would be to maintain a list of their PIDs and kill them one by one. The Watchdog however uses a more powerful and little-used Unix mechanism, namely process groups. Let me first explain them.

A system can have multiple process groups. A process belongs to exactly one process group. Each process group has exactly one “process group leader”. The process group ID is equal to the PID of the group leader.

The process group that a process belongs to is inherited from its parent process upon forking. However a process can be a member of any process group, no matter what group the parent belongs to.


(Figure: two example process trees; same-colored processes belong to the same process group. As the right-hand tree shows, process group membership is not constrained by parent-child relationships.)

You can simulate such a process tree using the following Ruby code.

top = $$
puts "Top process PID: #{$$}"
Process.setpgrp
pid = fork do
  pid = $$
  pid2 = fork do
    pid2 = $$
    Process.setpgrp
    pid3 = fork do
      pid3 = $$
      sleep 0.1
      puts "#{top} belongs to process group: #{Process.getpgid(top)}"
      puts "#{pid} belongs to process group: #{Process.getpgid(pid)}"
      puts "#{pid2} belongs to process group: #{Process.getpgid(pid2)}"
      puts "#{pid3} belongs to process group: #{Process.getpgid(pid3)}"
    end

    # We change process group ID of pid3 after the fact!
    Process.setpgid(pid3, top)

    Process.waitpid(pid3)
  end
  Process.waitpid(pid2)
end
Process.waitpid(pid)

As you can see, you can change the process group membership of any process at any time, provided you have the permissions to do so.

The kill() system call provides a neat little feature: it allows you to send a signal to an entire process group! You can already guess where this is going.
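
In Ruby this looks as follows: passing a negative PID to Process.kill addresses the entire process group whose ID is the absolute value of that number.

# 'pgid' is a process group ID, i.e. the PID of the group leader.
Process.kill('KILL', -pgid)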

Whenever the Watchdog spawns an agent process, it creates a new process group for it. The HelperAgent and the LoggingAgent are both process group leaders of their own process group. Since the HelperAgent is responsible for spawning application processes, all application processes automatically belong to the HelperAgent’s process group, unless the application processes themselves change this.

Upon shutdown, the Watchdog sends SIGKILL to the HelperAgent’s and the LoggingAgent’s process groups. This ensures that everything is killed, including all application processes and even whatever subprocesses they spawn.

Conclusion

In this article we’ve considered several reasons why processes may freeze and how to debug them. We’ve seen which mechanisms Phusion Passenger provides to combat frozen processes. We’ve introduced the reader to Unix process groups and how they interact with Unix signal handling.

A lot of engineering has gone into Phusion Passenger to ensure that everything works correctly even in the face of failing software components. As you can see, Phusion Passenger uses proven and rock-solid Unix technologies and concepts to implement its advanced features.

Support us by buying Phusion Passenger Enterprise

It has been a little over a month now since we’ve released Phusion Passenger Enterprise, which comes with a wide array of additional features not found in the open source Phusion Passenger. However as we’ve stated before, we remain committed to maintaining and developing the open source version. This is made possible by the support of our customers; your support. If you like Phusion Passenger or if you want to enjoy the premium Enterprise features, please consider buying one or more Phusion Passenger Enterprise licenses. Development of the open source Phusion Passenger is directly funded by Phusion Passenger Enterprise sales.

Let me know your thoughts about this article in the comments section and thank you for your support!

Why Rails 4 Live Streaming is a big deal

By Hongli Lai on August 3rd, 2012

TLDR: Rails Live Streaming allows Rails to compete with Node.js in the streaming arena. Streaming requires application servers to support either multi-threaded or evented I/O. Most Ruby application servers are not up for the job. Phusion Passenger Enterprise 4.0 (a Ruby app server) is to become hybrid multi-processed, multi-threaded and evented. This allows seamless support for streaming, provides excellent backwards compatibility, and allows future support for more use cases than streaming alone.

Several days ago Rails introduced Live Streaming: the ability to send partial responses to the client immediately. This is a big deal because it opens up a huge number of use cases that Rails simply wasn’t suitable for. Rails was and still is an excellent choice for “traditional” web apps where the user sends a request and expects a full response back. It was a bad choice for anything that works with response streams, for example:

  • Progress responses that continuously inform the user about the progress. Imagine a web application that performs heavy calculations that can take several minutes. Before Live Streaming, you had to split this system up into multiple pages that must respond immediately. The main page would offload the actual work into a background worker, and return a response informing the user that the work is now in progress. The user must poll a status page at a regular interval to look up the progress of the work. With Live Streaming, you can not only simplify the code by streaming progress information in a single request, but also push progress information to the user much more quickly and without polling:
    def big_work
      work = WorkModel.new
      while !work.done?
        work.do_some_calculations
        response.stream.write "Progress: #{work.progress}%\n"
      end
      response.stream.close
    end
    
  • Chat servers. Or, more generally, web apps that involve a large number of mostly idle but persistent connections. Until today this has largely been the domain of evented systems such as Node.js and Erlang.

And as Aaron Patterson has already explained, this feature is different from Rails 3.2’s template streaming.

Just “possible” is not enough

The same functionality was actually already technically possible in Ruby. According to the Rack spec, Rack app objects must return a tuple:

[status_code, headers, body]

Here, body must respond to the each method. You can implement live streaming by yourself, with raw Rack, by returning a body object that yields partial responses in its each method.

class StreamingBody
  def each
    work = WorkModel.new
    while !work.done?
      work.do_some_calculations
      yield "Progress: #{work.progress}%\n"
    end
  end
end

Notice that the syntax is nearly identical to the Rails controller example code. With this, it is possible to implement anything.
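
For example, you could serve this body from a minimal config.ru. This is just a sketch: it assumes the WorkModel class above exists, and that the application server writes each yielded chunk to the client as it is produced instead of buffering the whole response (an evented server such as Thin, for instance).

# config.ru
app = lambda do |env|
  [200, { "Content-Type" => "text/plain" }, StreamingBody.new]
end
run app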

However, streaming in Ruby has never gained as much traction as it has in systems such as Node.js. The latter is much more popular for these kinds of use cases. I believe this inequality in popularity is caused by a few things:

  1. Awareness. Not everybody knew this was possible in Ruby. Indeed, it is not widely documented.
  2. Ease and support. Some realize this is possible, but choose not to use Ruby because many frameworks do not provide easy support for streaming. It was possible to stream responses in pre-4.0 Rails, but the framework code generally does not take streaming into account, so if you try to do anything fancy you run the risk of breaking things.

With Live Streaming, streaming is now easy to use as well as officially supported.

Can Rails compete with Node.js?

Node.js is gaining more and more momentum nowadays. As I see it there are several reasons for this:

  1. Love for JavaScript. Some developers prefer JavaScript over Ruby, for whatever reasons. Some like the idea of using the same language for both frontend and backend (although whether code can be easily shared between frontend and backend remains a controversial topic among developers). Others like the V8 engine for its speed. Indeed, V8 is a very well-optimized engine, much more so than Ruby 1.9’s YARV engine.
  2. Excellent support for high I/O concurrency use cases. Node.js is an evented I/O system, and evented systems can handle a massive amount of concurrent connections. All libraries in the Node.js ecosystem are designed for evented use cases, because there’s no other choice. In other languages you have to specifically look for evented libraries, so the signal-to-noise ratio is much lower.

I have to be careful here: the phrases “high I/O concurrency” and “massive amount of concurrent connections” deserve more explanation, because it’s easy to confuse them with “uber fast” or “massively scalable”. That is not what I meant. What I meant is: a single Node.js process is capable of handling a lot of client sockets, assuming that any work you perform does not saturate memory, CPU or bandwidth. In contrast, Ruby systems traditionally could only handle 1 concurrent request per process, even if you don’t do much work inside a request. We call this a multi-process I/O model because the amount of concurrent users (I/O) the system can handle scales only with the number of processes.

In traditional web apps that send back full responses, this is not a problem because the web server queues all requests, the processes respond as quickly as possible (usually saturating the CPU) and the web server buffers all responses and relieves the processes immediately. In streaming use cases, you have long-running requests so the aforementioned mechanism of letting the web server buffer responses is simply not going to work. You need more I/O concurrency: either you must have more processes, or processes must be able to handle more than 1 request simultaneously. Node.js processes can effectively handle an unlimited number of requests simultaneously, when not considering any constraints posed by CPU, memory or bandwidth.

Node.js is more than HTTP. It allows arbitrary networking with TCP and UDP. Rails is pretty much only for HTTP and even support for WebSockets is dubious, even in raw Rack. It cannot (and I believe, should not) compete with Node.js on everything, but still… Now suddenly, Rails can compete with Node.js for a large number of use cases.

Two sides of the coin

Reality is actually a bit more complicated than this. Although Rails can handle streaming responses now, not all Ruby application servers can. Ilya Grigorik described this problem in his article Rails Performance Needs an Overhaul and criticized Phusion Passenger, Mongrel and Unicorn for being purely multi-process, and thus not able to support high concurrency I/O use cases. (Side note: I believe the article’s title was poorly chosen; it criticizes I/O concurrency support, not performance.)

Mongrel’s current maintenance status appears to be in limbo. Unicorn is well-maintained, but its author Eric Wong has explicitly stated in his philosophy that Unicorn is to remain a purely multi-processed application server, with no intention to ever become multithreaded or evented. Unicorn is explicitly designed to handle fast responses only (so no streaming responses).

At the time Ilya Grigorik’s article was written, Thin was the only application server that was able to support high I/O concurrency use cases. Built on EventMachine, Thin is evented, just like Node.js. Since then, another evented application server called Goliath has appeared, also built on EventMachine. However, evented servers require evented application code, and Rails is clearly not evented.

There have been attempts to make serial-looking code evented through the use of Ruby 1.9 fibers, e.g. through the em-synchrony gem, but in my opinion fibers cause more problems than they solve. Ruby 1.8’s green threading model was essentially already like fibers: there was only one OS thread, and the Ruby green thread scheduler switched context upon encountering a blocking I/O operation. Fibers also operate within a single OS thread, but you can only context switch with explicit calls. In other words, you have to go through each and every blocking I/O operation you perform and insert fiber context switching logic, which Ruby 1.8 already did for you. Worse, fibers give the illusion of thread safety, while in reality you can run into the same concurrency problems as with threading. But this time, you cannot easily apply locks to prevent unwanted context switching. Unless the entire ecosystem is designed around fibers, I believe evented servers + fibers remain useful only for a small number of use cases where you have tight control over the application code environment.

There is another way to support high I/O concurrency though: multi-threading, with 1 thread per connection. Multi-threaded systems generally do not support as much concurrent I/O as evented systems, but are still quite formidable. Multi-threaded systems are limited by things such as the thread stack size, the available virtual memory address space and the quality of the kernel scheduler. But with the right tweaking they can approach the scalability of evented systems.

And so this leaves multithreaded servers as the only serious options for handling streaming support in Rails apps. It’s very easy to make Rails and most other apps work on them. Puma has recently appeared as a server in this category. Like most other Ruby application servers, you have to start Puma at least once for every web app, and each Puma instance is to be attached to a frontend web server in a reverse proxy setup. Because Ruby 1.9 has a Global Interpreter Lock, you should start more than 1 Puma process if you want to take advantage of multiple cores. Or you can use Rubinius, which does not have a Global Interpreter Lock.

And what about Phusion Passenger?

Towards a hybrid multi-processed, multi-threaded and evented application server

To recap, each I/O model – multi-process, multi-threaded, evented – has its own pros and drawbacks:

  • Multi-process
    • Pros:
      • Excellent application compatibility.
      • Lack of threading avoids concurrency bugs (e.g. race conditions) created by threading.
      • Simple and easy to understand. If one process crashes, the others are not affected.
      • Can utilize multiple cores.
    • Cons:
      • Supports very low I/O concurrency.
      • Uses a lot of memory.
  • Multi-threaded
    • Considerations:
      • Not as compatible as multi-process, although still quite good. Many libraries and frameworks support threaded environments these days. In web apps, it’s generally not too hard to make your own code thread-safe because web apps tend to be inherently embarrassingly parallel.
      • Can normally utilize multiple cores in a single process, but not in MRI Ruby. You can get around this by using JRuby or Rubinius.
    • Pros:
      • Supports high I/O concurrency.
      • Threads use less memory than processes.
    • Cons:
      • If a thread crashes, the entire process goes down.
      • Good luck debugging concurrency bugs.
  • Evented
    • Pros:
      • Extremely high I/O concurrency.
      • Uses even less memory than threads.
    • Cons:
      • Bad application compatibility. Most libraries are not designed for evented systems at all. Your application itself has to be aware of events for this to work properly.
      • If your app/libraries are evented, then you can still run into concurrency bugs like race conditions. It’s easier to avoid them in an evented system than in a threaded system, but when they do occur they are very difficult to debug.
      • Cannot utilize multiple cores in a single process.

As mentioned before, Phusion Passenger is currently a purely multi-processed application server. If we want to change its I/O model, which one should we choose? We believe the best answer is: all of them. We can give users a choice, and let them choose – on a per-application basis – which I/O model they want.

Phusion Passenger Enterprise 4.x (which we introduced earlier) is to become a hybrid multi-processed, multi-threaded and evented application server. You can choose with a single configuration option whether you want to stay with the traditional multi-processed I/O model, whether you want multiple threads in a single process, or whether you want processes to be evented. In the latter two cases, you even control how many processes you want, in order to take advantage of multiple cores and for resistance against crashes. We believe a combination of processes and threads/events is best.

Being a hybrid server with configurable I/O model allows Phusion Passenger to support more than just streaming. Suddenly, the possibilities become endless. We could for example support arbitrary TCP protocols in the future with no limits on traffic workloads.

Code has just landed in the Phusion Passenger Enterprise 4.0 branch to support multithreading. Note that the current Phusion Passenger Enterprise release is of the 3.0.x series and does not support this yet. As you can see in our roadmap, Phusion Passenger Enterprise 4.0 beta will follow 3.0.x very soon.

Subscribe to the Phusion newsletter and we’ll keep you up to date about the latest Phusion Passenger features, licensing updates and more.



Phusion welcomes Luuk Hendriks

By Luuk Hendriks on July 10th, 2012

For the past few weeks, Luuk Hendriks has been working with us as a system administrator. Having worked with him before on a shared project, we are confident that he will be a competent and valuable addition to our team. It is with much delight that we welcome him. In this blog post he will introduce himself to the world.

Enter neckbeard

For a couple of weeks now I, Luuk, have been working with the guys at Phusion. A while ago I met Hongli and Ninh at the University of Twente during a project that involved a lot of coding, and the vibe was good. They could use some help with operations and system administration at Phusion, so I stopped shaving to gain some nerd cred. From my work with Ruby and Rails it was no surprise that I already knew them: Ruby Enterprise Edition and Phusion Passenger are things you won’t easily miss. Working with these guys sounded like a great opportunity, so here I am.

In short, I’m a Telematics master’s student at the University of Twente. My main interests in this area are highly available, scalable systems in networking and the web, and everything that is needed to realise them. Languages I like to hack in include Ruby and Haskell, although I sometimes try to find inner peace on lower levels, doing some Z80 assembly on an old ZX Spectrum 48k (with varying success, to be very honest, but how bad-ass is that old nifty machine). As caffeine is part of most nerds’ daily toolkit, I feel obliged to show my weapon of choice:

Moka

Pure black, of course, with Arch Linux, AwesomeWM and zsh on the side completing my geek palate.

Besides my nerdy interests I spend all my free time collecting and listening to vinyl records (especially psych/kraut/free-jazz stuff), tinkering with hifi equipment (again, with varying success, but that’s part of the charm, right?), and playing the guitar. If Phusion ever stops its current operations and becomes an 80’s glamrock band, I’m ready.

But for now, Phusion will continue to do what you know them for, and more. The soon-to-be-released Union Station is one of the exciting things coming up, and it’s one of the projects I’m currently working on. When it comes out of beta, we’ll need a solid production cluster to support it, and we want to be able to replace or add servers without any significant downtime. This is where I come in, using the well-known Chef framework. Chef is a great way to get started on a project you’re new to: for the new guy (that would be me), it is hands-on experience with all the aspects of the project, in both the software and the hardware supporting it. While writing Chef recipes, all of these parts will show up sooner or later, so it is a nice way to get familiar with a running project. But it also brings a fresh look, a new perspective on an existing, maybe years-old, creation. Bi-winning.

Furthermore a new flavour of Phusion Passenger will be available shortly: Phusion Passenger Enterprise. Details will follow in the near future, but it’s going to come with many features that people have long asked for. It’s going to be released on July 24. Currently I’m stress testing an experimental branch, and things look promising. Be sure to keep an eye out for this one if you like your web apps served hot ‘n steamy.

All in all, there are a lot of exciting things at Phusion to keep me busy, with a nice and skilled team of people to support and guide me. Let’s see where things go from here. Be sure to get your BubbleConf tickets while the early bird reduction is still available, and meet me at the beautiful Tuschinski in October. Just spot the neckbeard.

XCode 4 ships a buggy compiler

By Hongli Lai on December 30th, 2011

You may have heard of LLVM, a compiler infrastructure library. You may have heard of GCC, the GNU C and C++ compiler. The two are completely separate software products, but there exists llvm-gcc, a GCC frontend that utilizes LLVM. All OS X versions up to and including Snow Leopard originally shipped with regular gcc, but as of Xcode 4 (which is also the default on OS X Lion) Apple has switched to llvm-gcc as its default compiler. “Hurray”, some people may think, having heard that LLVM generates better code and/or is faster than stock GCC. Unfortunately, the reality is not that great.

llvm-gcc is unmaintained and is considered deprecated by its developers. It has many bugs that will never be fixed. Unfortunately these are not obscure bugs that few people will notice: today, while working on Ruby Enterprise Edition, we encountered several! For example, calling alloca(0) should return a pointer to the top of the stack, but on llvm-gcc that code returns NULL, though only when compiled with -O2! llvm-gcc also generates subtly different code that causes the MBARI patch set to fail completely.

To make matters worse, consider this program:

#include <alloca.h>

int
main() {
    if (alloca(0) != NULL) {
        return 0;
    } else {
        return 1;
    }
}

Question: when compiled with “llvm-gcc -O2”, does this program exit with 0 or with 1?

That was a trick question. It always exits with 0, even when alloca(0) is broken and actually returns NULL. It turns out that the -O2 optimizer assumes alloca(0) never returns NULL (even though here it does) and replaces if (alloca(0) != NULL) with if (true).

It is unfortunate that we have to declare OS X’s default compiler broken. We do not recommend its use. Instead, we recommend using /usr/bin/gcc-4.2 (the non-LLVM GCC compiler) instead of /usr/bin/gcc (a symlink to /usr/bin/llvm-gcc). You can enable this on most build systems by setting these two environment variables:

export CC=gcc-4.2
export CXX=g++-4.2

On a side note, Ruby Enterprise Edition 2011.12 is being released. The official announcement hasn’t been written yet, but we’ve made haste because of the recent Ruby security vulnerability. You can already get the files on the Ruby Enterprise Edition website. An announcement will follow soon.

Update

Some people have pointed out that they consider alloca(0) to be undefined. I guess it depends on the way you interpret the documentation. It seems pretty clear to me what alloca(0) means, just like what memcpy(x, y, 0) means, but it’s understandable that some people would interpret it as undefined.
Still, alloca(0) failing is only one of the problems that prevented Ruby from compiling properly.

Why alloca(0)?

alloca(0) is used in Ruby Enterprise Edition’s conservative garbage collector as a way to detect the end of the stack. There’s a bunch of fallback #ifdefs in the code: on platforms that don’t have alloca(), it detects the end of the stack by calling a forced-non-inline function which returns the address of its sole local variable, but that assumes that the compiler supports the ‘noinline’ keyword. In any case, all of these variants depend on highly platform-specific behavior.

Rendering Rails 3.1 assets to string

By Hongli Lai on August 14th, 2011

The upcoming Rails 3.1 will come with a powerful asset pipeline, which is a framework that allows developers to preprocess, concatenate, minify and compress JavaScript and CSS files. JavaScript can be written in CoffeeScript, which is compiled on-the-fly to JavaScript. Similarly, CSS can be written in Sass or SCSS, which are compiled on-the-fly to plain old CSS. Many people have already written about this topic.

Today we ran into a situation in which we wanted to render an SCSS file to a string so that we could put it in a JSON document. The way to do this is non-obvious. First try:

render :template => 'assets/stylesheets/api.css'

Rails complained that it could not find this file in its search path, which only includes ‘app/views’.

Next try involved using the :file option because it also searches in Rails.root:

render :file => 'app/assets/stylesheets/api.css'

This time it successfully found the file, but couldn’t render it because there’s no :scss template engine. Apparently the asset pipeline does not integrate into the Rails template engine framework.

After some code digging, it turned out that asset serving is completely powered by Sprockets. The Rack middleware responsible for serving URLs under /assets is Sprockets::Server, which looks up asset information using a Sprockets::Environment object. This object also handles rendering. The Rails integration allows such an object to be accessed through the YourApp::Application.assets method.

So the correct way to render an asset to a string is as follows:

YourApp::Application.assets.find_asset('api.css').body

Here, YourApp is your application’s module name as found in config/application.rb.
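
For example, in a hypothetical controller action (the action and JSON key names below are purely illustrative):

def stylesheet
  css = YourApp::Application.assets.find_asset('api.css').body
  render :json => { :stylesheet => css }
end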

Launching unofficial, automatically updated Github mirror for the Nginx SVN repository

By Hongli Lai on August 9th, 2011

I probably don’t need to tell you what a great web server Nginx is. It’s lightweight, it’s fast, it’s scalable, and many people like it for its simple configuration file syntax and flexible features. It’s no surprise that Nginx is doing a great job serving as a core for Phusion Passenger Standalone (our Ruby/Rails web application server which supports Apache and Nginx).

Nginx’s development has traditionally been behind closed doors. Nginx is written for the most part by one brilliant man, Igor Sysoev, who accepts patches on the Nginx mailing list. But for a long time there was no source repository with which contributors and interested people could track the development process, and no official bug tracker. All of that changed in 2011: Igor has established Nginx as a company, the Nginx SVN repository is now open to the public, and a few days ago they even opened a Trac. In short: a lot of great news lately.

Announcing the Github Nginx mirror

These days a lot of people have switched from Subversion to Git. When you’ve worked with Git for a while, Subversion probably feels archaic and unproductive. The “killer app” for Git is Github, which provides an unbeatable collaboration experience. Indeed, many projects that have switched to Github reported a dramatic increase in contributions!

In order to stimulate Nginx development, we’ve launched a Github mirror of the Nginx SVN repository.

A few notes about this mirror:

  • It’s automatically updated twice a day.
  • It’s read-only. We don’t accept pull requests. Changes should be sent to the Nginx mailing list in the form of patches.
  • We intentionally don’t mirror all the branches and tags, only the most recent ones, because we don’t believe the old branches and tags are useful to anybody.
  • Please feel free to contact us if you have an issue with our mirror.

Happy developing!

Efficient substring searching

By Hongli Lai on December 6th, 2010

There are many times when programmers need to search for a substring, for example when parsing text. This is commonly referred to as searching for a needle (substring) in a haystack (the string to search in). The most straightforward way to do this is by using search functions that your language provides:

  • C: strchr()/memchr(), strstr()/memmem()
  • C++: string::find()
  • Ruby: String#index or regular expressions
  • Python: string.find() or regular expressions

However those functions are usually implemented in a naive way: they go through every index in the haystack and try to compare the substring at that index with the given needle. Thus, given a needle of length M and a haystack of length N, those algorithms need O(M*N) steps in the worst case. This is acceptable for small haystacks, but becomes more and more problematic as the haystack grows.
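
To make the cost concrete, here is the naive approach in a few lines of Ruby:

# Naive substring search: try every starting index and compare.
# Worst case O(M*N) character comparisons.
def naive_search(haystack, needle)
  (0..haystack.length - needle.length).each do |i|
    return i if haystack[i, needle.length] == needle
  end
  nil
end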

In this article we’ll examine smarter algorithms, in particular Boyer-Moore and its variants. Finally we’ll present the reader with an optimized Boyer-Moore-Horspool implementation in C++. This implementation differs from most others in that it’s heavily optimized and is capable of accepting haystack input in a streaming manner.

Before we move on, it should be noted that Python’s string.find(), Ruby regular expressions and glibc’s implementation of memmem() actually can use smarter algorithms when conditions are right, but that is beside the main point of this article.


(Figure: demonstration of a naive substring search algorithm.)

Smarter algorithms

Are there faster algorithms for searching needles? Why yes, computer science already found the answer decades ago. Here are two algorithms that can find needles in less than O(M*N) time:

  • Knuth-Morris-Pratt
  • Boyer-Moore

The ideas behind these algorithms are that, when a mismatch occurs during an iteration, you can sometimes intelligently skip several characters or you can start the next substring match at a higher position in order to avoid re-matching characters unnecessarily. Both algorithms are characterized by the need to precompute some preparation data. It is this data that allows them to search for needles in a way faster than O(M*N).

Knuth-Morris-Pratt simulates how a finite state automaton accepts input symbols. During its precomputation phase it builds this automaton, which requires O(M) time. Its main search algorithm, which runs the automaton, has a worst case time complexity of O(N).

Boyer-Moore is similar, but in its precomputation phase it needs to build two tables, a so-called good suffix table and a bad character table. These tables are built from the needle contents. Its main search algorithm requires 3*N steps in the worst case, thus having a worst case time complexity of O(N).

More on Boyer-Moore

Boyer-Moore is a very peculiar algorithm because a longer needle makes it faster! This counter-intuitive property is caused by the fact that Boyer-Moore matches the last needle character first and matches from right to left. If the last needle character doesn’t match the haystack, and the haystack character it was compared against occurs nowhere in the needle, then the algorithm can skip M characters at once. When one of the other needle characters mismatches, it can usually still skip more than 1 character based on the character which mismatched; this information is retrieved from the tables that were built in the precomputation phase. Thus, Boyer-Moore usually finishes in sublinear time.

Boyer-Moore has several variants.


  • Turbo Boyer-Moore takes more space but needs 2*N steps in the worst case instead of the original’s 3*N steps.
  • Boyer-Moore-Horspool uses less space (it only requires the bad character table) and has a simpler inner loop with less constant overhead per iteration. Its average performance is O(N) but its worst-case performance is O(M*N).
  • Boyer-Moore-Horspool itself also has variants. In Timo Raita’s 1992 paper “Tuning the Boyer-Moore-Horspool String Searching Algorithm” he asserts that text is rarely random and that there usually exist strong dependencies between characters. Thus there are smarter ways to match the needle: instead of matching it backwards from the last character, one should first match the last character, then the first character, then the middle one, etc. The advantage of this heuristic becomes obvious if we consider an example in which we search for the needle “hello world” in a haystack that often contains the word “world” in combination with a word other than “hello”, e.g. “beautiful world”, “my world”, “another world”. When the algorithm matches the last needle character “d”, instead of wasting time matching “worl” as well, it can match the first needle character “h” and immediately determine that there is no match. This heuristic can drastically improve Boyer-Moore-Horspool’s average performance.
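
To make the Horspool variant concrete, here is a minimal Ruby sketch (Ruby 1.9.3 or later; illustration only, without the Raita-style optimizations or the streaming support of the C++ implementation discussed below):

# Returns the index of the first occurrence of needle in haystack, or nil.
def horspool_search(haystack, needle)
  m = needle.bytesize
  n = haystack.bytesize
  return nil if m == 0 || m > n

  # Shift table: when the haystack byte aligned with the needle's last
  # position mismatches, slide the needle forward by this many bytes.
  # Bytes that don't occur in the needle allow a full shift of m.
  shift = Array.new(256, m)
  (0...m - 1).each do |i|
    shift[needle.getbyte(i)] = m - 1 - i
  end

  pos = 0
  while pos <= n - m
    # Compare the last byte first, then the rest.
    if haystack.getbyte(pos + m - 1) == needle.getbyte(m - 1) &&
       haystack.byteslice(pos, m) == needle
      return pos
    end
    pos += shift[haystack.getbyte(pos + m - 1)]
  end
  nil
end

horspool_search("beautiful world", "world")   # => 10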

Theory vs practice

Which algorithm should one use, Boyer-Moore, Turbo Boyer-Moore or Boyer-Moore-Horspool? In theory Turbo Boyer-Moore is an attractive alternative because of its worst-case performance, but real-world experience has shown that theory is rarely the same as practice. Clearly, this calls for benchmarks.

Various implementations of Boyer-Moore (and variants) exist in a variety of languages, but for the purpose of benchmarking we will want to use C or C++ because they allow the most optimal implementations. (Or assembly, but let’s not be too l33t today.) However there seem to be few good C/C++ implementations out there that are:

  1. Readable.
  2. Correct.
  3. Tested.
  4. Optimal.

After much digging I’ve found one C++ implementation that most matches our requirements. Its author, Joel Yliluoma, has agreed to open source his code under the MIT license, which I’m very grateful for. He had implemented Boyer-Moore, Turbo Boyer-Moore and Boyer-Moore-Horspool with some of the ideas by Timo Raita 1992. We’ve extended his work by adding unit tests, writing documentation and adding further optimizations. Using this extended code we’ve set up a benchmark suite. The results are as follows.

The benchmark is conducted on a MacBook Pro with OS X Snow Leopard and an Intel Core 2 Duo 2.4 GHz. For comparison we’ve included benchmark results for strstr() and memmem() as well. strstr() is omitted in benchmark 1 because it cannot handle arbitrary binary data. We’ve implemented memmem() ourselves because memmem() is a GNU extension that doesn’t exist on OS X. The test program is compiled with -O2 optimizations.

Benchmark 1: random data

This benchmark searches for the needle “I have control\n” in a 200 MB file containing random binary data, 10 iterations.

Boyer-Moore 1191 ms
Boyer-Moore-Horspool (Joel’s original implementation) 787 ms
Boyer-Moore-Horspool (our implementation) 729 ms
Turbo Boyer-Moore 1745 ms
memmem 2956 ms

As expected, memmem() is the slowest of the bunch, being more than twice as slow as Boyer-Moore, but actually doesn’t perform *that* badly.
Turbo Boyer-Moore, despite its name, performs poorly compared to the original.
Boyer-Moore-Horspool is significantly faster than everything else. Our optimized implementation is slightly faster than Joel’s original.

Benchmark 2: Alice in Wonderland

This benchmark searches for the needle “I have control\n” in a special compilation of Alice in Wonderland. We took the HTML version from Project Gutenberg and multiplied its contents many times in order to generate a 200 MB Alice in Wonderland HTML file. This benchmark ran for 10 iterations. This is probably a much better real-world test than the random binary blob in benchmark 1.

Boyer-Moore 1734 ms
Boyer-Moore-Horspool (Joel’s original implementation) 1101 ms
Boyer-Moore-Horspool (our implementation) 1080 ms
Turbo Boyer-Moore 2683 ms
strstr 2116 ms
memmem 3047 ms

The results are very similar to those of benchmark 1. Again, strstr() and memmem() are the slowest of the bunch, and our implementation of Boyer-Moore-Horspool the fastest.

Benchmark 3: triggering bad performance

In this benchmark we attempt to trigger bad performance in Boyer-Moore-Horspool. We created a 200 MB file containing nothing but newlines, in which we look for the needle “I have control\n\n”. Notice the extra newline at the end of the needle! It causes the algorithm’s skips to become very small. The benchmark ran for 10 iterations.

Boyer-Moore 1515 ms
Boyer-Moore-Horspool (Joel’s original implementation) 12425 ms
Boyer-Moore-Horspool (our implementation) 9098 ms
Turbo Boyer-Moore 2281 ms
strstr 1820 ms
memmem 2744 ms

Here, Boyer-Moore-Horspool is significantly worse than even memmem() despite the same theoretical worst-case complexity! This is because memmem() has a much simpler inner loop with lower constant overhead. We also see that our optimizations to Boyer-Moore-Horspool really help.

Conclusion

Different substring searching algorithms have significantly different performance characteristics. For small haystacks it doesn’t really matter which one you use, but for larger ones you should definitely consider smarter algorithms like Boyer-Moore. Using these smarter algorithms is not without cost however: because they require preparation (creating an FSA or building tables) you must ask yourself whether the cost of preparation is acceptable for your use cases.

Turbo Boyer-Moore is disappointing, its name doesn’t do it justice. In academia constant overhead doesn’t matter, but here we see that it matters a lot in practice. Turbo Boyer-Moore’s inner loop is so complex that we think we’re better off using the original Boyer-Moore.

We think that for most cases Boyer-Moore-Horspool is the best choice. Despite its theoretical worst-case performance, its worst case is seldom triggered; we had to spend effort on generating a bad input file and picking a bad needle on purpose. Its inner loop is simple and easy to understand. Its simplicity also results in better CPU cache utilization and branching behavior. On average Boyer-Moore-Horspool performs so well that for us, the choice is clear. Boyer-Moore-Horspool is an excellent choice for things like parsing protocol headers and multipart MIME data. In the case of multipart MIME data, most browsers generate fairly long boundary strings, allowing Boyer-Moore-Horspool to skip over non-boundary data very quickly.

The code

Most string search algorithm implementations require the entire haystack to be in memory. In contrast, we’ve written a “streaming” Boyer-Moore-Horspool implementation which allows one to feed the haystack data piece by piece. This is especially useful when parsing large amounts of protocol data received over the network without keeping all data in memory.

Our implementation is heavily optimized for memory usage, CPU utilization and reusability. During the development of this implementation we’ve applied the highest degree of C++ kung-fu in order to get every last bit of efficiency out of the computer.

  • It’s reentrant. No global variables. Instead the algorithm accepts a needle-specific context structure.
  • All data structures are optimized for size and locality of reference. Smaller size means less memory usage, which in turn means less CPU cache usage. On modern systems CPU cache utilization is one of the most important performance factors.
  • The context structure can be allocated in a single memory allocation, regardless of the length of the needle. This minimizes memory allocation overhead, both in time (e.g. by allocating on the stack) and in space (by avoiding malloc() bookkeeping overhead). The algorithm does not need any more memory allocations than this initial one. The code also does not dictate a specific way to allocate the context structure; the user is free to choose where to allocate it.
  • We’ve also applied some of the ideas by Timo Raita 1992, but our implementation differs from Joel’s original implementation through minimizing function call overhead. Joel’s implementation calls memcmp(); ours first compares the first character using inline code.
  • Our code contains various CPU branch prediction hints in order to minimize branching overhead.
  • We make use of the ‘restrict’ keyword in order to allow better compiler optimizations.

Making Ruby threadable: properly handling context switching in native extensions

By Hongli Lai on June 10th, 2010

In the previous article, Does Rails Performance Need an Overhaul?, we discussed the fact that proper Ruby threading is hindered by various broken native extensions. Writing a native extension for Ruby is pretty easy; however, writing it right is not only difficult, but also an obscure practice that requires l33t sk1llz because of the lack of documentation in this area. We’ve written several native extensions so far, and in the process of figuring out how to make threading-friendly native extensions we had to wade through tons of Ruby source code. In this article I want to teach some best practices for writing threading-friendly native extensions.

Threading basics

As discussed in the previous article, Ruby 1.8 implements userspace threads, meaning that no matter how many Ruby threads you have, only one can run at a time, and only on a single CPU core. The threads are scheduled by Ruby itself, not by the operating system.

Ruby 1.9 implements native operating system threads. However it has a global interpreter lock which must be locked when the thread is running Ruby code. This effectively makes Ruby 1.9 single threaded most of the time.

With both Ruby 1.8 and 1.9 threads, system calls such as I/O operations can block the thread and prevent Ruby from context switching to another one. Thus, system calls require special attention. Expensive calculations that do not involve system calls can also block the thread, but something can be done about those as well, as you will read later on.

Handling I/O

Suppose that you have a file descriptor on which you want to perform some potentially blocking I/O. The naive approach is to perform the I/O command anyway and risk blocking the entire Ruby process. This is exactly what makes the mysql extension thread-unfriendly: while waiting on MySQL no other threads can run, grinding your multi-threaded Rails web app to a halt.

However there are a number of functions in your arsenal that you can use to combat this problem. And as a general rule, you should set file descriptors to non-blocking mode.

rb_thread_wait_fd(fd)

Just before performing a blocking read, you should call rb_thread_wait_fd() on the file descriptor that you’re reading from. On 1.8, this function marks the current thread as waiting for readable data on this file descriptor and then invokes the scheduler. The scheduler uses the select() system call to check which file descriptors are readable and then selects a thread which may continue. If the file descriptor that you were waiting on is not readable, then your thread will be suspended until the next time the scheduler is invoked and selects your thread. But even if the file descriptor is immediately readable, the scheduler does not guarantee that your thread will be selected immediately.

On 1.9, rb_thread_wait_fd() simply unlocks the global interpreter lock, calls the select() system call on the given file descriptor, and re-acquires the global interpreter lock when select() returns. While select() is blocking, other threads can run.

As an optimization, if only the main thread exists then this function does nothing. This applies to both 1.8 and 1.9.

rb_thread_fd_writable(fd)

This works the same way as rb_thread_wait_fd(), but waits until the given file descriptor becomes writable. The single-thread optimization applies here too. You should call rb_thread_fd_writable() just before you perform a write I/O operation.

rb_thread_select()

To wait on multiple file descriptors, use this function instead of select() or poll(). Unlike the native system calls, this function will take care of invoking the scheduler or unlocking the global interpreter lock. Unlike rb_thread_wait_fd() and rb_thread_fd_writable(), there is no do-nothing-when-there’s-only-one-thread optimization here so it will always invoke the scheduler and call select().
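For example, here is a minimal sketch that waits until either of two file descriptors becomes readable (fd_a and fd_b are hypothetical names); rb_thread_select() takes the same arguments as select():

fd_set readfds;
int max_fd = (fd_a > fd_b) ? fd_a : fd_b;

FD_ZERO(&readfds);
FD_SET(fd_a, &readfds);
FD_SET(fd_b, &readfds);

/* Only the current Ruby thread is suspended while we wait; a NULL
 * timeout means we wait indefinitely. */
if (rb_thread_select(max_fd + 1, &readfds, NULL, NULL, NULL) < 0) {
    ...throw an exception here...
}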

rb_io_wait_readable()

I/O system calls can return a variety of error codes that indicate that you should restart the system call, such as EINTR (system call interrupted by signal) and EAGAIN (the file descriptor is set to non-blocking mode and the data is not yet available). You should therefore always invoke I/O system calls in a loop until they return success or a different error code. You must however not forget to call rb_thread_wait_fd() or rb_thread_select() before you restart the system call, or you will risk blocking the thread again.

Ruby provides a function rb_io_wait_readable() to aid you in writing restart code. This function should be called right after your I/O reading system call has returned. It checks whether the system call should be restarted (returning Qtrue) or whether you should report an error (returning Qfalse). Here’s a code example:

int done = 0;
int ret;

/* Have the Ruby scheduler suspend this thread until the file descriptor becomes
 * readable; or if this is the only thread in the system, rb_thread_wait_fd() does
 * nothing and we immediately continue to the 'do' loop.
 */
rb_thread_wait_fd(fd);
do {
    /* Actually you should surround your system call with some more code, but
     * we'll get to this later. This example code is only partial. */
    ret = ...your read system call here...
    if (ret == -1) {
        if (rb_io_wait_readable(fd) == Qfalse) {
            ...throw an exception here...
        } /* else restart loop */
    } else {
        done = 1;
    }
} while (!done);

rb_io_wait_readable() checks whether errno equals EINTR or ERESTART, in which case it calls rb_thread_wait_fd() on the file descriptor and returns Qtrue. If errno is EAGAIN or EWOULDBLOCK then it calls rb_thread_select() on the file descriptor and returns Qtrue. Otherwise it returns Qfalse.

The difference between calling rb_thread_wait_fd() and rb_thread_select() here is subtle, but important. The former only blocks (calls select() on the file descriptor) when there are multiple Ruby threads in the Ruby process, while the latter always blocks, no matter what. This behavior is important because EAGAIN and EWOULDBLOCK occur when a non-blocking file descriptor is not yet readable; if we didn’t block on a select() here, the code would enter a 100% CPU busy loop.

rb_io_wait_writable()

Works the same way as rb_io_wait_readable(). Use this for I/O write operations instead.
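As an illustration, here is the write-side counterpart of the earlier read example. This is a sketch with the same caveats as before:

int done = 0;
ssize_t ret;

/* Does nothing if this is the only thread in the system. */
rb_thread_fd_writable(fd);
do {
    ret = ...your write system call here...
    if (ret == -1) {
        if (rb_io_wait_writable(fd) == Qfalse) {
            ...throw an exception here...
        } /* else restart loop */
    } else {
        done = 1;
    }
} while (!done);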

Sleeping

Use rb_thread_wait_for() instead of sleep() or usleep(). On 1.8 rb_thread_wait_for() marks the current thread as sleeping for a period of time and then invokes the scheduler, which does not select this thread until the period of time has expired. On 1.9 Ruby unlocks the global interpreter lock, calls some sleeping function, and then re-locks it after that function returns.
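For example, here is a minimal sketch that sleeps for 1.5 seconds without blocking other Ruby threads. rb_thread_wait_for() takes a struct timeval by value:

struct timeval tv;

tv.tv_sec = 1;
tv.tv_usec = 500000;
rb_thread_wait_for(tv);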

Other non-I/O blocking system calls

Sometimes you will want to wait on a blocking system call that isn’t related to I/O, such as waitpid(). There are several ways to deal with this kind of system call.

Blocking outside the global interpreter lock

This method only works on Ruby 1.9. Unlock the global interpreter lock, do your thing, then re-lock it. Dealing with the global interpreter lock is discussed later in this article.

Non-blocking polling

Some system calls have non-blocking equivalents which return immediately with a special result instead of blocking. For example waitpid() blocks by default, but it can be made non-blocking by passing the WNOHANG flag, which causes it to return immediately when the child process hasn’t exited yet. You must call the non-blocking version in a loop; upon detecting a would-block result, you must call rb_thread_polling(). On 1.8 this function lets the scheduler put the current thread to sleep for 60 msec, on 1.9 for 100 msec.

For example, Ruby’s Process#waitpid function does not block other threads. On 1.9 it simply unlocks the global interpreter lock while blocking on waitpid(). On 1.8 it is implemented as follows (simplified version):

int result;

retry:
result = waitpid(..., WNOHANG);
if (result == 0) {
    /* Process isn't ready yet. Tell the scheduler and then restart
     * the call. */
    rb_thread_polling();
    goto retry;
} else if (result < 0) {
    if (errno == EINTR) {
        /* Interrupted by a signal; restart the call right away. */
        goto retry;
    } else {
        ...throw exception...
    }
}

The actual code is more elaborate than this. For example, if there's only a single thread in the system then it calls waitpid() without WNOHANG and just lets it block.

Calling the system call in a native OS thread and using I/O to report results

This is probably the most complex way, but on 1.8 you sometimes don't have any choice. On 1.9 you should always prefer unlocking the global interpreter lock over this method.

Create a pipe, then spawn a native OS thread which performs the system call. When the system call is done, have the native thread report the result back via the pipe. On the Ruby side, use rb_thread_wait_fd() and friends to block on the pipe and then receive the result. Be sure to join the thread after you've read the result: rb_thread_wait_fd() does not necessarily block until there is data, so when it returns it is not guaranteed that the thread has finished yet. A complete sketch of this technique is given at the end of this section.

Another thing to watch out for is that your native thread must not refer to data on the Ruby thread's stack. This is because Ruby 1.8 overwrites the main OS thread's C stack upon context switching to another Ruby thread. For example, code like this is not OK:

static void thread_main(int *value) {
    /* 'value' here refers to the 'value' variable on foobar's stack, but
     * that data is overwritten when Ruby context switches, so we
     * really can't use 'value' here!
     */
}

/* Native extension Ruby method. */
static void foobar() {
    int value = 1234;
    
    thread_t thread = create_a_thread(thread_main, &value);
    ...do something which can cause a Ruby thread context switch...
    join_thread(thread);
}

To pass data to the thread, you should put the data on the heap instead of the stack. This is OK:

typedef struct {
    ...
} Data;

static void thread_main(Data *data) {
    /* 'data' is safe to access. */
}

/* Native extension Ruby method. */
static void foobar() {
    Data *data = malloc(sizeof(Data));
    thread_t thread = create_a_thread(thread_main, data);
    ...do something which can cause a Ruby thread context switch...
    join_thread(thread);
    free(data);
}
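Putting the pieces together, here is a hedged sketch of the whole pipe technique using POSIX threads. perform_blocking_call() is a hypothetical stand-in for whatever blocking system call you want to run; error handling is omitted for brevity:

#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>

typedef struct {
    int write_fd;
    int result;
} Job;

static void *job_main(void *arg) {
    Job *job = (Job *) arg;
    job->result = perform_blocking_call();
    /* Wake up the Ruby thread by writing a byte to the pipe. */
    write(job->write_fd, "x", 1);
    return NULL;
}

/* Native extension Ruby method. */
static VALUE foobar(VALUE self) {
    int fds[2];
    char buf;
    int result;
    pthread_t thread;
    /* Heap-allocated, for the reason explained above. */
    Job *job = malloc(sizeof(Job));

    pipe(fds);
    job->write_fd = fds[1];
    pthread_create(&thread, NULL, job_main, job);

    /* Only this Ruby thread is suspended; the others keep running. */
    rb_thread_wait_fd(fds[0]);
    read(fds[0], &buf, 1);

    /* rb_thread_wait_fd() may return before the native thread has fully
     * finished, so join before touching the result. */
    pthread_join(thread, NULL);

    result = job->result;
    free(job);
    close(fds[0]);
    close(fds[1]);
    return INT2NUM(result);
}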

Heavy CPU computations

Blocking system calls are not the only thing that can block other threads: CPU-heavy computation code can do so too. While non-Ruby-API C code is executing, context switching to other threads is not possible. Calls to Ruby APIs may sometimes cause context switching. However, there are several ways to make context switching possible while running CPU-heavy computations.

Unlocking the global interpreter lock

This only works on 1.9. Unlock the global interpreter lock, then call the computation code, and re-lock when done. Consider BCrypt-Ruby as an example. BCrypt is a very heavy hashing algorithm used for securely hashing passwords; depending on the configured cost it could take several minutes to calculate a hash. We've recently patched BCrypt-Ruby to unlock the global interpreter lock while running the BCrypt algorithm, so that when you run BCrypt-Ruby in multiple threads the work can be spread across multiple CPU cores.

However, be aware that unlocking and relocking the global interpreter lock comes with some overhead of its own. It is only worth it if you know that the computation is going to take a while (say, longer than 50 msec). If the computation time is short then you will actually make your code slower because of all the locking overhead. This is why BCrypt-Ruby only unlocks the global interpreter lock if the BCrypt cost is set to 9 or higher.
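A hypothetical sketch of that trade-off, expressed with the rb_thread_blocking_region() function that is discussed later in this article (bcrypt_without_gil() and job are placeholder names):

if (cost >= 9) {
    /* Long computation: worth paying the unlock/relock overhead. */
    rb_thread_blocking_region(bcrypt_without_gil, &job, RUBY_UBF_IO, NULL);
} else {
    /* Short computation: just run it while holding the lock. */
    bcrypt_without_gil(&job);
}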

Explicit yielding

You can call rb_thread_schedule() once in a while to force a context switch to another thread. However, this approach does not allow your code to make use of multiple cores, even if you're on 1.9.
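For example, here is a sketch that yields every 10,000 iterations; process_item() is a hypothetical unit of work:

long i;

for (i = 0; i < num_items; i++) {
    process_item(i);
    if (i % 10000 == 0) {
        /* Give other Ruby threads a chance to run. */
        rb_thread_schedule();
    }
}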

Running the C code in a native OS thread

This is pretty much the same approach as described by "Calling the system call in a native OS thread and use I/O to report results". In my opinion, unless your computation takes a very long time, implementing this is almost never worth the trouble. For BCrypt-Ruby we didn't bother: if you want multi-core support in BCrypt-Ruby you need to be on 1.9.

TRAP_BEG/TRAP_END and the global interpreter lock

TRAP_BEG and TRAP_END

On 1.8, you should surround system calls with calls to TRAP_BEG and TRAP_END. TRAP_BEG performs some preparation work. TRAP_END performs a variety of things:

  1. It checks whether there are any pending signals, e.g. whether the user pressed Ctrl-C. If so it will raise an appropriate SignalException.
  2. It also calls the scheduler if a certain amount of time has been spent on the current thread.
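As a sketch, this is the "some more code" that the earlier rb_io_wait_readable() example omitted. On 1.8, a read system call would be wrapped like this (buf and buf_size are hypothetical):

ssize_t ret;

rb_thread_wait_fd(fd);
TRAP_BEG;
ret = read(fd, buf, buf_size);
TRAP_END;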

On 1.9 TRAP_BEG and TRAP_END are macros that unlock and lock the global interpreter lock. However these macros are deprecated and are likely to disappear in the future so you should not use them on 1.9. Instead, you should use rb_thread_blocking_region().

On 1.9 TRAP_BEG and TRAP_END are defined in ruby/backward/rubysig.h.

rb_thread_blocking_region()

This is a 1.9-specific function which allows you to call a function outside the global interpreter lock. Its declaration is as follows:

VALUE rb_thread_blocking_region(rb_blocking_function_t *func, void *data1,
                                rb_unblock_function_t *ubf, void *data2);

func is a pointer to a function that is to be called outside the global interpreter lock. This function must look similar to:

VALUE foobar(void *data)

Whatever you pass as the data1 parameter is passed as the argument to func.

ubf is either RUBY_UBF_IO (indicating that you're performing some kind of I/O operation) or RUBY_UBF_PROCESS (indicating that you're calling some kind of process management system call). However I'm not sure what this parameter exactly does. data2 is supposedly passed to ubf when it's called.

The return value of this function is the return value of func.
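Here is a hedged sketch of a complete call. hash_block() is a hypothetical CPU-heavy function; the wrapper runs it outside the global interpreter lock so that other Ruby threads can run (possibly on other cores) in the meantime:

typedef struct {
    const char *data;
    long len;
    unsigned long result;
} HashJob;

/* Runs without the global interpreter lock: must not call any Ruby APIs! */
static VALUE hash_without_gil(void *arg) {
    HashJob *job = (HashJob *) arg;
    job->result = hash_block(job->data, job->len);
    return Qnil;
}

/* Native extension Ruby method. */
static VALUE my_ext_hash(VALUE self, VALUE str) {
    HashJob job;

    job.data = RSTRING_PTR(str);
    job.len = RSTRING_LEN(str);
    rb_thread_blocking_region(hash_without_gil, &job, RUBY_UBF_IO, NULL);
    return ULONG2NUM(job.result);
}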

Global interpreter lock caveats

Do not call any Ruby API functions while the global interpreter lock is unlocked! No rb_yield(), no rb_str_new(), nothing. The entire Ruby API is only safe to call while the global interpreter lock is held.

Securely store passwords with bcrypt-ruby; now compatible with JRuby and Ruby 1.9

By Hongli Lai on August 13th, 2009

When writing web applications, or any application for that matter, passwords should be stored securely. As a rule of thumb, one should never store passwords as clear text in the database, for the following reasons:

  • If the database ever gets leaked out, then all accounts are compromised until every single user resets his password. Imagine that you’re an MMORPG developer; leaking out the database with clear text passwords allows the attacker to delete every player’s characters.
  • Many people use the same password for multiple sites. Imagine that the password stored in your database is also used for the user’s online banking account. Even if the database does not get leaked out, the password is still visible to the system administrator; this can be a privacy breach.

There are several “obvious” alternatives, which aren’t quite secure enough:

Storing passwords as MD5/SHA1/$FAVORITE_ALGORITHM hashes
These days MD5 can be brute-force cracked with relatively little effort. SHA1, SHA2 and other algorithms are harder to brute-force, but the attacker can still crack these hashes by using rainbow tables: precomputed tables of hashes with which the attacker can look up the input for a hash with relative ease. This rainbow table does not have to be very large: it just has to contain words from the dictionary, because many people use dictionary words as passwords.

Using plain hashes also makes it possible for an attacker to determine whether two users have the same password.

Encrypting the password
This is not a good idea because if the attacker is able to steal the database, then there’s a possibility that he’s able to steal the key file as well. Plus, the system administrator can read everybody’s passwords, unless his access to either the key file or the database is restricted.

The solution is to store passwords as salted hashes. One calculates a salted hash as follows:

salted_hash = hashing_algorithm(salt + cleartext_password)

Here, salt is a random string. After calculating the salted hash, one should store the salted hash in the database, along with the (cleartext) salt. It is not necessary to keep the salt secret or to obfuscate it.

When a user logs in, one can verify his password by re-computing the salted hash and comparing it with the salted hash in the database:

salted_hash = hashing_algorithm(salt_from_database + user_provided_password)
if (salted_hash == salted_hash_from_database):
    user is logged in
else:
    password incorrect

The use of a salt forces the attacker to either brute-force the hash or use a ridiculously large rainbow table. In the latter case, the sheer size of the required rainbow table can make it impractical to generate. The larger the salt, the harder it becomes for the attacker to use rainbow tables.

However, even with salting, one should still not use SHA1, SHA2, Whirlpool or most other hashing algorithms, because these algorithms are designed to be fast. Although brute-forcing SHA2 and Whirlpool is hard, it’s still possible given sufficient resources. Instead, one should pick a hashing algorithm that’s designed to be slow, so that brute forcing becomes infeasible. Bcrypt is such a slow hashing algorithm. A speed comparison on a MacBook Pro with a 2 GHz Intel Core 2 Duo:

  • SHA-1: 118600 hashes per second.
  • Bcrypt (with cost = 10): 7.7 hashes per second.

Theoretically it would take 4*10^35 years for a single MacBook Pro core to crack an SHA-1 hash, assuming that the attacker cannot exploit any weaknesses in SHA-1. To crack a bcrypt hash one would need 6*10^39 years, or roughly 15,000 times as long. Therefore, we recommend the use of bcrypt to store passwords securely.

There’s even a nice Ruby implementation of this algorithm: bcrypt-ruby! Up until recently, bcrypt-ruby was only available for MRI (“Matz Ruby Interpreter”, the C implementation that most people use), but we’ve made it compatible with JRuby! The code can be found in our fork on GitHub. The current version also has issues with Ruby 1.9, which we’ve fixed as well. The author of bcrypt-ruby has already accepted our changes and will soon release a new version with JRuby and Ruby 1.9 support.

Further recommended reading

How to Safely Store a Password by Coda Hale.