The new Rack socket hijacking API
Yesterday saw the release of Rack 1.5.0, which adds a new feature to the Rack specification dubbed socket hijacking. This feature allows applications to take over the client socket and perform arbitrary operations on it, for example implementing WebSockets or streaming data to the client.
Did Rack not support streaming? Actually, it did: you could stream by returning a body object whose #each method yields the body chunks, as explained in our past article Why Rails 4 Live Streaming is a Big Deal. But that API is a bit clunky. The socket hijacking API instead gives you an object with a Ruby IO-like interface.
Yesterday we also added support for socket hijacking to Phusion Passenger 4. The upcoming Phusion Passenger 4 has been covered here, here and here. Phusion Passenger Enterprise customers can already test and enjoy a preview of this feature by downloading the “3.9.2 beta preview (4.0.0 beta 2)” file from the Customer Area.
The socket hijacking API was surprisingly easy to implement, but unfortunately poorly documented at this time. The application-level API is not immediately obvious, and the Rack specification documentation has not yet been updated to cover the hijacking API. In this article we’ll introduce the API and provide an example program.
What the socket hijacking API is not
Some of you may have heard of efforts to develop a “Rack 2.0” specification which properly covers things such as streaming and evented servers. According to the hijacking API developer, this API is not an attempt towards Rack 2.0. It is a “good enough” solution that works within the confines of the Rack 1.x specification. Things may change in Rack 2.0, though at this time it’s unclear how much progress has been made towards Rack 2.0.
It is also unclear whether the API is supposed to be final or not. While implementing this API and writing this article we’ve discovered some room for improvement. The suggestions (which you can find later in this article) have been submitted to the developers.
Overview of the API
The hijacking API provides two modes:
- A full hijacking API, which gives the application complete control over what goes over the socket. In this mode, the application server doesn’t send anything over the socket, and lets the application take care of it. This mode is useful if you want to implement arbitrary (even non-HTTP) protocols over the socket. This is subject to limitations: if your application is behind a web server or an HTTP load balancer then those components dictate which protocols you can implement.
- A partial hijacking API, which gives the application control over the socket after the application server has already sent out headers. This mode is mostly useful for streaming.
The hijacking API is accessible through the Rack env hash. You can check whether the application server supports the hijacking API by checking env['rack.hijack?'], which returns a boolean value.
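A minimal sketch of feature detection, assuming only that a Rack app is a lambda taking the env hash: the app checks env['rack.hijack?'] and falls back to an ordinary response on servers without hijacking support. The response texts and the hand-built env hashes used to exercise it are illustrative, not part of any server's API.

```ruby
# A Rack app is any object that responds to #call(env) and returns
# [status, headers, body]. This one only reports whether hijacking is available.
app = lambda do |env|
  if env['rack.hijack?']
    [200, { 'Content-Type' => 'text/plain' }, ["socket hijacking supported\n"]]
  else
    # Fall back to an ordinary response on servers without the hijacking API.
    [200, { 'Content-Type' => 'text/plain' }, ["socket hijacking not supported\n"]]
  end
end

# Exercise the app directly with hand-built env hashes; no server is needed.
_status, _headers, body = app.call({ 'rack.hijack?' => true })
puts body.join
_status, _headers, body = app.call({})
puts body.join
```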
Full hijacking
You can perform a full hijack by calling env['rack.hijack'].call. You can access the hijacked socket object through env['rack.hijack_io']. Phusion Passenger’s implementation of env['rack.hijack'] returns the socket object, but it is unclear whether this is supposed to be standard behavior.
You are responsible for:
- Outputting any HTTP headers, if applicable.
- Closing the IO object when you no longer need it.
You should output the “Connection: close” header unless you plan on implementing HTTP keep-alive yourself.
Here’s an example of the full hijacking API in action:
# encoding: utf-8
require 'thread'
# Streams the response "Line 1!" .. "Line 10!", with
# 1 second sleep time between each line.
#
# Non-Phusion Passenger users may have to turn off their
# web servers' buffering options for streaming to work.
# Phusion Passenger 4 users don't have to do anything, it
# works out-of-the-box thanks to our real-time response
# buffering feature.
app = lambda do |env|
  # Fully hijack the client socket.
  env['rack.hijack'].call
  io = env['rack.hijack_io']
  begin
    # Phusion Passenger accepts a CGI-style "Status" header here.
    # Servers that hand over the raw client socket expect a real
    # HTTP status line such as "HTTP/1.1 200 OK\r\n" instead.
    io.write("Status: 200\r\n")
    io.write("Connection: close\r\n")
    io.write("Content-Type: text/plain\r\n")
    io.write("\r\n")
    10.times do |i|
      io.write("Line #{i + 1}!\n")
      io.flush
      sleep 1
    end
  ensure
    # We own the socket now, so we must close it ourselves.
    io.close
  end
end
run app
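To show what the server side of a full hijack involves, here is a toy sketch; the `serve_with_full_hijack` helper is hypothetical and not taken from Phusion Passenger or any other real server. It exposes `rack.hijack`, records the socket in `rack.hijack_io` when the app calls it, and then keeps its hands off the socket. A `StringIO` stands in for the client socket so the sketch runs without a network.

```ruby
require 'stringio'

# Hypothetical toy server routine; real servers do considerably more.
def serve_with_full_hijack(app, socket)
  env = { 'rack.hijack?' => true }
  # Calling env['rack.hijack'] hands the socket to the app by storing it in
  # env['rack.hijack_io']. Like Phusion Passenger, this sketch also returns it.
  env['rack.hijack'] = lambda { env['rack.hijack_io'] = socket }

  response = app.call(env)
  # After a full hijack the server must not write to or close the socket:
  # the app owns it now, and the returned response is ignored.
  return :hijacked if env.key?('rack.hijack_io')

  # No hijack: a real server would now write `response` to the socket itself.
  response
end

hijacking_app = lambda do |env|
  env['rack.hijack'].call
  io = env['rack.hijack_io']
  begin
    io.write("Status: 200\r\nConnection: close\r\nContent-Type: text/plain\r\n\r\n")
    io.write("hello\n")
  ensure
    io.close
  end
end

socket = StringIO.new
p serve_with_full_hijack(hijacking_app, socket)
p socket.string.end_with?("hello\n")
```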
Partial hijacking
You can perform a partial hijack by assigning a lambda to the rack.hijack response header. This lambda will be called after the application server has sent out headers. The application server will ignore the body part of the Rack response, and will call the ‘rack.hijack’ lambda, passing it the client socket. You are responsible for closing the socket when it’s no longer needed.
It is unclear what the value of the Rack response body should be. Phusion Passenger’s implementation doesn’t care: you can return a two-element array, or a three-element array where the body can be anything. If the ‘rack.hijack’ response header is set, the body will be completely ignored.
Example:
# encoding: utf-8
require 'thread'
# Streams the response "Line 1!" .. "Line 10!", with
# 1 second sleep time between each line.
#
# Non-Phusion Passenger users may have to turn off their
# web servers' buffering options for streaming to work.
# Phusion Passenger 4 users don't have to do anything, it
# works out-of-the-box thanks to our real-time response
# buffering feature.
app = lambda do |env|
  response_headers = {}
  response_headers["Content-Type"] = "text/plain"
  response_headers["rack.hijack"] = lambda do |io|
    # This lambda will be called after the app server has outputted
    # headers. Here we can output body data at will.
    begin
      10.times do |i|
        io.write("Line #{i + 1}!\n")
        io.flush
        sleep 1
      end
    ensure
      io.close
    end
  end
  # The body is ignored because the rack.hijack response header is set.
  [200, response_headers, nil]
end
run app
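A toy sketch of how a server might honor a partial hijack, following the behavior described above: write the status line and headers, skip the body, then call the rack.hijack lambda with the socket. The `serve_with_partial_hijack` helper and its `REASON_PHRASES` table are hypothetical, and a `StringIO` again stands in for the client socket.

```ruby
require 'stringio'

# Minimal reason-phrase table for the toy status line.
REASON_PHRASES = { 200 => 'OK', 404 => 'Not Found', 500 => 'Internal Server Error' }.freeze

# Hypothetical toy server routine for a partial hijack, not taken from any real server.
def serve_with_partial_hijack(app, socket)
  status, headers, body = app.call({ 'rack.hijack?' => true })
  headers = headers.dup
  # 'rack.hijack' is an instruction to the server, not a header to send.
  hijack = headers.delete('rack.hijack')

  # The server always writes the status line and headers itself.
  socket.write("HTTP/1.1 #{status} #{REASON_PHRASES.fetch(status, 'Unknown')}\r\n")
  headers.each { |name, value| socket.write("#{name}: #{value}\r\n") }
  socket.write("\r\n")

  if hijack
    # The body is ignored; the lambda now owns the socket and must close it.
    hijack.call(socket)
  else
    body.each { |chunk| socket.write(chunk) }
    socket.close
  end
end

streaming_app = lambda do |_env|
  body_writer = lambda do |io|
    begin
      3.times { |i| io.write("Line #{i + 1}!\n") }
    ensure
      io.close
    end
  end
  [200, { 'Content-Type' => 'text/plain', 'rack.hijack' => body_writer }, nil]
end

socket = StringIO.new
serve_with_partial_hijack(streaming_app, socket)
puts socket.string
```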
Issues with the hijacking API
Here’s how we think the hijacking API can be improved.
- env['rack.hijack?'] appears to be unnecessary. You can already check for hijacking support by checking env['rack.hijack'].
- The partial hijacking API should not involve assigning a lambda to the response headers. As far as we can see, you can just return the lambda as the body. That would be a much more elegant solution.
- The return value for env['rack.hijack'] should be well-defined.
Conclusion
The Rack hijacking API, while having some quirks in our opinion, gets the job done. We hope this article has made the usage of the hijacking API clearer. If you have any comments, questions, suggestions or corrections, please let us know.
We at Phusion are working feverishly on the upcoming Phusion Passenger 4 (covered here, here and here). Implementing the hijacking API so quickly is our way of showing you how dedicated we are. Together with Phusion Passenger Enterprise, we aim to deliver the most stable, performant and feature-rich polyglot application server out there. If you’re interested in future updates, please subscribe to our newsletter. Until next time!
Phusion Passenger 4.0 supports JRuby, Rubinius
Phusion Passenger is an Apache and Nginx module for deploying Ruby and Python web applications. It has a strong focus on ease of use, stability and performance. Phusion Passenger is built on top of tried-and-true, battle-hardened Unix technologies, yet at the same time introduces innovations not found in most traditional Unix servers. Since mid-2012, it has aimed to be the ultimate polyglot application server.
In the announcement for Phusion Passenger 4.0.0 beta 1, we introduced a myriad of changes such as support for multiple Ruby versions, Python WSGI support, multithreading (Enterprise only), improved zero-copy architecture, better error diagnostics and more. And as we promised, the story would not end there. A commit adding JRuby (1.7.0 required) and Rubinius support has just landed in our GitHub repository!
JRuby: the past and the current state of affairs
JRuby is an excellent Ruby implementation for the JVM, and in the past few years they have been doing a great job with regard to compatibility and performance. But for a long time, application server support for JRuby had been limited:
- While Mongrel and Thin had limited JRuby support, these setups have not been very popular. Since so few people use these setups, their caveats are not very well known.
- Unicorn does not support JRuby at all because it was designed to take advantage of Unix features, which JRuby does not (and cannot) always support well.
- Phusion Passenger was in the same position: we used too many Unix features and were not able to support JRuby well.
- Goliath does not seem to have official support for JRuby because of the unclear status of EventMachine’s Java support.
So the only options left were J2EE app servers such as JBoss, Tomcat, GlassFish and TorqueBox; as well as the recently developed Puma, which is almost pure Ruby.
Thanks to the new ApplicationPool and Spawner architecture in Phusion Passenger 4, we’re now able to support JRuby with ease. Because a lot of code has been moved into C++, we no longer need the Ruby implementation to support Unix features. We only needed an hour to add support for JRuby.
Phusion Passenger vs J2EE
With Phusion Passenger’s support for JRuby, you don’t need to learn about J2EE deployment. Using JRuby on Phusion Passenger is very straightforward: set PassengerRuby to your JRuby command, point the virtual host’s document root to your application’s ‘public’ directory, and you’re done.
With Phusion Passenger Enterprise, JRuby users get to enjoy all the enterprise features such as multithreading, rolling restarts, deployment error resistance, time and memory usage limiting, and more.
Rubinius is impressive as well
We remember that back in the day, Rubinius was quite slow during startup and did not support MRI native extensions. Fast forward to 2012, and what we find is a very impressive Ruby implementation. They have 1.9 support and MRI extension support. The Ruby interpreter starts quickly. They support Unix features. Adding Rubinius support was pretty straightforward. The Rubinius team has done an excellent job!
Why you should use JRuby or Rubinius
JRuby and Rubinius support real multi-core concurrency. JRuby and Rubinius threads map to real OS threads, and neither implementation has a global interpreter lock. In contrast, MRI Ruby 1.8 uses userspace threading and so cannot take advantage of multiple cores within a single process. MRI Ruby 1.9 has real OS threads, but also has a global interpreter lock and so still cannot take advantage of multiple cores within a single process.
Granted, the multi-core issue isn’t that big. Phusion Passenger spawns multiple processes in order to take advantage of multi-core. But if you’re in a position in which you can only use 1 process, for whatever reason, then JRuby and Rubinius are what you need. With Phusion Passenger Enterprise’s multithreading support, you can have hybrid multi-processed and multi-threaded applications – the best of both worlds.
JRuby and Rubinius also often have superior performance. Both implementations support JIT compilation, which MRI Ruby does not.
That said, MRI Ruby still has the best compatibility in the Ruby ecosystem, so JRuby and Rubinius are not silver bullets. You should use the right tool for the job. With Phusion Passenger 4’s support for multiple Ruby versions, this should be a breeze.
Where to get Phusion Passenger with JRuby/Rubinius support
JRuby/Rubinius support will become part of the upcoming 4.0.0 beta 2. Please stay tuned!
SHA-3 extensions for Ruby and Node.js
A few days ago, NIST announced the winner of the SHA-3 competition: Keccak (pronounced [kɛtʃak], ketchak). The researchers who authored Keccak released a reference implementation in C.
We’ve created Ruby and Node.js extensions for Keccak. Our extensions use the code from the official reference implementation and come with an extensive suite of unit tests. Note, however, that I do not claim to be a security expert. Feel free to review the code for any flaws.
Install with:
gem install digest-sha3
npm install sha3
We’ve strived to emulate the languages’ standard hash libraries’ interfaces, so using these extensions is straightforward:
require 'digest/sha3'
Digest::SHA3.hexdigest("foo")
and
var SHA3 = require('sha3');
var hash = new SHA3.SHA3Hash();
hash.update('foo');
hash.digest();
Both libraries are MIT licensed. Enjoy!
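Because Digest::SHA3 is meant to mimic Ruby's standard Digest interface, the standard library's own Digest::SHA256 gives a runnable picture of the calling conventions to expect, both one-shot and incremental. The SHA-256 stand-in is an assumption made so the sketch runs without the digest-sha3 gem; it does not confirm that every method below exists on Digest::SHA3.

```ruby
require 'digest'

# One-shot digest of a whole string.
one_shot = Digest::SHA256.hexdigest("foo")

# Incremental digest: feed the data in pieces, then read the result.
d = Digest::SHA256.new
d.update("f")
d << "oo"            # << is an alias for update
incremental = d.hexdigest

puts one_shot
puts one_shot == incremental   # prints true
```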
Why does the world need SHA-3?
If you’re not a security researcher then you’ve undoubtedly asked the same questions as I did. What’s wrong with SHA-1, SHA-256 and SHA-512? Why does the world need SHA-3? Why was Keccak the winner?
According to well-known security researcher Bruce Schneier, there’s nothing wrong with the SHA-2 family of hashing functions. However he likes SHA-3 because it’s completely different. SHA-1 and SHA-2 are both based on the Merkle–Damgård construction. It may be feasible to find a flaw in SHA-1 that also affects SHA-2. In contrast, Keccak is based on the sponge construction so any attacks on SHA-1 and SHA-2 will probably not affect Keccak. Indeed, it appears that NIST chose Keccak because it was looking for some kind of insurance in case SHA-2 would be broken. Many people commented that they expected Skein to win.
Do not hash your passwords with a fast hash
In any case, the following cannot be repeated enough. Do not use SHA-3 (or SHA-256, SHA-512, RIPEMD-160 or whatever fast hash) to hash passwords! Do not even use SHA-3 + salt to hash passwords. Instead, use a slow hash like bcrypt. As Coda Hale explained in his article, you’ll want the hash to be slow so you can defend against attackers effectively.
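bcrypt-ruby is a separate gem, so this sketch swaps in PBKDF2 from Ruby's bundled OpenSSL bindings to illustrate the same principles: a random per-password salt, a tunable work factor that makes every guess expensive, and a constant-time comparison. It assumes Ruby 2.5 or later (for OpenSSL::KDF); the iteration count is illustrative, not a recommendation, and the advice above to use bcrypt still stands.

```ruby
require 'openssl'
require 'securerandom'

ITERATIONS = 100_000   # work factor: raise it to make each guess more expensive

# Returns [salt, derived_key]; a fresh random salt is generated unless one is given.
def hash_password(password, salt = SecureRandom.random_bytes(16))
  key = OpenSSL::KDF.pbkdf2_hmac(password, salt: salt, iterations: ITERATIONS,
                                 length: 32, hash: 'sha256')
  [salt, key]
end

# Compare every byte without returning early, so timing doesn't reveal
# how many leading bytes matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

def verify_password(password, salt, expected_key)
  _salt, key = hash_password(password, salt)
  secure_equal?(key, expected_key)
end

salt, key = hash_password("s3cret")
puts verify_password("s3cret", salt, key)   # prints true
puts verify_password("guess", salt, key)    # prints false
```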
How to fix the Ruby 1.9 HTTPS/Bundler segmentation fault on OS X Lion
If you’ve installed a gem bundle on OS X Lion in the past few weeks then you may have seen the dreaded “[BUG] Segmentation fault” error, where Ruby seems to crash in the connect C function in http.rb. Upgrading to the latest Ruby 1.9.3 version (p194) doesn’t seem to help. Luckily someone has found a solution for this problem.
It turns out the segmentation fault is caused by an incompatibility between MacPorts’ OpenSSL and RVM. MacPorts installs everything to /opt/local but RVM does not look for OpenSSL in /opt/local. We solved the problem by reinstalling Ruby 1.9.3 with the MacPorts OpenSSL, as follows.
First edit $HOME/.rvmrc and add:
export CFLAGS="-O2 -arch x86_64"
export LDFLAGS="-L/opt/local/lib"
export CPPFLAGS="-I/opt/local/include"
Then run:
sudo port install libyaml
rvm reinstall ruby-1.9.3 --with-openssl-dir=/opt/local --with-opt-dir=/opt/local
Bundler and public applications
I think Bundler is a great tool. Its strength lies not in its ability to install all the gems that you’ve specified, but in automatically figuring out a correct dependency graph so that no gems conflict with each other, and in the fact that it gives you rock-solid guarantees that the gems you use in development are exactly the gems you get in production. No more weird gem version conflict errors.
This is awesome for most Ruby web apps that are meant to be used internally, e.g. things like Twitter, Basecamp, Union Station. Unfortunately, this strength turns into a kind of weakness when it comes to public apps like Redmine and Juvia. These apps typically allow the user to choose their database driver through config/database.yml. However the driver must also be specified inside Gemfile, otherwise the app cannot load it. The result is that the user has to edit both database.yml and Gemfile, which introduces the following problems:
- The user may not necessarily be a Ruby programmer. The Gemfile will confuse him.
- The user is not able to use the Gemfile.lock that the developer has provided. This makes installing in deployment mode with the developer-provided Gemfile.lock impossible.
This can be worked around, in a very messy way, with groups. For example:
group :driver_sqlite do
  gem 'sqlite3'
end
group :driver_mysql do
  gem 'mysql'
end
group :driver_postgresql do
  gem 'pg'
end
And then, if the user chose to use MySQL:
bundle install --without='driver_postgresql driver_sqlite'
This is messy because you have to exclude all the things you don’t want. If the app supports 10 database drivers then the user has to put 9 drivers on the exclusion list.
How can we make this better? I propose supporting conditionals in the Gemfile language. For example:
condition :driver => 'sqlite' do
  gem 'sqlite3'
end
condition :driver => 'mysql' do
  gem 'mysql'
end
condition :driver => 'postgresql' do
  gem 'pg'
end
condition :driver => ['mysql', 'sqlite'] do
  gem 'foobar'
end
The following command would install the mysql and the foobar gems:
bundle install --condition driver=mysql
Bundler should enforce that the driver condition is set: if it’s not set then it should raise an error. To allow for the driver condition to not be set, the developer must explicitly define that the condition may be nil:
condition :driver => nil do
  gem 'null-database-driver'
end
Here, running bundle install without a --condition option will install null-database-driver.
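The proposed semantics can be pinned down with a small interpreter. This is a hypothetical sketch of the `condition` syntax proposed above, not Bundler code: `ConditionalGemfile` evaluates a Gemfile-like block against the condition values that `--condition` would supply, raises when a condition is unset and no `nil` branch allows that, and returns the gems that would be installed.

```ruby
# Hypothetical evaluator for the proposed `condition` Gemfile syntax.
# This is NOT Bundler code; it only models the semantics described above.
class ConditionalGemfile
  Branch = Struct.new(:name, :allowed, :gems)

  # values: condition values from the command line, e.g.
  # `--condition driver=mysql` would become { driver: 'mysql' }.
  def self.resolve(values, &definition)
    gemfile = new
    gemfile.instance_eval(&definition)
    gemfile.gems_for(values)
  end

  def initialize
    @unconditional = []
    @branches = []
    @collecting = nil
  end

  # Gems declared outside any condition block are always installed.
  def gem(name)
    (@collecting || @unconditional) << name
  end

  # Accepts { name => value }, { name => [values] } or { name => nil }.
  def condition(spec)
    name, allowed = spec.first
    gems = []
    previous = @collecting
    @collecting = gems
    begin
      yield
    ensure
      @collecting = previous
    end
    @branches << Branch.new(name, allowed.nil? ? [nil] : Array(allowed), gems)
  end

  def gems_for(values)
    selected = @unconditional.dup
    @branches.map(&:name).uniq.each do |name|
      value = values[name]
      branches = @branches.select { |b| b.name == name }
      # An unset condition is an error unless a `nil` branch explicitly allows it.
      if value.nil? && branches.none? { |b| b.allowed.include?(nil) }
        raise ArgumentError, "condition #{name.inspect} must be set"
      end
      branches.each { |b| selected.concat(b.gems) if b.allowed.include?(value) }
    end
    selected
  end
end

drivers = proc do
  condition(:driver => 'sqlite')            { gem 'sqlite3' }
  condition(:driver => 'mysql')             { gem 'mysql' }
  condition(:driver => 'postgresql')        { gem 'pg' }
  condition(:driver => ['mysql', 'sqlite']) { gem 'foobar' }
end

p ConditionalGemfile.resolve({ driver: 'mysql' }, &drivers)
```

Keeping evaluation (collecting branches) separate from resolution is what makes the "must be set unless a nil branch exists" rule checkable: the rule depends on every branch for a condition, not just the first one encountered.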
With this proposal, user installation instructions can be reduced to these steps:
- Edit database.yml and specify a driver.
- Run bundle install --condition driver=(driver name)
I’ve opened a ticket for this proposal. What do you think?
Making Ruby threadable: properly handling context switching in native extensions
In the previous article Does Rails Performance Need an Overhaul? we discussed the fact that proper Ruby threading is hindered by various broken native extensions. Writing a native extension for Ruby is pretty easy; writing it right, however, can not only be difficult, but can also be an obscure practice that requires l33t sk1llz because of the lack of documentation in this area. We’ve written several native extensions so far, and in the process of figuring out how to make threading-friendly native extensions we had to wade through tons of Ruby source code. In this article I want to teach some best practices for writing threading-friendly native extensions.
Threading basics
As discussed in the previous article, Ruby 1.8 implements userspace threads, meaning that no matter how many Ruby threads you have, only one can run at a time, and only on a single CPU core. The threads are scheduled by Ruby itself, not by the operating system.
Ruby 1.9 implements native operating system threads. However it has a global interpreter lock which must be locked when the thread is running Ruby code. This effectively makes Ruby 1.9 single threaded most of the time.
With both Ruby 1.8 and 1.9 threads, system calls such as I/O operations can block the thread and prevent Ruby from context switching to another. Thus, system calls require special attention. Expensive calculations that do not involve system calls can also block the thread, but something can be done about those as well, as you will read later on.
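For contrast with a thread-unfriendly extension, here is a small Ruby-level sketch: a thread blocked in IO#read on a pipe does not stop a second thread from running, because Ruby's own I/O methods cooperate with the scheduler on 1.8 and release the global interpreter lock on 1.9. Only the standard library is used; the sleep intervals are arbitrary.

```ruby
reader, writer = IO.pipe

# This thread blocks until 5 bytes (or EOF) arrive on the pipe.
blocked = Thread.new { reader.read(5) }

# Meanwhile another thread keeps making progress.
counter = Thread.new do
  count = 0
  10.times { count += 1; sleep 0.01 }
  count
end

progress = counter.value   # completes even though `blocked` is still waiting
writer.write("hello")
writer.close               # closing also flushes any buffered data
data = blocked.value

puts progress   # prints 10
puts data       # prints hello
```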
Handling I/O
Suppose that you have a file descriptor on which you want to perform some potentially blocking I/O. The naive approach is to perform the I/O command anyway and risk blocking the entire Ruby process. This is exactly what makes the mysql extension thread-unfriendly: while waiting on MySQL no other threads can run, grinding your multi-threaded Rails web app to a halt.
However, there are a number of functions in your arsenal that you can use to combat this problem. And as a general rule, you should set file descriptors to non-blocking mode.
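At the Ruby level, that general rule looks like the following sketch, assuming a Unix platform where Fcntl::O_NONBLOCK is defined: the descriptor's flags are read, O_NONBLOCK is added, and a read on an empty pipe then fails immediately with EAGAIN, which Ruby's read_nonblock surfaces as an IO::WaitReadable exception, instead of blocking.

```ruby
require 'fcntl'

reader, writer = IO.pipe   # writer stays open, so there is simply no data yet

# Switch the read end to non-blocking mode while keeping its other flags.
flags = reader.fcntl(Fcntl::F_GETFL)
reader.fcntl(Fcntl::F_SETFL, flags | Fcntl::O_NONBLOCK)
nonblocking = (reader.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK) != 0

# Nothing has been written, so a non-blocking read cannot succeed.
# read_nonblock reports EAGAIN/EWOULDBLOCK as an IO::WaitReadable exception.
result =
  begin
    reader.read_nonblock(10)
  rescue IO::WaitReadable
    :would_block
  end

puts nonblocking   # prints true
p result           # prints :would_block
```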
rb_thread_wait_fd(fd)
Just before performing a blocking read, you should call rb_thread_wait_fd() on the file descriptor that you’re reading from. On 1.8, this function marks the current thread as waiting for readable data on this file descriptor and then invokes the scheduler. The scheduler uses the select() system call to check which file descriptors are readable and then selects a thread which may continue. If the file descriptor that you were waiting on is not readable, then your thread will be suspended until the next time the scheduler is invoked and selects your thread. But even if the file descriptor is immediately readable, the scheduler does not guarantee that your thread will be selected immediately.
On 1.9, rb_thread_wait_fd() simply unlocks the global interpreter lock, calls the select() system call on the given file descriptor, and re-acquires the global interpreter lock when select() returns. While select() is blocking, other threads can run.
As an optimization, if only the main thread exists then this function does nothing. This applies to both 1.8 and 1.9.
rb_thread_fd_writable(fd)
This works the same as rb_thread_wait_fd(), but waits until the given file descriptor becomes writable. The single-thread optimization applies here too. You should call rb_thread_fd_writable() just before you perform a write I/O operation.
rb_thread_select()
To wait on multiple file descriptors, use this function instead of select() or poll(). Unlike the native system calls, this function will take care of invoking the scheduler or unlocking the global interpreter lock. Unlike rb_thread_wait_fd() and rb_thread_fd_writable(), there is no do-nothing-when-there’s-only-one-thread optimization here so it will always invoke the scheduler and call select().
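Ruby code has a direct counterpart in IO.select, which, like rb_thread_select(), waits on several descriptors without stalling the other Ruby threads. A minimal sketch with two pipes, only one of which has data:

```ruby
a_reader, a_writer = IO.pipe
b_reader, b_writer = IO.pipe

b_writer.syswrite("ping")   # unbuffered write, so the data is in the pipe now

# Wait up to 1 second for any of the read ends to become readable.
readable, _writable, _errored = IO.select([a_reader, b_reader], nil, nil, 1)

ready = readable.map { |io| io == a_reader ? :a : :b }
p ready                        # prints [:b]
p b_reader.read_nonblock(4)    # prints "ping"
```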
rb_io_wait_readable()
I/O system calls can return a variety of error codes that indicate that you should restart the system call, such as EINTR (system call interrupted by signal) and EAGAIN (the file descriptor is set to non-blocking mode and the data is not yet available). You should therefore always call I/O system calls in a loop until they return success or a different error code. You must not, however, forget to call rb_thread_wait_fd() or rb_thread_select() before you restart the system call, or you will risk blocking the thread again.
Ruby provides a function rb_io_wait_readable() to aid you in writing restart code. This function should be called right after your I/O reading system call has returned. It checks whether the system call should be restarted (returning Qtrue) or whether you should report an error (returning Qfalse). Here’s a code example:
char buf[4096];
ssize_t ret;
int done = 0;

/* Have the Ruby scheduler suspend this thread until the file descriptor becomes
 * readable; or if this is the only thread in the system, rb_thread_wait_fd() does
 * nothing and we immediately continue to the 'do' loop.
 */
rb_thread_wait_fd(fd);
do {
    /* Actually you should surround your system call with some more code, but
     * we'll get to this later. This example code is only partial. */
    ret = read(fd, buf, sizeof(buf));
    if (ret == -1) {
        if (rb_io_wait_readable(fd) == Qfalse) {
            /* Not a restartable error: raise a SystemCallError based on errno. */
            rb_sys_fail("read");
        } /* else restart loop */
    } else {
        done = 1;
    }
} while (!done);
rb_io_wait_readable() checks whether errno equals EINTR or ERESTART, in which case it calls rb_thread_wait_fd() on the file descriptor and returns Qtrue. If errno is EAGAIN or EWOULDBLOCK then it calls rb_thread_select() on the file descriptor and returns Qtrue. Otherwise it returns Qfalse.
The difference between calling rb_thread_wait_fd() and rb_thread_select() here is subtle, but important. The former only blocks (calls select() on the file descriptor) when there are multiple Ruby threads in the Ruby process, while the latter always blocks no matter what. This behavior is important because EAGAIN and EWOULDBLOCK occur when a non-blocking file descriptor is not yet readable; if we don’t block here on a select() then the code will enter a 100% CPU busy loop.
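The same restart pattern exists at the Ruby level: read_nonblock raises IO::WaitReadable on EAGAIN/EWOULDBLOCK, and the caller blocks in IO.select before retrying, which mirrors what rb_io_wait_readable() does on EAGAIN and avoids the busy loop. This is a sketch only: the helper name read_with_retry is hypothetical, and EOF handling is left out.

```ruby
# Read up to `maxlen` bytes, waiting (not spinning) while no data is available.
def read_with_retry(io, maxlen)
  begin
    io.read_nonblock(maxlen)
  rescue IO::WaitReadable
    # Like blocking in select() on EAGAIN: wait until readable, then retry.
    IO.select([io])
    retry
  end
end

reader, writer = IO.pipe
producer = Thread.new do
  sleep 0.05            # make sure the first read attempt finds no data
  writer.write("data")
  writer.close
end

chunk = read_with_retry(reader, 16)
producer.join
p chunk   # prints "data"
```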
rb_io_wait_writable()
Works the same way as rb_io_wait_readable(). Use this for I/O write operations instead.
Sleeping
Use rb_thread_wait_for() instead of sleep() or usleep(). On 1.8 rb_thread_wait_for() marks the current thread as sleeping for a period of time and then invokes the scheduler, which does not select this thread until the period of time has expired. On 1.9 Ruby unlocks the global interpreter lock, calls some sleeping function, and then re-locks it after that function returns.
Other non-I/O blocking system calls
Sometimes you will want to wait on a blocking system call that isn’t related to I/O, such as waitpid(). There are several ways to deal with these kinds of system calls.
Blocking outside the global interpreter lock
This method only works on Ruby 1.9. Unlock the global interpreter lock, do your thing, then re-lock it. Dealing with the global interpreter lock will be discussed later.
Non-blocking polling
Some system calls have non-blocking equivalents which return immediately instead of blocking. For example, waitpid() blocks by default, but if you pass the WNOHANG flag it returns 0 immediately when no child has changed state yet. You must call the non-blocking version in a loop. Whenever the call reports that it would have blocked, you must call rb_thread_polling(). On 1.8 this function lets the scheduler put the current thread to sleep for 60 msec, on 1.9 for 100 msec.
For example, Ruby’s Process.waitpid does not block other threads. On 1.9 it simply unlocks the global interpreter lock while blocking on waitpid(). On 1.8 it is implemented as follows (simplified version):
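/* pid and status are supplied by the caller. */
rb_pid_t result;

retry:
    result = waitpid(pid, &status, WNOHANG);
    if (result == 0) {
        /* Child hasn't exited yet. Tell the scheduler and then restart the call. */
        rb_thread_polling();
        goto retry;
    } else if (result < 0) {
        if (errno == EINTR) {
            /* Interrupted by a signal. Let other threads run, then restart the call. */
            rb_thread_polling();
            goto retry;
        } else {
            rb_sys_fail("waitpid");
        }
    }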
The actual code is more optimized than this. For example, if there's only a single thread in the system then it calls waitpid() without WNOHANG and just lets it block.
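Ruby exposes the same non-blocking polling mode: with Process::WNOHANG, Process.waitpid2 returns nil while the child is still running. A sketch that polls a short-lived child process, sleeping between polls so that other threads get a chance to run, much as rb_thread_polling() does in C. The 0.05 second interval is arbitrary, and Process.spawn requires Ruby 1.9 or later.

```ruby
require 'rbconfig'

# Start a child Ruby process that sleeps briefly and exits with status 3.
pid = Process.spawn(RbConfig.ruby, '-e', 'sleep 0.2; exit 3')

status = nil
loop do
  # WNOHANG: return nil immediately instead of blocking if the child is still running.
  result = Process.waitpid2(pid, Process::WNOHANG)
  if result
    status = result[1]
    break
  end
  sleep 0.05   # yield to other threads before polling again
end

puts status.exitstatus   # prints 3
```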
Calling the system call in a native OS thread and use I/O to report results
This is probably the most complex way but on 1.8 sometimes you don't have any choice. On 1.9 you should always prefer unlocking the global interpreter lock over this method.
Create a pipe, then spawn a native OS thread which calls the system call. When the system call is done, have your native thread report the result back via the pipe. On the Ruby side, use rb_thread_wait_fd() and friends to block on the pipe and then receive the results. Be sure to join the thread after you've read the result because rb_thread_wait_fd() does not necessarily block until there is data, so when rb_thread_wait_fd() returns it is not guaranteed that the thread has returned yet.
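The order of operations in this protocol can be sketched at the Ruby level. The key assumption: a Ruby Thread stands in for the native OS thread that a real extension would create in C, so this only illustrates the sequence (create pipe, start worker, wait on the read end, read the result, then join). slow_computation is a hypothetical placeholder for the blocking system call.

```ruby
# Hypothetical stand-in for a blocking system call.
def slow_computation
  sleep 0.1
  42
end

reader, writer = IO.pipe

worker = Thread.new do
  result = slow_computation
  writer.write([result].pack('N'))   # report the result as 4 bytes over the pipe
  writer.close
end

IO.select([reader])                  # wait until the worker has reported something
value = reader.read(4).unpack('N').first
worker.join                          # having the result doesn't mean the worker has exited
p value                              # prints 42
```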
Another thing to watch out for is that your thread must not refer to data that's on the Ruby thread's stack. This is because Ruby overwrites the main OS thread's C stack upon context switching to another Ruby thread. For example code like this is not OK:
#include <pthread.h>

static void *thread_main(void *arg) {
    int *value = arg;
    /* 'value' here refers to the 'value' variable on foobar's stack, but
     * that data is overwritten when Ruby context switches, so we
     * really can't use 'value' here!
     */
    (void) value;
    return NULL;
}

/* Native extension Ruby method. */
static VALUE foobar(VALUE self) {
    int value = 1234;
    pthread_t thread;

    pthread_create(&thread, NULL, thread_main, &value);
    ...do something which can cause a Ruby thread context switch...
    pthread_join(thread, NULL);
    return Qnil;
}
To pass data to the thread, you should put the data on the heap instead of the stack. This is OK:
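#include <pthread.h>

typedef struct {
    int value; /* example field; add whatever else the thread needs */
} Data;

static void *thread_main(void *arg) {
    Data *data = arg;
    /* 'data' lives on the heap, so it is safe to access. */
    (void) data;
    return NULL;
}

/* Native extension Ruby method. */
static VALUE foobar(VALUE self) {
    Data *data = ALLOC(Data); /* raises NoMemoryError instead of returning NULL */
    pthread_t thread;

    data->value = 1234;
    pthread_create(&thread, NULL, thread_main, data);
    ...do something which can cause a Ruby thread context switch...
    pthread_join(thread, NULL);
    xfree(data);
    return Qnil;
}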
Heavy CPU computations
Blocking system calls are not the only thing that can block other threads; CPU-heavy computation code can too. While executing non-Ruby-API C code, context switching to other threads is not possible. Calls to Ruby APIs may sometimes cause context switching. However, there are several ways to make context switching possible while running CPU-heavy computations.
Unlocking the global interpreter lock
This only works on 1.9. Unlock the global interpreter lock and then call the computation code, and relock when done. Consider BCrypt-Ruby as an example. BCrypt is a very heavy hashing algorithm used for securely hashing passwords; depending on the configured cost it could need several minutes to calculate a hash. We've recently patched BCrypt-Ruby to unlock the global interpreter lock while running the BCrypt algorithm, so that when you run BCrypt-Ruby in multiple threads the algorithms can be spread across multiple CPU cores.
However, be aware of the fact that unlocking and relocking the global interpreter lock comes with some overhead as well. Unlocking and relocking the global interpreter lock is only worth it if you know that the computation is going to take a while (say, longer than 50 msec). If the computation time is short then you will actually make your code slower because of all the locking overhead. Therefore BCrypt-Ruby only unlocks the global interpreter lock if the BCrypt cost is set to 9 or higher.
Explicit yielding
You can call rb_thread_schedule() once in a while to force context switching to another thread. However this approach does not allow your code to make use of multiple cores even if you're on 1.9.
Running the C code in a native OS thread
This is pretty much the same approach as described in "Calling the system call in a native OS thread and use I/O to report results". In my opinion, unless your computation takes a very long time, implementing this is almost never worth the trouble. For BCrypt-Ruby we didn't bother: if you want multi-core support in BCrypt-Ruby you need to be on 1.9.
TRAP_BEG/TRAP_END and the global interpreter lock
TRAP_BEG and TRAP_END
On 1.8, you should surround system calls with calls to TRAP_BEG and TRAP_END. TRAP_BEG performs some preparation work. TRAP_END performs a variety of things:
- It checks whether there are any pending signals, e.g. whether the user pressed Ctrl-C. If so it will raise an appropriate SignalException.
- It also calls the scheduler if a certain amount of time has been spent on the current thread.
On 1.9 TRAP_BEG and TRAP_END are macros that unlock and lock the global interpreter lock. However these macros are deprecated and are likely to disappear in the future so you should not use them on 1.9. Instead, you should use rb_thread_blocking_region().
On 1.9 TRAP_BEG and TRAP_END are defined in ruby/backward/rubysig.h.
rb_thread_blocking_region()
This is a 1.9-specific function which allows you to call a function outside the global interpreter lock. Its declaration is as follows:
VALUE rb_thread_blocking_region(rb_blocking_function_t *func, void *data1,
                                rb_unblock_function_t *ubf, void *data2);
func is a pointer to a function that is to be called outside the global interpreter lock. This function must look similar to:
VALUE foobar(void *data)
The data passed via the data1 parameter is passed to the function.
ubf is the "unblocking function": when another thread needs to interrupt this one (for example via Thread#kill or signal delivery), Ruby calls ubf, passing it data2, and ubf should make func return as soon as possible. Ruby provides two built-in unblocking functions: RUBY_UBF_IO, for when you're performing some kind of I/O operation, and RUBY_UBF_PROCESS, for when you're calling some kind of process management system call.
The return value of this function is the return value of func.
Global interpreter lock caveats
Do not call any Ruby API functions while the global interpreter lock is unlocked! No rb_yield(), rb_str_new(), or anything. The entirety of the Ruby API is only safe to call when the global interpreter lock is obtained.
Does Rails Performance Need an Overhaul?
Igvita.com has recently published the article Rails Performance Needs an Overhaul. Rails performance… no, Ruby performance… no Rails scalability… well something is being criticized here. From my experience, talking about scalability and performance can be a bit confusing because the terms can mean different things to different people and/or in different situations, yet the terms are used interchangeably all the time. In this post I will take a closer look at Igvita’s article.
Performance vs scalability
Let us first define performance and scalability. I define performance as throughput: the number of requests per second. I define scalability as the number of users a system can concurrently handle. There is a correlation between performance and scalability. Higher performance means each request takes less time, so the system is more scalable, right? Sometimes yes, but not necessarily. It is entirely possible for a system to be scalable yet have lower throughput than a system that’s not as scalable, or for a system to be uber-fast yet not very scalable. Throughout this blog post I will show several examples that highlight the difference.
“Scalability” is an extremely loaded word and people often confuse it with “being able to handle tons and tons of traffic”. Let’s use a different term that better reflects what Igvita’s actually criticizing: concurrency. Igvita claims that concurrency in Ruby is pathetic while referring to database drivers, Ruby application servers, etc. Some practical examples that demonstrate what he means are as follows.
Limited concurrency at the app server level
Mongrel, Phusion Passenger and Unicorn all use a “traditional” multi-process model in which multiple Ruby processes are spawned, each process handling a single request at a time. Thus, concurrency is limited by the number of Ruby processes (assuming that the load balancer has infinite concurrency): having 5 processes allows you to handle 5 users concurrently.
Threaded servers, where the server spawns multiple threads, each handling one connection at a time, allow more concurrency because it’s possible to spawn a lot more threads than processes. In the context of Ruby, each Ruby process needs to load its own copy of the application code and other resources, so memory usage increases very quickly as you spawn additional processes. Phusion Passenger with Ruby Enterprise Edition solves this problem somewhat by using copy-on-write optimizations which save memory, so you can spawn more processes, but not significantly (as in 10x) more. In contrast, a multi-threaded app server does not need as much memory because all threads share the application code, so you can comfortably spawn tens or hundreds of threads. At least, that is the theory. I will later explain why this does not necessarily hold for Ruby.
When it comes to performance, however, there’s no difference between processes and threads. If you compare a well-written multi-threaded app server with 5 threads to a well-written multi-process app server with 5 processes, you won’t find either one more performant than the other. The context switch overhead is roughly the same for processes and threads. Each process can use a different CPU core, as can each thread, so there’s no difference in multi-core utilization either. This again reflects the difference between scalability/concurrency and performance.
Multi-process Rails app servers have a concurrency level that can be counted on one hand or, if you have very beefy hardware, a concurrency level of a few dozen, because Rails needs about 25 MB per process. Multi-threaded Rails app servers can in theory spawn a couple of hundred threads. After that it’s game over too: an operating system thread needs a couple of MB of stack space, so after a couple of hundred threads you’ll run out of virtual address space on 32-bit systems even if you don’t actually use that much memory.
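As a rough back-of-the-envelope check of those limits, assume about 3 GiB of usable user address space per process (the usual split on 32-bit Linux), an 8 MiB stack reservation per thread (a common Linux default), and a hypothetical 1000 MB of RAM set aside for Rails processes at the 25 MB figure quoted above:

$$\frac{3072\ \text{MiB address space}}{8\ \text{MiB stack per thread}} = 384\ \text{threads}, \qquad \frac{1000\ \text{MB RAM}}{25\ \text{MB per Rails process}} = 40\ \text{processes}$$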
There is another class of servers, the evented ones. These servers are actually single-threaded, but they use a reactor style I/O dispatch architecture for handling I/O concurrency. Examples include Node.js, Thin (built on EventMachine) and Tornado. These servers can easily have a concurrency level of a couple of thousand. But due to their single-threaded nature they cannot effectively utilize multiple CPU cores, so you need to run a couple of processes, one per CPU core, to fully utilize your CPU.
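To make the reactor idea concrete, here is a toy sketch in plain Ruby: a single thread uses IO.select to wait on several IO objects at once and runs a callback for whichever one becomes readable. The Reactor class and its method names are hypothetical and are not EventMachine's or Thin's API; pipes stand in for client sockets so the example is self-contained.

```ruby
# Toy reactor: one thread watches many IO objects with IO.select and runs a
# callback for whichever ones have data. Pipes stand in for client sockets.
class Reactor
  def initialize
    @handlers = {} # IO => callback to run with each chunk of data
  end

  def on_readable(io, &callback)
    @handlers[io] = callback
  end

  # Loop until every watched IO has reached end-of-file.
  def run
    until @handlers.empty?
      readable, = IO.select(@handlers.keys)
      readable.each do |io|
        chunk = io.read_nonblock(4096, exception: false)
        case chunk
        when nil # EOF: stop watching and release the descriptor
          @handlers.delete(io)
          io.close
        when :wait_readable # spurious wakeup; retry on the next select
          next
        else
          @handlers[io].call(chunk)
        end
      end
    end
  end
end

received = []
reactor = Reactor.new
3.times do |i|
  reader, writer = IO.pipe
  reactor.on_readable(reader) { |chunk| received << "client #{i}: #{chunk}" }
  writer.write("hello")
  writer.close # closing the write end makes the reader see EOF after the data
end
reactor.run
puts received.sort
```

Real evented servers also track writability, timers and errors, and typically use epoll or kqueue instead of select() when watching large numbers of descriptors.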
The limits of Ruby threads
Ruby 1.8 uses userspace threads, not operating system threads. This means that Ruby 1.8 can only utilize a single CPU core no matter how many Ruby threads you create. This is why one typically needs multiple Ruby processes to fully utilize one’s CPU cores. Ruby 1.9 finally uses operating system threads, but it has a global interpreter lock, which means that each time a Ruby 1.9 thread is running it will prevent other Ruby threads from running, effectively making it the same multicore-wise as 1.8. This is also explained in an earlier Igvita article, Concurrency is a Myth in Ruby.
On the bright side, not all is bad. Ruby 1.8 internally uses non-blocking I/O while Ruby 1.9 unlocks the global interpreter lock while doing I/O. So if one Ruby thread is blocked on I/O, another Ruby thread can continue execution. Likewise, Ruby is smart enough to cause things like sleep() and even waitpid() to preempt to other threads.
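A small sketch of that behavior, assuming Ruby 2.1 or later for Process::CLOCK_MONOTONIC: five threads each block in sleep, which stands in for a slow I/O call. Because Ruby switches threads while one is blocked, the total wall-clock time should be close to a single wait rather than the sum of all five. CPU-bound work would not overlap like this under the global interpreter lock.

```ruby
# Five threads each block in a simulated I/O wait (sleep stands in for a slow
# socket read). Ruby runs other threads while one is blocked, so the waits
# overlap instead of adding up.
WAIT_SECONDS = 0.2 # assumed duration of each simulated I/O call

def simulated_io_call
  sleep(WAIT_SECONDS)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
threads = Array.new(5) { Thread.new { simulated_io_call } }
threads.each(&:join)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format("5 waits of %.1fs each finished in %.2fs", WAIT_SECONDS, elapsed)
```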
On the dark side however, Ruby internally uses the select() system call for multiplexing I/O. select() can only handle 1024 file descriptors on most systems so Ruby cannot handle more than this number of sockets per Ruby process, even if you are somehow able to spawn thousands of Ruby threads. EventMachine works around this problem by bypassing Ruby’s I/O code completely.
Naive native extensions and third party libraries
So just run a couple of multi-threaded Ruby processes, one process per core and multiple threads per process, and all is fine and we should be able to have a concurrency level of up to a couple hundred, right? Well not quite, there are a number of issues hindering this approach:
- Some third party libraries and Rails plugins are not thread-safe. Some aren’t even reentrant. For example Rails < 2.2 suffered from this problem. The app itself might not be thread-safe.
- Although Ruby is smart enough not to let I/O block all threads, the same cannot be said of all native extensions. The MySQL extension is the most infamous example: when executing queries, other threads cannot run.
Mongrel is actually multi-threaded, but in practice everybody uses it in multi-process mode (mongrel_cluster) precisely because of these problems. This is also the reason why Phusion Passenger has gone the multi-process route.
And even though Thin is evented, a typical Ruby web application running on Thin cannot handle thousands of concurrent users. This is because evented servers typically require a special evented programming style, such as the one seen in Node.js and EventMachine. A Ruby web app that is written in an evented style running on Thin can definitely handle a large number of concurrent users.
When is limited application server concurrency actually a problem?
Igvita is clearly disappointed at all the issues that keep Ruby web apps from achieving high concurrency. For many web applications, however, I would argue that limited concurrency is not a problem.
- Web applications that are slow, as in CPU-heavy, max out CPU resources pretty quickly so increasing concurrency won’t help you.
- Web applications that are fast typically handle requests quickly enough that even a large number of users won’t notice the limited concurrency of the server.
Having a concurrency of 5 does not mean that the app server can only handle 5 requests per second; it’s not hard to serve hundreds of requests per second with only a couple of single-threaded processes.
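One way to see why is Little's law from queueing theory: the number of requests in flight equals throughput multiplied by the time each request takes, so a fixed number of workers caps throughput at workers divided by time per request. The sketch below applies that bound; the 50 ms service time and the figure of 20 processes are assumptions chosen for illustration, and max_requests_per_second is a hypothetical helper.

```ruby
# Little's law rearranged: sustained throughput = concurrent workers / time per request.
def max_requests_per_second(workers, seconds_per_request)
  workers / seconds_per_request.to_f
end

# 5 single-threaded processes, each request taking an assumed 50 ms of work:
fast_app = max_requests_per_second(5, 0.05)
# 20 processes (an assumed "couple tens"), each stuck 10 s on a slow external call:
slow_io_app = max_requests_per_second(20, 10.0)

puts "fast app: #{fast_app.round} req/s, I/O-bound app: #{slow_io_app.round(1)} req/s"
```

Five processes with fast requests reach about 100 requests per second, while even 20 processes manage only about 2 requests per second once each request waits 10 seconds on I/O.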
The problem becomes most evident for web applications that have to wait a lot for I/O (besides its own HTTP request/response cycle). Examples include:
- Apps that have to spend a lot of time waiting on the database.
- Apps that perform a lot of external HTTP calls that respond slowly.
- Chat apps. These apps typically have thousands of users, most of them doing nothing most of the time, but they all require a connection (unless your app uses polling, but that’s a whole different discussion).
We at Phusion have developed a number of web applications for clients that fall into the second category, the most recent one being a Hyves gadget. Hyves is the most popular social network in the Netherlands and it gets thousands of concurrent visitors during the day. The gadget that we’ve developed has to query external HTTP servers very often, and these servers can take 10 seconds to respond in extreme cases. The servers are running Phusion Passenger with perhaps a few dozen processes. If every request to our gadget also made us wait 10 seconds for the external HTTP call, we’d soon run out of concurrency.
But even if our app and Phusion Passenger could handle a concurrency level of a couple of thousand, all of those visitors would still have to wait 10 seconds for the external HTTP calls, which is obviously unacceptable. This is another example that illustrates the difference between scalability and performance. We solved this problem by aggressively caching the results of the HTTP calls, minimizing the number of external HTTP calls that are necessary. As a result, even though the application’s concurrency is fairly limited, it can still comfortably serve many concurrent users with a reasonable response time.
This anecdote should explain why I believe that web apps can get very far despite having a limited concurrency level. That said, as Internet usage continues to increase and websites get more and more users, we may at some point reach a stage where a much larger concurrency level is required than most of our current Ruby tools allow (assuming server capacity doesn’t scale quickly enough).
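A minimal sketch of this kind of caching, assuming a simple in-process cache with a fixed time-to-live. The TtlCache class, the 60-second TTL and the example URL are hypothetical; the post does not say how the gadget's cache was actually implemented, and in a multi-process deployment a shared store such as memcached may be more practical, since each process would otherwise keep its own copy.

```ruby
# Hypothetical time-based cache: a result is reused until it is older than
# ttl seconds, so most requests skip the slow external HTTP call entirely.
class TtlCache
  def initialize(ttl, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @entries = {} # key => [value, time stored]
    @lock = Mutex.new
  end

  # Return the cached value for key, or run the block and cache its result.
  def fetch(key)
    @lock.synchronize do
      value, stored_at = @entries[key]
      return value if stored_at && @clock.call - stored_at < @ttl
    end
    value = yield # the slow call runs outside the lock so other threads aren't blocked
    @lock.synchronize { @entries[key] = [value, @clock.call] }
    value
  end
end

# Usage with a fake clock so expiry can be shown without waiting.
now = 0.0
cache = TtlCache.new(60, clock: -> { now })
external_calls = 0
lookup = -> { external_calls += 1; "response body" }

cache.fetch("http://example.com/feed") { lookup.call }
cache.fetch("http://example.com/feed") { lookup.call } # served from the cache
now = 61.0
cache.fetch("http://example.com/feed") { lookup.call } # expired, fetched again
puts "external calls made: #{external_calls}"
```

Two threads that miss the cache at the same moment will both perform the slow call; that keeps the sketch simple at the cost of occasional duplicate requests.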
What was Igvita.com criticizing?
Igvita.com does not appear to be criticizing Ruby or Rails for being slow. It doesn’t even appear to be criticizing the lack of Ruby tools for achieving high concurrency. It appears to be criticizing these things:
- Rails and most Ruby web application servers don’t allow high concurrency by default.
- Many database drivers and libraries hinder concurrency.
- Although alternatives exist that allow concurrency, you have to go out of your way to find them.
- There appears to be little motivation in the Ruby community for making the entire stack of web framework + web app server + database drivers etc. scalable by default.
This is in contrast to Node.js where everything is scalable by default.
Do I understand Igvita’s frustration? Absolutely. Do I agree with it? Not entirely. The same thing that makes Node.js so scalable is also what makes it relatively hard to program for. Node.js enforces a callback style of programming, and this can eventually make your code look a lot more complicated and harder to read than regular code that uses blocking calls. Furthermore, Node.js is relatively young, so of course you won’t find any Node.js libraries that don’t scale! But if people ever use Node.js for things other than high-concurrency server apps, then non-scalable libraries will eventually pop up, and you will have to look harder to avoid them. There is no silver bullet.
That said, all would be well if at least the preferred default stack could handle high concurrency by default. This means, for example, fixing the MySQL extension and having the fix published upstream. The mysqlplus extension fixes this, but for some reason its changes haven’t been accepted and published by the original author, so people end up with a multi-thread-killing database driver by default.
Is Node.js innovative? Is Ruby lacking innovation?
A minor gripe that I have with the article is that Igvita calls Node.js innovative while seemingly implying that the Ruby stack isn’t innovating. Evented servers like Node.js have actually been around for years, and the evented pattern was well known long before Ruby or JavaScript became popular. Thin is also evented and predates Node.js by several years. Thin and EventMachine also allow Node.js-style evented programming. In my opinion, the only innovation that Node.js brings is the fact that it’s JavaScript. The other “innovation” is the lack of non-scalable libraries.
Conclusion
Igvita appears to be criticizing something other than Rails performance, which is what his article’s title implies.
I don’t think the concurrency levels that the Rails stack provides by default are that bad in practice. But as a fellow programmer, it does intuitively bother me that our laptops, which are a million times more powerful than supercomputers from two decades ago, cannot comfortably handle a couple of thousand concurrent users. We can definitely work towards something better, but in the meantime let’s not forget that the current stack is more than capable of Getting Work Done(tm).
Securely store passwords with bcrypt-ruby; now compatible with JRuby and Ruby 1.9
When writing web applications, or any application for that matter, passwords should be stored securely. As a rule of thumb, one should never store passwords as clear text in the database, for the following reasons:
- If the database ever gets leaked out, then all accounts are compromised until every single user resets his password. Imagine that you’re an MMORPG developer; leaking out the database with clear text passwords allows the attacker to delete every player’s characters.
- Many people use the same password for multiple sites. Imagine that the password stored in your database is also used for the user’s online banking account. Even if the database does not get leaked out, the password is still visible to the system administrator; this can be a privacy breach.
There are several “obvious” alternatives, which aren’t quite secure enough:
- Storing passwords as MD5/SHA1/$FAVORITE_ALGORITHM hashes
- These days MD5 can be brute-force cracked with relatively little effort. SHA1, SHA2 and other algorithms are harder to brute-force, but the attacker can still crack these hashes by using rainbow tables: precomputed tables of hashes with which the attacker can look up the input for a hash with relative ease. This rainbow table does not have to be very large: it just has to contain words from the dictionary, because many people use dictionary words as passwords.
Using plain hashes also makes it possible for an attacker to determine whether two users have the same password.
- Encrypting the password
- This is not a good idea, because if the attacker was able to steal the database, there’s a possibility that he’s able to steal the key file as well. Plus, the system administrator is able to read everybody’s passwords, unless his access to either the key file or the database is restricted.
The solution is to store passwords as salted hashes. One calculates a salted hash as follows:
salted_hash = hashing_algorithm(salt + cleartext_password)
Here, salt is a random string. After calculating the salted hash, one should store the salted hash in the database, along with the (cleartext) salt. It is not necessary to keep the salt secret or to obfuscate it.
When a user logs in, one can verify his password by re-computing the salted hash and comparing it with the salted hash in the database:
salted_hash = hashing_algorithm(salt_from_database + user_provided_password)
if (salted_hash == salted_hash_from_database):
    user is logged in
else:
    password incorrect
The use of a salt forces the attacker to either brute-force the hash or use a ridiculously large rainbow table. In the latter case, the sheer size of the required rainbow table can make it impractical to generate. The larger the salt, the more difficult it becomes for the cracker to use rainbow tables.
However, even with salting, one should still not use SHA1, SHA2, Whirlpool or most other general-purpose hashing algorithms, because these algorithms are designed to be fast. Although brute-forcing SHA2 and Whirlpool is hard, it’s still possible given sufficient resources. Instead, one should pick a hashing algorithm that’s designed to be slow so that brute-forcing becomes infeasible. Bcrypt is such a slow hashing algorithm. A speed comparison on a MacBook Pro with a 2 GHz Intel Core 2 Duo:
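The salted-hash pseudocode above translates fairly directly into Ruby. This sketch uses SHA-256 from Ruby's standard library as hashing_algorithm purely for illustration (the following paragraphs explain why a fast hash is still a poor choice), and the helper names are hypothetical. It also compares hashes in constant time, a small hardening step beyond the plain == in the pseudocode.

```ruby
require "digest"
require "securerandom"

# The salted-hash scheme from the pseudocode, with SHA-256 standing in for
# hashing_algorithm. Illustration only: a fast hash is not recommended.
def salted_hash(salt, password)
  Digest::SHA256.hexdigest(salt + password)
end

# Returns what would be stored in the database: the clear-text salt and the hash.
def create_password_record(password)
  salt = SecureRandom.hex(16) # 16 random bytes, hex-encoded
  { salt: salt, hash: salted_hash(salt, password) }
end

# Compare without returning early, so timing doesn't reveal how much matched.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize
  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

def password_correct?(record, attempt)
  secure_equal?(salted_hash(record[:salt], attempt), record[:hash])
end

alice = create_password_record("hunter2")
bob = create_password_record("hunter2")
puts password_correct?(alice, "hunter2") # => true
puts password_correct?(alice, "letmein") # => false
puts alice[:hash] == bob[:hash]          # => false: same password, different salts
```

The last line shows the effect of the salt: two users with the same password end up with different stored hashes.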
- SHA-1: 118600 hashes per second.
- Bcrypt (with cost = 10): 7.7 hashes per second.
Theoretically it would take 4*10^35 years for a single MacBook Pro core to crack an SHA-1 hash, assuming that the attacker does not exploit any weaknesses in SHA-1. To crack a bcrypt hash one would need 6*10^39 years, or about 15,000 times longer. Therefore, we recommend using bcrypt to store passwords securely.
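The two cracking-time estimates are consistent with the measured hash rates: bcrypt's slowdown relative to SHA-1 carries over directly into the time needed to brute-force a hash.

$$\frac{118\,600\ \text{hashes/s (SHA-1)}}{7.7\ \text{hashes/s (bcrypt)}} \approx 1.5 \times 10^{4}, \qquad \frac{6 \times 10^{39}\ \text{years}}{4 \times 10^{35}\ \text{years}} = 1.5 \times 10^{4}$$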
There’s even a nice Ruby implementation of this algorithm: bcrypt-ruby! Up until recently, bcrypt-ruby was only available for MRI (“Matz Ruby Interpreter”, the C implementation that most people use). However, we’ve made it compatible with JRuby! The code can be found in our fork at Github. The current version also has issues with Ruby 1.9, which we’ve fixed as well. The author of bcrypt-ruby has already accepted our changes and will soon release a new version with JRuby and Ruby 1.9 support.
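bcrypt-ruby is a gem, so as a stdlib-only sketch of the same idea this example swaps in a different deliberately slow algorithm: PBKDF2-HMAC-SHA256 via OpenSSL::KDF.pbkdf2_hmac (available in Ruby 2.4 and later). It is not bcrypt; the iteration count plays the role of bcrypt's cost parameter, and the value of 100,000 is an assumed starting point rather than a recommendation from this post.

```ruby
require "openssl"
require "securerandom"

# Stdlib-only stand-in for a deliberately slow password hash. This is
# PBKDF2-HMAC-SHA256, NOT bcrypt: the iteration count plays the role that the
# cost factor plays in bcrypt, making every guess expensive for an attacker.
ITERATIONS = 100_000 # assumed work factor; raise it as hardware gets faster
KEY_LENGTH = 32      # bytes of derived key to store

def slow_hash(password, salt, iterations: ITERATIONS)
  OpenSSL::KDF.pbkdf2_hmac(password, salt: salt, iterations: iterations,
                                     length: KEY_LENGTH, hash: "sha256")
end

# Store the salt and iteration count alongside the digest so the work factor
# can be raised later without invalidating existing records.
def hash_password(password)
  salt = SecureRandom.random_bytes(16)
  { salt: salt, iterations: ITERATIONS, digest: slow_hash(password, salt) }
end

def verify_password(record, attempt)
  candidate = slow_hash(attempt, record[:salt], iterations: record[:iterations])
  return false unless candidate.bytesize == record[:digest].bytesize
  # Accumulate differences instead of stopping at the first mismatch.
  candidate.bytes.zip(record[:digest].bytes).inject(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

record = hash_password("correct horse")
puts verify_password(record, "correct horse") # => true
puts verify_password(record, "wrong guess")   # => false
```

With bcrypt-ruby itself, the equivalent calls are BCrypt::Password.create(password), which returns a storable string with the salt and cost embedded, and BCrypt::Password.new(stored_hash) == password to check a login attempt.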
Further recommended reading
How to Safely Store a Password by Coda Hale.
Getting ready for Ruby 1.9.1
We are excited about Ruby 1.9.1. Of course, with all the performance improvements, who wouldn’t be? Unfortunately a large number of Ruby libraries and extensions still don’t work on 1.9.1, so Ruby 1.9 cannot be considered production-ready yet. Ryan Bigg has done an excellent job of documenting most of the problems that one would encounter when trying to get a basic Rails app up and running on Ruby 1.9.1. Basically, the problems he encountered were:
- Rails 2.2.2 isn’t compatible with 1.9.1. Use Rails 2.3.0 RC1 or Rails edge.
- The mysql gem needs patching.
- The hpricot gem needs patching.
- The postgres gem needs patching.
- Thin needs patching.
- The fastthread gem needs patching.
- Mongrel needs patching.
But what about Phusion Passenger? Good news:
Phusion Passenger has been Ruby 1.9.1-compatible since this commit (made today).
Here’s a screenshot of a Rails 2.3.0 app running in Phusion Passenger on Ruby 1.9.1:
Do you see the changes? Me neither. That’s the point.
We’ve encountered the following issues while trying to get a simple Rails 2.3 app up and running with Phusion Passenger and Ruby 1.9.1:
- Fastthread isn’t compatible with 1.9
- Both Mongrel and Phusion Passenger depend on Fastthread, which is a threading library that fixes some threading implementation bugs in older versions of Ruby 1.8. Fastthread is only a required dependency when running on older versions of Ruby 1.8. Unfortunately there’s no way to tell RubyGems “we depend on fastthread, but only when running on older versions of Ruby 1.8, and not on JRuby”.
Fastthread doesn’t compile on Ruby 1.9 (or on JRuby or other Ruby implementations for that matter), so when you type “gem install passenger” or “gem install mongrel” on Ruby 1.9, the installation fails with a ton of compile errors.
We’ve patched fastthread so that it becomes a no-op on Ruby 1.9.1 and on JRuby (that is, fastthread will install correctly but it won’t do anything). These patches have been submitted to Mentalguy, the maintainer of fastthread.
- The sqlite3-ruby gem doesn’t work on 1.9
- Jeremy Kemper submitted a 1.9 compatibility patch in the past, which had been committed. Unfortunately even with this patch, sqlite3-ruby isn’t compatible with 1.9.1.
We’ve gone ahead and fixed 1.9.1 support. The patch can be found here: http://rubyforge.org/tracker/index.php?func=detail&aid=23792&group_id=254&atid=1045
Hongli Lai, Ninh Bui


Those who administer production Unix systems have undoubtedly encountered the problem of frozen processes before. They just sit around, consuming CPU and/or memory indefinitely until you forcefully shut them down.





Hongli Lai
Phusion. All rights reserved.