Resolving Web Application Resource Bottlenecks with Concurrency

This article is a general introduction to the bottlenecks of web applications and how application servers deal with them to improve request throughput and response times. We start by introducing the role of an application server. The following sections illustrate each of the three resource bottlenecks and explain how application servers work to resolve them. Finally, a short section covers the other criteria for choosing an application server.

The role of the application server

The fundamental role of the application server is to offer facilities that support the application in generating dynamic pages and serving them to clients. This role consists of a few basic tasks: opening a server socket, loading the web application, parsing incoming HTTP requests, and invoking the application through a generic interface (for Ruby this is Rack, for Python it is WSGI). Most application servers offer a lot more functionality to make life easier for developers and system administrators. Examples of such features are:

Optimizing resource usage by dispatching requests concurrently to multiple workers
Ensuring robustness by managing and respawning workers
Limiting resource usage per worker
Speeding up responses by utilizing caches
Protecting against failing deployments
Protecting against common HTTP level attacks
Generating metrics and events for monitoring and inspection

In the enterprise, where developers often have more restricted access to community ecosystems, it is also common for application servers to offer application-level services such as message queues, authentication services and database connection management.
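To make the generic interface mentioned above more concrete, here is a minimal sketch in Python of a WSGI application and of the call an application server makes into it. The names `application` and `call_app` are illustrative only, and `call_app` imitates just the final "invoke the application" step; a real server would build the environ from an HTTP request it parsed off its socket.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A tiny WSGI application: the part the developer writes."""
    body = f"Hello from {environ['PATH_INFO']}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path):
    """Imitates the server side: build an environ, invoke the app, collect the result."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path

    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `call_app(application, "/hello")` returns the status line and the response body, which is the information a server would then write back to the client.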

In the next section we will zoom in on the first item, optimizing resource usage, and explain the ways application servers achieve this. If you would like to learn more about the full list of features of our application server, Passenger, you can read about them on its product page.

Resolving resource bottlenecks through concurrency

The most common feature of application servers is to dispatch incoming requests to multiple instances of the application concurrently. In the next sections we will explain how this solves each of the I/O, CPU and memory bottlenecks.

The I/O bottleneck


This is the first bottleneck most applications run into. I/O stands for "Input / Output" and in this context denotes any activity where the processor dispatches a command to an external device and then waits for the result. The external device can be, for example, an attached hard disk or a network interface.

The basic I/O bottleneck occurs whenever an execution thread encounters an I/O command and sits idle waiting for a result before continuing. The waiting time is often measured in milliseconds, which is valuable time the processor could spend doing other useful things such as working on the next request.

Resolving it is one of the main drivers for the success of the Node.js platform, which works around the bottleneck by applying the Reactor pattern to every I/O command: instead of waiting, the program registers a callback and an event loop invokes it once the result is ready. For this pattern to work, every I/O API in its libraries must be asynchronous.
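As a rough illustration of this event loop style, the sketch below uses Python's asyncio as a stand-in for the Node.js event loop. The function names, the request count and the 50 ms delay are assumptions made for the example, and `asyncio.sleep` stands in for an asynchronous database or network call.

```python
import asyncio
import time


async def handle_request(request_id, io_delay=0.05):
    # Awaiting hands control back to the event loop, which is then free
    # to start or resume other requests while this one waits for I/O.
    await asyncio.sleep(io_delay)  # stand-in for an asynchronous I/O call
    return f"response {request_id}"


async def serve(request_count):
    # All requests are in flight at once on a single thread.
    return await asyncio.gather(*(handle_request(i) for i in range(request_count)))


def timed_serve(request_count):
    start = time.perf_counter()
    responses = asyncio.run(serve(request_count))
    return responses, time.perf_counter() - start
```

With ten requests, the total time should stay close to a single 50 ms wait rather than ten of them, because the waits overlap. The price is the one the text mentions: a single blocking call inside `handle_request` would stall every other request.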

A less intrusive way of handling the I/O problem is to run code in many threads that are scheduled on the CPU whenever they are not waiting for I/O.

Application servers such as Passenger Enterprise and Puma can distribute requests over many threads to deal with the I/O bottleneck. The graph below shows how the request throughput of an example application that performs I/O improves when the number of threads is increased.

In this benchmark an application endpoint was stressed by wrk for 30 seconds per concurrency level. The concurrency levels range from 1 to 128 concurrent connections. Note that this benchmark is too short to give us an accurate number, but it will clearly show order-of-magnitude differences, which is what we are looking for. From the graph we can tell that increasing the number of threads in our application server enables it to handle more concurrent requests.

The next section will show, however, that running requests in threads is not the final solution.
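The thread-based approach can be sketched in a few lines of Python. This is not how Passenger or Puma are implemented; it is a minimal illustration in which `handle_request`, `serve` and the 50 ms delay are assumed names and values, and `time.sleep` stands in for a blocking database or network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id, io_delay=0.05):
    # Ordinary blocking code: the thread sits idle here, but other
    # threads can be scheduled on the CPU in the meantime.
    time.sleep(io_delay)  # stand-in for blocking I/O
    return request_id


def serve(request_count, thread_count):
    """Handle `request_count` requests on a pool of `thread_count` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        results = list(pool.map(handle_request, range(request_count)))
    return results, time.perf_counter() - start
```

Comparing `serve(8, 1)` with `serve(8, 8)` should show the same results in noticeably less time for the second call, since the eight waits overlap instead of running back to back. Unlike the event loop approach, the request handling code did not have to be rewritten in an asynchronous style.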

The CPU bottleneck


I/O is not the only slow action a web application might perform. Complex calculations, view rendering, and input parsing are examples of procedures that might not involve I/O but can still keep a CPU core busy for an extended amount of time.

Modern CPUs are equipped with many execution cores, designed to run these time-consuming operations in parallel. If the application server could schedule each request-handling thread on an idle CPU core, it would make optimal use of these execution cores.

Unfortunately there is a problem that prevents our applications from running threads in parallel. To guard against subtle but dangerous concurrency problems, most scripting language runtimes deliberately prevent threads from running in parallel; MRI and CPython, for example, do this with a global interpreter lock. These runtimes include MRI, the standard Ruby interpreter; CPython, the standard Python interpreter; and the standard Node.js runtime.

Since running requests in separate threads does not adequately deal with the CPU bottleneck, application servers such as Passenger and Unicorn manage multiple worker processes and distribute requests over them. In this configuration the CPU bottleneck is resolved, as you can see in the graph below, in which the request throughput of an example application with a CPU bottleneck is benchmarked.

In this benchmark an application endpoint was once again stressed by wrk for 30 seconds per concurrency level. The concurrency levels range from 1 to 128 concurrent connections. From the graph we can tell that increasing the number of threads in our application server does not improve performance; it can actually diminish it slightly. Increasing the number of processes, however, does significantly improve request throughput.
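The difference between threads and processes for CPU-bound work can be tried out with a small Python sketch. The names below are illustrative, `count_primes` is just an arbitrary CPU-heavy stand-in for view rendering or parsing, and the "fork" start method is an assumption that restricts the example to Unix-like systems.

```python
import multiprocessing
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound stand-in for request handling: count the primes below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def thread_workers(count):
    # Threads share one interpreter, so on CPython with a global
    # interpreter lock only one of them runs Python code at a time.
    return ThreadPoolExecutor(max_workers=count)


def process_workers(count):
    # Each process has its own interpreter and can occupy its own core.
    # Assumption: a Unix-like system where the "fork" start method exists.
    context = multiprocessing.get_context("fork")
    return ProcessPoolExecutor(max_workers=count, mp_context=context)


def serve(executor, limits):
    """Run one `count_primes` job per entry in `limits` and time the batch."""
    start = time.perf_counter()
    with executor:
        results = list(executor.map(count_primes, limits))
    return results, time.perf_counter() - start
```

Both `serve(thread_workers(4), limits)` and `serve(process_workers(4), limits)` return the same results. On a multi-core machine, and for limits large enough to outweigh the cost of starting processes, the process variant would be expected to finish sooner, which mirrors the benchmark described above; the exact timings depend on the hardware and the interpreter build.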

In the next section we will show that there is a third bottleneck that is not dealt with in the previous sections.

The memory bottleneck


When a new process is spawned, the interpreter, standard library and all dependencies of your application are loaded into memory. Especially for large applications with many responsibilities, the amount of memory used can be quite significant, even with the low price of memory today.

Threads inside a single process all have access to the same memory, so they can be efficient with regard to memory use. Forking application servers such as Passenger and Unicorn can save some memory over plainly spawning multiple processes by loading as many dependencies as possible first, and then using the fork system call to spawn each worker process. This technique enables the spawned process to refer to the original process's memory as copy-on-write: pages are shared until one of the processes writes to them. Real-world usage has shown that memory savings of 20% to 25% can realistically be expected.
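The preload-then-fork technique described above can be sketched as follows. This is an illustration for Unix-like systems, not the way Passenger or Unicorn are implemented; `preload_application`, `fork_worker` and `spawn_workers` are hypothetical names, and the small dictionary stands in for a fully loaded application.

```python
import os


def preload_application():
    # Stand-in for loading the framework, libraries and application code.
    # The pid records which process did the expensive loading.
    return {"loaded_in_pid": os.getpid(), "routes": ["/", "/about", "/contact"]}


def fork_worker(app_state):
    """Fork one worker; return (worker_pid, pid_that_loaded_the_app, route_count)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: app_state is already available, backed by the parent's
        # memory pages, which the kernel shares copy-on-write.
        try:
            os.close(read_fd)
            report = f"{os.getpid()} {app_state['loaded_in_pid']} {len(app_state['routes'])}"
            os.write(write_fd, report.encode())
            os.close(write_fd)
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: read the worker's report and reap the process.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        fields = pipe.read().decode().split()
    os.waitpid(pid, 0)
    return tuple(int(field) for field in fields)


def spawn_workers(count):
    app_state = preload_application()  # loaded once, before any fork
    return [fork_worker(app_state) for _ in range(count)]
```

Every worker reports a different pid of its own but the same `loaded_in_pid`, showing that the application was loaded once in the parent and inherited by each child. For brevity the workers here exit immediately; a real application server keeps them alive to handle requests.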

Given these characteristics, we can conclude that to make optimal use of all our I/O, CPU and memory resources we have to balance the use of both threads and processes. Making this balancing act simple and effective is one of the core value propositions of Passenger Enterprise. It can dispatch requests to multiple threads and spawn multiple processes with memory savings through copy-on-write. All of this can be set up and balanced just by tweaking a few configuration variables. You can see how a well-configured Passenger Enterprise instance can make better use of the hardware in the graph below.

In this benchmark an application endpoint was once again stressed by wrk for 30 seconds per concurrency level. The concurrency levels range from 1 to 128 concurrent connections. To simulate a memory shortage we have limited the number of processes to 16. From the chart we can tell that adding multiple threads to every process results in a significant performance improvement.
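The balancing act can be expressed as simple arithmetic. The helper below is a hypothetical back-of-the-envelope heuristic, not Passenger's algorithm or a recommendation from its documentation: it caps the number of processes by both CPU cores and available memory, then adds threads per process until a target concurrency is reached. All parameter names and numbers are assumptions chosen for illustration.

```python
import math


def plan_workers(cpu_cores, memory_budget_mb, memory_per_process_mb, target_concurrency):
    """Return a (processes, threads_per_process) pair for the given constraints."""
    # Memory limits how many processes fit; cores limit how many can run in parallel.
    max_by_memory = memory_budget_mb // memory_per_process_mb
    processes = max(1, min(cpu_cores, max_by_memory))
    # Threads make up the remaining concurrency, mainly to overlap I/O waits.
    threads_per_process = max(1, math.ceil(target_concurrency / processes))
    return processes, threads_per_process
```

For example, with 32 cores, a 4096 MB budget, 256 MB per process and 128 concurrent connections, memory allows only 16 processes, so the helper suggests 16 processes with 8 threads each. Real values should come from measuring the application under load.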

Criteria for choosing an application server

As we said at the beginning of this article, resource optimisation is not the only feature of the application server. In most cases it is not the most important criterion on which to base your choice of application server either, as other qualities such as robustness, reliability and security matter more to your business than the money lost on less efficiently used hardware.

Phusion's Passenger application server aims to have top scores in each aspect. Passenger has been under active development for over eight years. In addition, Phusion offers premium support from the engineers who built the application server, to ensure your business has a great knowledge base to fall back on when deploying with Passenger.

Conclusion

In this article we described the basic functionality of an application server, and what other roles it can fulfill to make life easier for developers and system administrators. After that we dove deep into one specific role: optimizing resource usage by dispatching requests to concurrent workers. We explained what different strategies there are and how they deal with the I/O, CPU and memory bottlenecks. Finally we underlined that resource usage optimisation is most often not the most important feature of an application server, as the largest business value lies in other criteria such as robustness, reliability and security.

We hope you have enjoyed this article and come away with a deeper understanding of application server concepts. If you have any feedback or questions, please let us know in the comments below or on Hacker News or Reddit.
