
Tuesday, December 29, 2015

Issues with running as PID 1 in a Docker container.

We are getting close to the end of this initial series of posts on getting IPython to work with Docker and OpenShift. In the last post we finally got everything working in plain Docker when a random user ID was used and consequently also under OpenShift.

Although we covered various issues and had to make changes to the existing ‘Dockerfile’ used with the ‘jupyter/notebook’ image to get it all working correctly, there was one issue that the Docker image for ‘jupyter/notebook’ had already addressed which needs a bit of explanation. This related to the existing ‘ENTRYPOINT’ statement used in the ‘Dockerfile’ for ‘jupyter/notebook’.

ENTRYPOINT ["tini", "--"]
CMD ["jupyter", "notebook"]

Specifically, the ‘Dockerfile’ was wrapping the running of the ‘jupyter notebook’ command with the ‘tini’ command.

Orphaned child processes

For a broader discussion on the problem that the use of ‘tini’ is trying to solve you can read the post ‘Docker and the PID 1 zombie reaping problem’.

In short though, process ID 1, which is normally the UNIX ‘init’ process, has a special role in the operating system. When the parent of a process exits before its child processes do, those child processes become orphans and have their parent process remapped to be process ID 1. When those orphaned processes then finally exit and their exit status is available, it is the job of the process with process ID 1 to acknowledge the exit of the child processes so that their process state can be correctly cleaned up and removed from the system kernel process table.

If this cleanup of orphaned processes does not occur, then the system kernel process table will over time fill up with entries corresponding to the orphaned processes which have exited. Any processes which persist in the system kernel process table in this way are what are called zombie processes. They will remain there so long as no process performs the equivalent of a system ‘waitpid()’ call on that specific process to retrieve its exit status and so acknowledge that the process has terminated.
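To make this concrete, the following small script, runnable on any UNIX like system, first creates a zombie process and then reaps it. While the script is sleeping you can check ‘ps’ and see the exited child sitting there with a status of ‘Z’.

import os
import time

# Fork a child process which exits straight away. Until the parent calls
# waitpid() the exited child remains in the kernel process table as a
# zombie, showing up in 'ps' or 'top' with a status of 'Z' or '<defunct>'.

pid = os.fork()

if pid == 0:
    os._exit(0)

time.sleep(30)

# Reap the child, acknowledging its exit. The zombie entry for it is now
# removed from the kernel process table.

pid, status = os.waitpid(pid, 0)
print('reaped child %d with status %d' % (pid, status))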

Process ID 1 under Docker

Now you may be thinking, what does this have to do with Docker? After all, aren’t processes running in a Docker container just ordinary processes in the operating system, simply walled off from the rest of the operating system?

This is true. If you were to run a Docker container which executed a simple single process Python web server, and then looked at the process tree on the Docker host using ‘top’, you would see:

[Screenshot: ‘top’ output on the Docker host, showing the idle wsgiref server process]

Process ID ‘26196’ here actually corresponds to the process created from the command that we used as the ‘CMD’ in the ‘Dockerfile’ for the Docker image.

Our process isn’t therefore running as process ID 1, so why is the way that orphaned processes are handled even an issue?

The reason is that if we were to instead look at what processes are running inside of our container, we can only see those which are actually started within the context of the container.

Further, rather than those processes using the same process ID as they are really running as when viewed from outside of the container, the process IDs have been remapped. In particular, processes created inside of the container, when viewed from within the container, have process IDs starting at 1.

[Screenshot: ‘top’ output inside the Docker container, showing the wsgiref server process as process ID 1]

Thus the very first process created due to the execution of what is given by ‘CMD’ will be identified as having process ID 1. This process is still though the same as identified by process ID ‘26196’ when viewed from the Docker host.

More importantly, what you cannot see from inside of the container is the original process with process ID ‘1’ outside of the container. That is, you cannot see the system wide ‘init’ process.

Logically it isn’t therefore possible to reparent an orphaned process created within the container to a process not even visible inside of the container. As such, orphaned processes are reparented to the process with process ID of ‘1’ within the container. The obligation of reaping the resulting zombie processes therefore falls to this process and not the system wide ‘init’ process.

Testing for process reaping

In order to delve more into this issue, and in particular its relevance when running a Python web server, as a next step let’s create a simple Python WSGI application which can be used to trigger orphan processes. Initially we will use the WSGI server implemented by the ‘wsgiref’ module in the Python standard library, but we can also run it up with other WSGI servers to see how they behave as well.

from __future__ import print_function

import os

def orphan():
    # The intermediate child never waits on this process, so it is orphaned
    # and reparented to process ID 1. Once it exits it remains a zombie
    # until process ID 1 reaps it.
    print('orphan: %d' % os.getpid())
    os._exit(0)

def child():
    # Fork a further child but never wait on it, then exit.
    print('child: %d' % os.getpid())
    newpid = os.fork()
    if newpid == 0:
        orphan()
    else:
        pids = (os.getpid(), newpid)
        print("child: %d, orphan: %d" % pids)
        os._exit(0)

def parent():
    # The web application process forks a child and waits on it.
    newpid = os.fork()
    if newpid == 0:
        child()
    else:
        pids = (os.getpid(), newpid)
        print("parent: %d, child: %d" % pids)
        os.waitpid(newpid, 0)

def application(environ, start_response):
    status = '200 OK'
    output = b'Hello World!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    parent()
    return [output]

from wsgiref.simple_server import make_server

httpd = make_server('', 8000, application)
httpd.serve_forever()

The way the test runs is that each time a web request is received, the web application process will fork twice. The web application process itself will be made to wait on the exit of the child process it created. That child process though will not wait on the further child process it had created, thus creating an orphaned process as a result.
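The ‘Dockerfile’ used for the test image isn’t otherwise important, but as a rough sketch it need be nothing more than the following, where the test code above is assumed to have been saved as ‘server_wsgiref.py’ and the choice of base image is up to you.

FROM python:2.7
WORKDIR /app
COPY server_wsgiref.py /app/
EXPOSE 8000

# Exec form of CMD, so the Python process itself ends up as process ID 1
# inside of the container.
CMD [ "python", "server_wsgiref.py" ]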

Building this test application into a Docker image, with no ‘ENTRYPOINT’ defined and only a ‘CMD’ which runs the Python test application, and then hitting it with half a dozen requests, what we then see from inside of the Docker container is:

[Screenshot: ‘top’ output inside the Docker container after multiple requests, showing accumulated zombie child processes]

For a WSGI server implemented using the ‘wsgiref’ module from the Python standard library, this indicates that no reaping of the zombie process is occurring. Specifically, you can see how our web application process running as process ID ‘1’ now has various child processes associated with it where the status of each process is ‘Z’ indicating it is a zombie process waiting to be reaped. Even if we wait some time, these zombie processes never go away.

If we look at the processes from the Docker host we see the same thing.

[Screenshot: ‘top’ output on the Docker host after multiple requests, showing the same zombie processes]

This therefore confirms what was described, which is that the orphaned processes will be reparented against what is process ID ‘1’ within the container, rather than what is process ID ‘1’ outside of the container.

One thing that is hopefully obvious is that a WSGI server based off the ‘wsgiref’ module sample server in the Python standard library doesn’t do the right thing, and running it as the initial process in a Docker container would not be recommended.

Behaviour of WSGI servers

If a WSGI server based on the ‘wsgiref’ module sample server isn’t okay, what about other WSGI servers? Also, what about async web servers for Python such as Tornado?

The outcome from running the test WSGI application on the most commonly used WSGI servers, along with equivalent tests specifically for the Tornado async web server and the Django and Flask builtin servers, yields the following results.

  • django (runserver) - FAIL
  • flask (builtin) - FAIL
  • gunicorn - PASS
  • Apache/mod_wsgi - PASS
  • tornado (async) - FAIL
  • tornado (wsgi) - FAIL
  • uWSGI - FAIL
  • uWSGI (master) - PASS
  • waitress - FAIL
  • wsgiref - FAIL

The general result here is that any Python web server that runs as a single process would usually not do what is required of a process running as process ID ‘1’. This is because they aren’t in any way designed to manage child processes. As a result, there isn’t even the chance that they may look for exiting zombie processes and reap them.

Of note though, uWSGI when used with its default options, although it can run in a multi process configuration, has a process management model which is arguably broken. The philosophy with uWSGI though is seemingly to never correct what it gets wrong, but to instead add an option which enables the correct behaviour. Thus users have to opt into the correct or better behaviour. For the case of uWSGI, the more robust process management model is only enabled by using the ‘--master’ option. If using uWSGI you should always use that option, regardless of whether you are running it in Docker or not.
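For example, if running uWSGI directly against a WSGI module, that would mean running it with something like the following, where the module and socket options are just placeholders for whatever your own setup uses.

uwsgi --master --http 127.0.0.1:8000 --module app:application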

Both uWSGI in master mode and mod_wsgi, although they pass and will reap zombie processes when run as process ID ‘1’, work in a way that can be surprising.

The issue with uWSGI in master mode and mod_wsgi is that each only looks for exiting child processes on a periodic basis. That is, they will wake up about once a second and look for any child processes that have exited, collecting their exit status and so causing any zombie processes to be reaped.

This means that during the one second interval, some number of zombie processes still could accumulate, the number depending on request throughput and how often a specific request does something that would trigger the creation of a zombie process. The number of zombie processes will therefore build up and then be brought back to zero each second.
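A rough sketch of that style of periodic reaping, as it might be written in Python, is shown below. This is purely illustrative and is not the actual code used by either uWSGI or mod_wsgi.

import errno
import os
import time

def reap_children():
    # Collect the exit status of any child processes which have already
    # exited, without blocking if none have.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as exc:
            if exc.errno == errno.ECHILD:
                break    # no child processes exist at all
            raise
        if pid == 0:
            break        # children exist, but none have exited yet

while True:
    # The master process wakes up about once a second, reaps any zombies
    # and then goes back to managing its worker processes.
    reap_children()
    time.sleep(1.0)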

Although this occurs for uWSGI in master mode and mod_wsgi, it shouldn’t in general cause an issue as no other significant code runs in the parent or master process which is managing all the child processes. Thus the presence of the zombie process as a child for a period will not cause any confusion. Further, zombie processes should still be reaped at an adequate rate, so temporary increases shouldn’t matter.

Problems which can arise

As to what problems can actually arise due to this issue, there are a few at least.

The first is that if the process running as process ID ‘1’ does not reap zombie processes, then they will accumulate over time. If the container is for a long running service, then eventually the available slots in the system kernel process table could be used up. If this were to occur, the system as a whole would be unable to create any new processes.

How this plays out in practice within a Docker container I am not sure. If it were the case that the upper bound on the number of such zombie processes that could be created within a Docker container were bounded by the system kernel process table size, then technically the creation of zombie processes could be used as an attack vector against the Docker host. I sort of expect therefore that Docker containers likely have some lower limit on the number of processes that can be created within the container, although things get complicated if a specific user has multiple containers. Hopefully someone can clarify this specific point for me.

The second issue is that the reparenting of processes against the application process running as process ID ‘1’ could confuse any process management mechanism running within that process. This could cause issues in a couple of ways.

For example, if the application process were using the ‘wait()’ system call to wait for any child process exiting, but the reported process ID wasn’t one that it was expecting and it didn’t handle that gracefully, it could cause the application process to fail in some way. Especially in the case where the ‘wait()’ call indicated that an exiting zombie process had a non zero status, it may cause the application process to think its directly managed child processes were having problems and failing in some way. Alternatively, if the orphaned processes weren't themselves exiting straight away, and their new parent process operated by monitoring the set of child processes it had, then the unexpected extra children could themselves confuse the parent process.

Finally getting back to the IPython example we have been working with, it has been found that when running the ‘jupyter notebook’ application as process ID ‘1’, it fails to properly start up the kernel processes for running individual notebook instances. The logged messages in this case are:

[I 10:19:33.566 NotebookApp] Kernel started: 1ac58cd9-c717-44ef-b0bd-80a377177918
[I 10:19:36.566 NotebookApp] KernelRestarter: restarting kernel (1/5)
[I 10:19:39.573 NotebookApp] KernelRestarter: restarting kernel (2/5)
[I 10:19:42.582 NotebookApp] KernelRestarter: restarting kernel (3/5)
[W 10:19:43.578 NotebookApp] Timeout waiting for kernel_info reply from 1ac58cd9-c717-44ef-b0bd-80a377177918
[I 10:19:45.589 NotebookApp] KernelRestarter: restarting kernel (4/5)
WARNING:root:kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 restarted
[W 10:19:48.596 NotebookApp] KernelRestarter: restart failed
[W 10:19:48.597 NotebookApp] Kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 died, removing from map.
ERROR:root:kernel 1ac58cd9-c717-44ef-b0bd-80a377177918 restarted failed!
[W 10:19:48.610 NotebookApp] Kernel deleted before session

I have been unable to find anyone who has worked out the specific cause, but I suspect it is falling foul of the second issue above. That is, the exit statuses from those orphaned processes are confusing the code managing the startup of the kernel processes, making it think the kernel processes are in fact failing, causing it to attempt to restart them repeatedly.

Whatever the specific reason, not running the ‘jupyter notebook’ as process ID ‘1’ avoids the problem, so it does at least appear to be related to the orphaned processes being reparented against the main ‘jupyter notebook’ process.

Now although for IPython it seems to relate to the second issue whereby process management mechanisms are failing, as shown above, even generic Python WSGI servers or web servers don’t necessarily do the right thing either. So even though they might not have process management issues, since they don’t perform any such management of processes for implementing a multi process configuration for the server itself, the accumulation of zombie processes could still eventually cause the maximum number of allowed processes to be exceeded.

Shell as parent process

Ultimately the solution is to not run, as process ID ‘1’ inside of the container, any application process which wasn’t also designed to perform reaping of child processes.

There are two ways to avoid this. The first is a quick hack and one which is often seen used in Docker containers, although perhaps not intentionally. Although it avoids the zombie reaping problem, it causes its own issues.

The second way is to run as process ID ‘1’ a minimal process whose only role is to execute as a child process the real application process and then subsequently reap the zombie processes.

This minimal init process of the second approach has one other important role as well though and it is this role where the quick hack solution fails.

As to the quick or inadvertent hack that some rely on, let’s look at how a ‘CMD’ in a ‘Dockerfile’ is specified.

The recommended way of using ‘CMD’ in a ‘Dockerfile’ would be to write:

CMD [ "python", "server_wsgiref.py" ]

This is what was used above, where from inside of the Docker container we saw:

[Screenshot: ‘top’ output inside the Docker container, with the application running as process ID 1]

As has already been explained, this results in our application running as process ID ‘1’.

Another way of using ‘CMD’ in a ‘Dockerfile’ is to write:

CMD python server_wsgiref.py

Our application still runs, but this isn’t doing the same thing as when we supplied a list of arguments to ‘CMD’.

The result in this case is:

[Screenshot: ‘top’ output inside the Docker container, with ‘/bin/sh’ as process ID 1 and the application as a child process]

With this way of specifying the ‘CMD’ our application is no longer running as process ID ‘1’. Instead process ID ‘1’ is occupied by an instance of ‘/bin/sh’.

This has occurred because supplying the plain command line to ‘CMD’ actually results in the equivalent of:

CMD [ "sh", "-c", "python server_wsgiref.py" ]

This is the reason a shell process is introduced into the process hierarchy as process ID ‘1’.

With our application now no longer running as process ID ‘1’, the responsibility of reaping zombie processes falls instead to the instance of ‘/bin/sh’ running as process ID ‘1’.

As it turns out, ‘/bin/sh’ will reap any child processes associated with it, so we do not have the problem of zombie processes accumulating.

Now this isn’t the only way you might end up with an instance of ‘/bin/sh’ being process ID ‘1’.

Another common scenario where this ends up occurring is where someone using Docker uses a shell script with the ‘CMD’ statement so that they can do special setup prior to actually running their application. You thus can often find something like:

CMD [ "/app/start.sh" ]

The contents of the ’start.sh’ script might then be:

#!/bin/sh
python server_wsgiref.py

Using this approach, what we end up with is:

[Screenshot: ‘top’ output inside the Docker container, with the ‘start.sh’ shell script as process ID 1]

Our script is listed as process ID ‘1’, although it is in reality still an instance of ‘/bin/sh’.

The reason our application didn’t end up as process ID ‘1’ in this case is that the final line of the script simply said ‘python server_wsgiref.py’.

Whenever using a shell script as a ‘CMD’ like this, you should always ensure that when running your actual application from the shell script, that you do so using ‘exec’. That is:

#!/bin/sh
exec python server_wsgiref.py

By using ‘exec’ you ensure that your application process takes over and replaces the script process, thus resulting in it running as process ID ‘1’.

But wait, if having process ID ‘1’ be an instance of ‘/bin/sh’, with our application being a child process of it, solves the zombie reaping problem, why not always do that then?

The reason for this is that although ‘/bin/sh’ will reap zombie processes for us, it will not propagate signals properly.

For our example, what this means is that with ‘/bin/sh’ as process ID ‘1’, if we were to use the command ‘docker stop’, the application process will not actually shut down. Instead the default timeout for ‘docker stop’ will expire and it will then do the equivalent of ‘docker kill’, which will force kill the application and the container.

This occurs because although the instance of ‘/bin/sh’ will receive the signal to terminate the application which is sent by ‘docker stop’, it ignores it and doesn’t pass it on to the actual application.

This in turn means that your application is denied the ability to be notified properly that the container is being shutdown and so ensure that it performs any required finalisation of in progress operations. For some applications, this lack of an ability to perform a clean shutdown could leave any persistent data in an inconsistent state, causing problems when the application is restarted.

It is therefore important that signals always be received by the main application process in a Docker container, but an intermediary shell process will not ensure that.

One can attempt to catch signals in the shell script and forward them on, but this does get a bit tricky, as you also have to ensure that you wait for the wrapped application process to shut down properly when it is passed a signal that would cause it to exit. As I have previously shown in an earlier post for other reasons, in such circumstances you might be able to use a shell script like:

#!/bin/sh

# Forward any TERM or INT signal on to the wrapped application process.
trap 'kill -TERM $PID' TERM INT

python server_wsgiref.py &
PID=$!

# The first wait returns either when the application exits, or early when
# a trapped signal is received. Waiting a second time after resetting the
# traps collects the real exit status of the application.
wait $PID
trap - TERM INT
wait $PID
STATUS=$?
exit $STATUS

To be frank though, rather than hoping this will work reliably, you are better off using a purpose built monitoring process for this particular task.

Minimal init process

Coming from the Python world, one solution that Python developers like to use for managing processes is ‘supervisord’. This should work, but is a relatively heavyweight solution. At this time, ‘supervisord’ is also still only usable with Python 2. If you were wanting to run an application using Python 3, this means you wouldn’t be able to use it, unless you were okay with having to also add Python 2 to your image, resulting in a much fatter Docker image.

The folks at Phusion in that blog post I referenced earlier do provide a minimal ‘init’ like process which is implemented as a Python script, but if not using Python at all in your image, that means pulling in Python 2 once again when you perhaps don’t want that.

Because of the overheads of bringing in additional packages where you don’t necessarily want them, my preferred solution for a minimal ‘init’ process for handling reaping of zombies and the propagation of signals to the managed process is the ‘tini’ program. This is the same program that the ‘jupyter/notebook’ also makes use of and we saw mentioned in the ‘ENTRYPOINT’ statement of the ‘Dockerfile’.

ENTRYPOINT ["tini", "--"]

All ’tini' does is spawn your application and wait for it to exit, all the while reaping zombies and performing signal forwarding. In other words, it is specifically built for this task, relieving you of worrying about whether your own application is going to do the correct thing in relation to reaping of zombie processes.

Even if you believe your application may handle this task okay, I would still recommend that a tool like ‘tini’ be used as it gives you one less thing to worry about.
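Putting that together with the earlier test image, the ‘Dockerfile’ might then look something like the following. This is only a sketch and assumes ‘tini’ has already been installed into the image and is on the PATH, with how you install it depending on your base image.

FROM python:2.7
WORKDIR /app
COPY server_wsgiref.py /app/
EXPOSE 8000

# 'tini' runs as process ID 1, reaping zombies and forwarding signals on
# to its child process.
ENTRYPOINT ["tini", "--"]

# The application now runs as a child process of 'tini' rather than as
# process ID 1 itself.
CMD [ "python", "server_wsgiref.py" ]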

If you are using a shell script with ‘CMD’ in a ‘Dockerfile’ and subsequently running your application from it, you can still do that, but remember to use ‘exec’ when running your application to ensure that signals will get to your application. Don’t use ‘exec’ and your shell script will still swallow them up.

IPython and cloud services

We are finally done with improving on how IPython can be run with Docker so that it will work with cloud services using Docker. The main issue we faced was the additional security restrictions that can be in place when running Docker images on such cloud services.

In short, running Docker images as ‘root’ is a bad idea. Even if you are running your own Docker service it is something you should avoid if at all possible. Because of the increased risk you can understand why a hosting service is not going to allow you to do it.

With the introduction of user namespace support in Docker the restriction on what user a Docker image can run as should hopefully be able to be relaxed, but in the interim you would be wise to design Docker images so that they can run as an unprivileged user.

Now since there were actually a few things we needed to change to achieve this, and the description of the changes was spread over multiple blog posts, I will summarise the changes in the next post. I will also start to outline what else I believe could be done to make the use of IPython with Docker, and especially cloud services, even better.

Tuesday, May 19, 2015

Returning a string as the iterable from a WSGI application.

The possible performance consequences of returning many separate data blocks from a WSGI application were covered in the previous post. In that post the WSGI application used as an example was one which returned the contents of a file as many small blocks of data. Part of the performance problems seen arose due to how the WSGI servers would flush each individual block of data out, writing it onto the socket connection back to the HTTP client. Flushing the data in this way for many small blocks, as opposed to as few as possible larger blocks had a notable overhead, especially when writing back to an INET socket connection.

In this post I want to investigate an even more severe case of this problem that can occur. For that we need to go back and start over with the more typical way that most Python web frameworks return response data. That is, as a single large string.

def application(environ, start_response):
    status = '200 OK'
    output = 1024*1024*b'X'
    response_headers = [('Content-type', 'text/plain'),
            ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]

In this example I have increased the length of the string to be returned so it is 1MB in size.

If we run this under mod_wsgi-express and use 'curl' then we get an adequately quick response as we would expect.

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.017s
user 0m0.006s
sys 0m0.005s

Response as an iterable

As is known, the response from a WSGI application needs to be an iterable of byte strings. The more typical scenarios are that a WSGI application would return either a list of byte strings, or it would be implemented as a generator function which would yield one or more byte strings.

That the WSGI server will accept any iterable does mean that people new to writing Python WSGI applications often make a simple mistake. That is, instead of returning a list of byte strings, they simply return the byte string itself.

def application(environ, start_response):
    status = '200 OK'
    output = 1024*1024*b'X'
    response_headers = [('Content-type', 'text/plain'),
            ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return output

Because a byte string is also an iterable then the code will still appear to run fine, returning the response expected back to the HTTP client. For a small byte string the time taken to display the page back in the browser would appear to be normal, but for a larger byte string it would become evident that something is wrong.

Running this test with mod_wsgi-express again, we can see the time taken balloon out from 17 milliseconds to over 3 seconds:

$ time curl -s -o /dev/null http://localhost:8000/
real 0m3.659s
user 0m0.294s
sys 0m1.544s

Use gunicorn, and as we saw in the previous post the results are much worse again.

$ time curl -s -o /dev/null http://localhost:8000/
real 0m23.762s
user 0m1.446s
sys 0m7.085s

The reason the results are so bad is that when a string is returned as the iterable, iterating over it yields the content one byte at a time. This means that when the WSGI server is writing the response back to the HTTP client, it is in turn doing so one byte at a time. The overhead of writing one byte at a time to the socket connection dwarfs everything else.
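You can see the scale of the problem from a quick check of what iterating over the byte string actually produces. This is purely illustrative:

output = 1024 * 1024 * b'X'

print(len([output]))       # 1 block of 1MB when returned as [output]
print(len(list(output)))   # 1048576 separate items when returned as output

# Under Python 2 each of those items is a one byte string, so the WSGI
# server ends up performing on the order of a million separate writes.
# Under Python 3 iterating over bytes yields integers, which aren't even
# valid response items for a WSGI server.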

As was covered in the prior post, this is entirely expected behaviour for a WSGI server. The mistake here is entirely in the example code and the code should be corrected.

Acknowledging that this would be a problem with a user’s code, let’s now see what happens when this example is run through uWSGI.

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.019s
user 0m0.006s
sys 0m0.007s

Rather than the larger time we would expect to see, when using uWSGI the time taken is still down at 19 milliseconds.

This may seem to be a good result, but what is happening here is that uWSGI has decided to break with conforming to the WSGI specification and has added special checking to detect this sort of user mistake. Instead of treating the result as an iterable as it is required to by the WSGI specification, it takes the whole string and uses it as is.

You may still be thinking this is great, but it isn't really and will only serve to hide the original mistake, resulting in users writing and shipping code with a latent bug. The problem will only become evident when the WSGI application is run on a different WSGI server which does conform to the WSGI specification.

This could be disastrous if the WSGI application was being shipped out to numerous customers where it was the customer who decided what WSGI server they used.

In addition to there being a problem when run on a conforming WSGI server, there will also be a problem if someone took the WSGI application and wrapped it with a WSGI middleware.

This can be illustrated by now going back and adding the WSGI application timing decorator.

from timer1 import timed_wsgi_application1
@timed_wsgi_application1
def application(environ, start_response):
    status = '200 OK'
    output = 1024*1024*b'X'
    response_headers = [('Content-type', 'text/plain'),
            ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return output

With the decorator, running with uWSGI again we now get:

$ time curl -s -o /dev/null http://localhost:8000/
real 0m13.876s
user 0m0.925s
sys 0m3.407s

So only now do we get the sort of result we were expecting, with it taking 13 seconds.

The reason that the problem only now shows up with uWSGI is because the wrapper which gets added by the WSGI middleware, around the iterable passed back from the WSGI application, causes the uWSGI check for an explicit string value to fail.

This demonstrates how fragile the WSGI application now is. If it is intended to be something that is used by others, you cannot control whether those users may use any WSGI middleware around it. A user might therefore very easily and unknowingly cause problems for themselves, and have no idea why or even that there is a problem.

Application portability

As much as the WSGI specification is derided by many, it can't be denied it eliminated the problem that existed at the time with there being many different incompatible ways to host Python web applications. Adherence to the WSGI specification by both WSGI applications and servers is key to that success. It is therefore very disappointing to see where WSGI servers deviate from the WSGI specification as it is a step away from the goal of application portability.

What you instead end up with is pseudo WSGI applications which are in fact locked in to a specific WSGI server implementation and will not run correctly or perform well on other WSGI servers.

If you are developing Python web applications at the WSGI level rather than using a web framework and you value WSGI application portability, one thing you can do to try and ensure WSGI compliance is to use the WSGI validator found in the 'wsgiref.validate' module of the Python standard library.

def application(environ, start_response):
    status = '200 OK'
    output = 1024*1024*b'X'
    response_headers = [('Content-type', 'text/plain'),
            ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return output

import wsgiref.validate

application = wsgiref.validate.validator(application)

This will perform a range of checks to ensure some basic measure of WSGI compliance and also good practice.

In this particular example of where a string was returned as the WSGI application iterable, the WSGI validator will flag it as a problem, with the error:

AssertionError: You should not return a string as your application iterator, instead return a single-item list containing that string.

So run the validator on your application during development or testing at times and exercise your WSGI application to get good coverage. This will at least help with some of the main issues that may come up, although by no means all given how many odd corner cases that exist within the WSGI specification.

Effects of yielding multiple blocks in a WSGI application response.

In my last post I introduced a Python decorator that can be used for measuring the overall time taken by a WSGI application to process a request and for the response to then be sent on its way back to the client.
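The exact implementation was covered in that previous post and isn’t repeated here, but purely as an illustration of the general idea, a decorator of that kind could look something like the following. The details here are only indicative and not necessarily those of the actual ‘timed_wsgi_application1’ decorator.

import time
from functools import wraps

def timed_wsgi_application1(application):
    @wraps(application)
    def wrapper(environ, start_response):
        start = time.time()
        def _consume():
            try:
                # Pass each block of the response through unchanged so the
                # timing also covers consumption of the iterable by the
                # WSGI server, not just the call into the application.
                for data in application(environ, start_response):
                    yield data
            finally:
                duration = time.time() - start
                print('application time: %.3fms' % (duration * 1000.0))
        return _consume()
    return wrapper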

That prior post showed an example where the complete response was generated and returned as one string. That the response has to be returned as one complete string is not a requirement of the WSGI specification, albeit for many Python web frameworks it is the more typical scenario for responses generated as the output from a page template system.

If needing to return very large responses, generating it and returning it as one complete string wouldn't necessarily be practical. This is because doing so would result in excessive memory usage due to the need to keep the complete response in memory. This problem would be further exacerbated in a multi threaded configuration where concurrent requests could all be trying to return very large responses at the same time.

In this sort of situation it will be necessary to instead return the response a piece at a time, by returning from the WSGI application an iterable that can generate the response as it is being sent.
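In its simplest form that just means having the WSGI application return a generator which yields the response a block at a time. A minimal sketch of the general shape would be:

def application(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    def generate(block_size=8192):
        # Yield the response in blocks rather than building up the
        # complete response in memory first.
        for i in range(128):
            yield block_size * b'X'

    return generate()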

Although this may well not be the primary way in which responses are generated from a WSGI application, it is still an important use case. It is a use case though which I have never seen covered in any of the benchmarks that people like to run when comparing WSGI servers. Instead benchmarks focus only on the case where the complete response is returned in one go as a single string.

In this post therefore I am going to look at this use case where response content is generated as multiple blocks and see how differently configured WSGI servers perform and what is going on under the covers so as to impact the response times as seen.

Returning the contents of a file

The example I am going to use here is returning the contents of a file. There is actually an optional extension that WSGI servers can implement to optimise the case of returning a file, but I am going to bypass that extension at this time and instead handle returning the contents of the file myself.
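The optional extension being referred to there is ‘wsgi.file_wrapper’ from the WSGI specification. For comparison only, an application making use of it when available might look something like:

def application(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    filelike = open('/usr/share/dict/words', 'rb')
    block_size = 8192

    # Use the optimised file serving mechanism of the WSGI server where it
    # provides one, otherwise fall back to reading the file ourselves.
    file_wrapper = environ.get('wsgi.file_wrapper')

    if file_wrapper is not None:
        return file_wrapper(filelike, block_size)

    return iter(lambda: filelike.read(block_size), b'')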

The WSGI application being used in this case is:

from timer1 import timed_wsgi_application1
@timed_wsgi_application1
def application(environ, start_response):
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)

    def file_wrapper(filelike, block_size=8192):
        try:
            data = filelike.read(block_size)
            while data:
                yield data
                data = filelike.read(block_size)
        finally:
            try:
                filelike.close()
            except Exception:
                pass

    return file_wrapper(open('/usr/share/dict/words', 'rb'), 128)

On MacOS X the size of the '/usr/share/dict/words' file is about 2.5MB. In this example we are going to return the data in 128 byte blocks so as to better highlight the impacts of many separate blocks being returned.

Running this example with the three most popular WSGI servers we get from a typical run:

  • gunicorn app:application # 714.012ms
  • mod_wsgi-express start-server app.py # 159.944ms
  • uwsgi --http 127.0.0.1:8000 --module app:application  # 388.556ms

In all configurations only the WSGI server has been used, no front ends, and with each accepting requests directly via HTTP on port 8000.

What is notable from this test is the widely differing times taken by each of the WSGI servers to deliver up the same response. It highlights why one cannot rely purely on simple 'Hello World!' benchmarks. Instead you have to be cognisant of how your WSGI application delivers up its responses.

In this case if your WSGI application had a heavy requirement for delivering up large responses broken up into many separate chunks, which WSGI server you use and how you have it configured may be significant.

Flushing of data blocks

Having presented these results, let’s now delve deeper into the possible reasons for the large disparity between the different WSGI servers.

The first step in working out why there may be a difference is to understand what is actually happening when you return an iterable from a WSGI application which yields more than one data block.

The relevant part of the WSGI specification is the section on buffering and streaming. In this section it states:

WSGI servers, gateways, and middleware must not delay the transmission of any block; they must either fully transmit the block to the client, or guarantee that they will continue transmission even while the application is producing its next block. A server/gateway or middleware may provide this guarantee in one of three ways:

1. Send the entire block to the operating system (and request that any O/S buffers be flushed) before returning control to the application, OR
2. Use a different thread to ensure that the block continues to be transmitted while the application produces the next block.
3. (Middleware only) send the entire block to its parent gateway/server.

In simple terms this means that a WSGI server is not allowed to buffer the response content and must ensure that it will actually be sent back to the HTTP client immediately or at least in parallel to fetching the next data block to be sent.

In general WSGI servers adopt option (1) and will immediately send any response content onto the socket connection back to the HTTP client, blocking until the operating system has accepted the complete data and will ensure it is sent. All three WSGI servers tested above are implemented using option (1).

For the case of there being many blocks of data, especially smaller blocks of data, there thus can be a considerable amount of overhead in having to write out the data to the socket all the time.

In this case though the same example was used for all WSGI servers, thus the number and size of the blocks was always the same and didn't differ. There must be more to the difference than just this if all WSGI servers are writing data out to the socket immediately.

As the WSGI servers are all implementing option (1), the only other apparent difference would be the overhead of the code implementing the WSGI servers themselves.

For example, gunicorn is implemented in pure Python code and so as a result could show greater overhead than mod_wsgi and uWSGI which are both implemented in C code.

Are there though other considerations, especially since both mod_wsgi-express and uWSGI are implemented as C code yet still showed quite different results?

INET versus UNIX sockets

In all the above cases the WSGI servers were configured to accept connections directly via HTTP over an INET socket connection.

For the case of mod_wsgi-express though there is a slight difference. This is because mod_wsgi-express will run up mod_wsgi under Apache in daemon mode. That is, the WSGI application will not actually be running inside of the main Apache child worker processes which are actually handling the INET socket connection back to the HTTP client.

Instead the WSGI application will be running in a separate daemon process run up by mod_wsgi and to which the Apache child worker processes are communicating as a proxy via a UNIX socket connection.

To explore whether this may account for why mod_wsgi-express shows a markedly better response time, what we can do is run mod_wsgi-express in debug mode. This is a special mode which forces Apache and mod_wsgi to run as one process, rather than the normal situation where there is an Apache parent process, Apache child worker processes and the mod_wsgi daemon processes.

This debug mode is normally used when wishing to be able to interact with the WSGI application running under mod_wsgi, such as if using the Python debugger pdb or some other interactive debugging tool which exposes a console prompt direct from the WSGI application process.

The side effect of using debug mode though is that the WSGI application is effectively running in a similar way to mod_wsgi embedded mode, meaning that when writing back a response, the data blocks will be written direct onto the INET socket connection back to the HTTP client.

Running with this configuration we get:

  • mod_wsgi-express start-server --debug-mode app.py # 470.487ms

As it turns out there is a difference between daemon mode and embedded mode of mod_wsgi, so now let’s also consider uWSGI.

Although uWSGI is being used to accept HTTP connections directly over an INET connection, the more typical arrangement for uWSGI is to use it behind nginx. Obviously using nginx as a HTTP proxy isn’t really going to help, as one would see similar results as shown, but uWSGI also supports its own internal wire protocol for talking to nginx called ‘uwsgi’, so let’s try that instead and see if it makes a difference.

In using the 'uwsgi' wire protocol though, we still have two possible choices we can make for configuring it. The first is that an INET socket connection is used between nginx and uWSGI and the second is to use a UNIX socket connection instead.

  • uwsgi --socket 127.0.0.1:9000 --module app:application # 284.802ms
  • uwsgi --socket /tmp/uwsgi.sock --module app12:application # 143.614ms

From this test we see two things.

The first is that even when using an INET socket connection between nginx and uWSGI, the time spent in the WSGI application is improved. This is most likely because of the more efficient 'uwsgi' wire protocol being used in place of the HTTP protocol. The uWSGI process is thus able to offload the response more quickly.

The second is that switching to a UNIX socket connection reduces the time spent in the WSGI application even more due to the lower overheads of writing to a UNIX socket connection compared to an INET socket connection.

Although the time spent in the WSGI application is reduced in both cases, it is vitally important to understand that this need not translate into an overall reduced response time to the same degree as seen by the HTTP client.

This applies equally to mod_wsgi when run in daemon mode. In both the case of mod_wsgi in daemon mode and uWSGI behind nginx the front end process is allowing the backend process running the WSGI application to more quickly offload the response only. It doesn't eliminate the fact that the front end represents an extra hop in the communications with the HTTP client.

In other words, where time is being spent has merely been shifted out of the WSGI application and into the front end proxy.

This doesn't mean that the effort isn't entirely wasted though. This is because WSGI applications have a constrained set of resources in the form of processes/threads for handling web requests. Thus the quicker you can offload the response from the WSGI application, the quicker the process or thread is freed up to be able to handle the next request.

Use of a front end proxy as exists with mod_wsgi in daemon mode, or where uWSGI is run behind nginx, actually allows both WSGI servers to perform more efficiently and so they can handle a greater load than they would otherwise be able to if they were dealing direct with HTTP clients.

Serving up of static files

Although we can use the WSGI application code used for this test to serve up static files, in general, serving up static files from a WSGI application is a bad idea. This is because the overheads will still be significantly more than serving up the static files from a proper web server.

To illustrate the difference, we can make use of the fact that mod_wsgi-express is actually Apache running mod_wsgi and have Apache serve up our file instead. We can do this using the command:

mod_wsgi-express start-server app.py --document-root /usr/share/dict/

What will happen is that if the URL maps to a physical file in '/usr/share/dict', then it will be served up directly by Apache. If the URL doesn't map to a file, then the request will fall through to the WSGI application, which will serve it up as before.

As we can't readily time in Apache how long a static file request takes to sufficient resolution, we will simply time the result of using 'curl' to make the request.

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.161s
user 0m0.018s
sys 0m0.062s
$ time curl -s -o /dev/null http://localhost:8000/words
real 0m0.013s
user 0m0.005s
sys 0m0.005s

Whereas it took 161ms to serve up the file via the WSGI application, it took only 13ms when served as a static file.

The uWSGI WSGI server has a similar option for overlaying static files on top of a WSGI application.

uwsgi --http 127.0.0.1:8000 --module app:application --static-check /usr/share/dict/

Comparing the two methods using uWSGI we get:

$ time curl -s -o /dev/null http://localhost:8000/
real 0m0.381s
user 0m0.029s
sys 0m0.092s
$ time curl -s -o /dev/null http://localhost:8000/words
real 0m0.025s
user 0m0.006s
sys 0m0.009s

As with mod_wsgi-express, one sees a similar level of improvement.

If using nginx in front of uWSGI, you could even go one step further and offload the serving up of static files to nginx with a likely further improvement due to the elimination of one extra hop and nginx's known reputation for being a high performance web server.

Using uWSGI's ability to serve static files is still though a reasonable solution where it would be difficult or impossible to install nginx, such as on a PaaS.

The static file serving capabilities of mod_wsgi-express and uWSGI would in general certainly be better than pure Python options for serving static files, although how much better will depend on whether such Python based solutions make use of WSGI server extensions for serving static files in a performant way. Such extensions and how they work will be considered in a future post. 

What is to be learned

The key takeaways from the analysis in this post are:

  1. A pure Python WSGI server such as gunicorn will not perform as well as C based WSGI servers such as mod_wsgi and uWSGI when a WSGI application is streaming response content as many separate data blocks.
  2. Any WSGI server which is interacting directly with an INET socket connection back to a HTTP client will suffer from the overheads of an INET socket connection when the response consists of many separate data blocks. Use of a front end proxy which allows a UNIX socket connection to be used for the proxy connection will improve performance and allow the WSGI server to offload responses quicker, freeing up the worker process or thread sooner to handle a subsequent request.
  3. Serving of static files is better offloaded to a separate web server, or separate features of a WSGI server designed specifically for handling of static files.

Note that in this post I only focused on the three WSGI servers of gunicorn, mod_wsgi and uWSGI. Other WSGI servers do exist and I intend to revisit the Tornado and Twisted WSGI containers and the Waitress WSGI server in future posts.

I am going to deal with those WSGI servers separately as they are all implemented on top of a core which makes use of asynchronous rather than blocking communications. Use of an asynchronous layer has impacts on the ability to properly time how long the Python process is busy handling a specific web request. These WSGI servers also have other gotchas related to their use due to the asynchronous layer. They thus require special treatment.