[Web-SIG] ngx.poll extension (was Re: Are you going to convert Pylons code into Python 3000?)

Fri Mar 7 10:16:39 CET 2008

Graham Dumpleton wrote:
> On 06/03/2008, Manlio Perillo <manlio_perillo at libero.it> wrote:
>>  But I have to say that:
>>
>>  1) the asynchronous model is the "right" model to use to develop
>>     robust and scalable applications (especially in Python).
> 
> No it isn't. It is one model, it is not necessarily the 'right' model.
> 

Ok.

> The asynchronous model actually has worse drawbacks than the GIL
> problem when multithreading is used and you have a multi core or multi
> process system. This is because in an asynchronous system with only a
> single thread, it is theoretically impossible to use more than one
> processor at a time. 

This is the reason why I'm using Nginx instead of Twisted.

> Even with the Python GIL as a contention point,
> threads in C extension modules can at least release the GIL and
> perform work in parallel and so theoretically the process can consume
> the resources of more than one core or processor at a time.
> 
> The whole nature of web applications where requests perform small
> amounts of work and then complete actually simplifies the use of
> multithreading. 

Yes, this is true most of the time.

But the reason I have finally added the poll extension in my WSGI 
implementation for Nginx is that I have some requests that *do not* take 
small amounts of work to be served.

Database queries, as an example, are not a problem if executed 
synchronously, since Nginx has multiple worker processes, and the 
environment is "controlled" (that is, I can optimize the query/database, 
the connection is on the localhost or on a LAN, and so on).
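
To illustrate the general idea behind a poll-style extension, here is a
minimal sketch in which a single thread waits on several sockets at once
instead of blocking on any one of them. It uses only the standard library
`selectors` module and is NOT the ngx.poll API, which is not shown in this
message; `serve_many` and the socket pairs are hypothetical stand-ins for
requests that wait on a slow backend.

```python
import selectors
import socket


def serve_many(messages):
    """Answer each (non-empty) message over its own socket pair, from one thread.

    Returns the replies in the order the connections were created.
    """
    sel = selectors.DefaultSelector()
    clients = []
    for msg in messages:
        # Hypothetical stand-in: a socket pair plays the role of one
        # connection whose data arrives at some unknown later time.
        client, server = socket.socketpair()
        server.setblocking(False)
        sel.register(server, selectors.EVENT_READ)
        client.sendall(msg)
        clients.append(client)

    pending = len(messages)
    while pending:
        # Sleep until at least one connection is readable, rather than
        # blocking on a single one of them.
        for key, _events in sel.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(4096)
            conn.sendall(data.upper())
            sel.unregister(conn)
            conn.close()
            pending -= 1

    replies = [c.recv(4096) for c in clients]
    for c in clients:
        c.close()
    sel.close()
    return replies
```

For example, `serve_many([b"a", b"bc"])` handles both connections from one
thread. A request that waits a long time for remote data does not tie up
the worker while it waits.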

> This is because unlike complex applications where
> there are long running activities occurring in different threads there
> is no real need for the threads handling different requests to
> communicate with each other. Thus the main problem is merely
> protecting concurrent access to shared resources. Even that is not so
> bad as each request handler is mostly operating on data specific to
> the context of that request rather than shared data.
> 

Again, this is true.
However, the problem is that multithreaded servers usually do not 
scale as well as asynchronous ones.
http://blog.emmettshear.com/post/2008/03/03/Dont-use-Pound-for-load-balancing

Of course this is a special case: a server that is mostly I/O bound.
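
The point conceded above (request handlers mostly work on data local to the
request, and only shared resources need protection) can be sketched as
follows. This is a minimal illustration under stated assumptions, not code
from any real server: `App`, `handle` and `serve` are hypothetical names,
and a hit counter stands in for a shared resource.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class App:
    """Hypothetical application with one shared resource: a hit counter."""

    def __init__(self):
        self.hits = 0
        self._lock = threading.Lock()

    def handle(self, path):
        # Request-local data: no other thread sees it, so no locking needed.
        body = "Hello from %s" % path
        # Shared data: the only part that must be protected.
        with self._lock:
            self.hits += 1
        return body


def serve(app, paths, workers=4):
    """Handle every path on a small thread pool; return bodies in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(app.handle, paths))
```

The lock is held only for the counter update, so the handlers never need to
communicate with each other directly.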

> Thus, whether one uses multithreading or an event driven system, one
> can't avoid using multiple processes to build a really good
> scalable system. This is where nginx limits itself a bit as the number
> of worker processes is fixed, whereas with Apache it can create
> additional worker processes if demand requires and reap them when no
> longer required. 

Right.
But this is a subject that needs more discussion (and I suspect that we 
are going off topic).

It is true that Apache can spawn additional processes, but (again, when 
the request is mainly I/O bound) each process does very little work 
while still using a significant amount of system resources.

Nginx instead uses a fixed (and small) number of processes, but each 
process is fully utilized.

The Apache model is great when you need to run generic embedded applications.

I think that Nginx is great for serving static content, proxying, and 
serving embedded applications that are written with the asynchronous 
nature of Nginx in mind.

> You can therefore with Apache factor in some slack to
> cope with bursts in demand and it will scale up the number of
> processes as necessary. With nginx you have to have a really good idea
> in advance of what sort of maximum load you will need to handle as you
> need to fix the number of worker processes. 

Right.

> For static file serving
> the use of an event driven system may make this easier, 

By the way, I know there is an event-based worker in Apache.
Do you have experience with it?

> but factor in
> a Python web application where each request has a much greater
> overhead and possibility of blocking and it becomes a much trickier
> proposition to plan how many worker processes you may need.
>

Right.

> No matter what technology one uses there will be such trade offs and
> they will vary depending on what you are doing. Thus it is going to be
> very rare that one technology is always the "right" technology. Also,
> as much as people like to focus on raw performance of the web server
> for hosting Python web applications, in general the actual performance
> matters very little in the greater scheme of things (unless you're
> stupid enough to use CGI). This is because that isn't where the
> bottlenecks are generally going to be. Thus, that one hosting solution
> may for a hello world program be three times faster than another,
> means absolutely nothing if that ends up translating to less than 1
> percent throughput when someone loads and runs their mega Python
> application. This is especially the case when the volume of traffic
> the application receives never goes anywhere near fully utilising the
> actual resources available. For large systems, you would never even
> depend on one machine anyway and load balance across a cluster. Thus
> the focus by many on raw speed in many cases is just plain ridiculous
> as there is a lot more to it than that.
> 

Raw speed is not the only problem.
There is also the problem of server resource usage.

As an example, an Italian hosting company imposes strict limits on 
resource usage for each client.

They do not use Apache, since they fear that serving embedded 
applications limits their control (but, if I'm not wrong, you have 
implemented a solution for this problem in mod_wsgi).

Using Nginx + the wsgi module has the benefit of requiring fewer system 
resources than flup (as an example) and, probably, Apache.

> Graham
> 

Manlio Perillo