[Web-SIG] CPU cache locality.
Alan Kennedy
py-web-sig at xhaus.com
Thu Sep 9 13:20:51 CEST 2004
[Alan Kennedy]
>> 3. Welsh's architecture discusses isolation of multiple IO subsystems
>> into different thread groups. For example, there could be a thread
>> group holding a pool of (blocking) database connections, which would
>> be the appropriate "width" to process as many requests as can be
>> concurrently supported by the RDBMS. Since there are blocking
>> sockets/pipes/fifos between the application and the database, such
>> database operations also count as a form of IO, which has to be
>> managed. It could potentially be managed in an asynchronous fashion.
>> Do any of the cpython frameworks support an asynchronous database API?
[Jp Calderone]
> Yes, http://twistedmatrix.com/documents/current/howto/enterprise
Thanks for the reply Jp.
I've been thinking further about multi-threading, CPU cache locality and
iterators. Although I started out thinking about it in relation to
Twisted Enterprise, it's really an issue that applies to WSGI as well.
But let's take Twisted Enterprise as an example. I'm not intimately
familiar with Twisted, so please forgive me if I get something wrong.
So Twisted has a pool of threads which carries out synchronous database
operations on behalf of clients, but in an asynchronous manner from the
client's perspective. This is done by receiving the "database requests"
from a queue, processing each synchronously using blocking DB-API calls,
and then returning the result to the client asynchronously, either
through a callback function or by sending the results back on a queue.
Is this how Twisted's "deferred"s work?
So, for the sake of argument, let's say that a similar structure is in
place in a WSGI framework. Further, let's say that database "results",
e.g. strings, ints and blobs from database columns, will be yielded as
iterable data by some middleware component. These values will be
processed further down the middleware stack by some other component
which, for example, is generating HTML pages containing the data.
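To make the scenario concrete, here is a sketch of the kind of lazy
pipeline I have in mind (all of the names are invented for
illustration):

    def fetch_rows():
        # Stand-in for a DB-API cursor: in the real case, each step
        # of this generator might run database code to produce the
        # next value.
        for name in ("alice", "bob"):
            yield name

    def db_component(environ, start_response):
        # Yields raw database values one at a time, on demand.
        start_response("200 OK", [("Content-Type", "text/html")])
        for name in fetch_rows():
            yield name

    def html_middleware(app):
        # Wraps another component; note that no work happens here.
        # The generator body only runs when someone advances the
        # final iterable.
        def wrapped(environ, start_response):
            for value in app(environ, start_response):
                yield "<li>%s</li>" % value
        return wrapped

    application = html_middleware(db_component)

    # Whichever thread iterates this is the thread that executes both
    # the HTML formatting code and the database code:
    for chunk in application({}, lambda status, headers: None):
        print(chunk)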
Let's assume that there is a single I/O thread which is responsible for
communicating final results back to the user, i.e. through the client
socket. Due to the on-demand nature of the iterator which middleware
uses to return values, it is possible that the I/O thread could end up
executing database code. For example, say that the database data is
accessed through a Python descriptor: merely reading an attribute may
then execute Python code in whatever Python object retrieved the data
from the database. This will be detrimental to CPU cache locality.
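A toy illustration of the descriptor point (the class names are
invented):

    class LazyColumn(object):
        # A descriptor: reading row.name below executes this __get__
        # method, which could in turn run arbitrary database code.
        def __init__(self, column):
            self.column = column
        def __get__(self, obj, objtype=None):
            return obj._fetch(self.column)

    class Row(object):
        name = LazyColumn("name")
        def _fetch(self, column):
            # Stand-in for a blocking database fetch.
            return "value-of-%s" % column

    row = Row()
    # This innocent-looking attribute access executes Python code; if
    # the I/O thread performs it while iterating the response, the
    # I/O thread is now effectively running database code.
    print(row.name)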
Because the I/O thread will potentially execute code from every
component in the middleware stack, its thread of execution could
meander all over several megabytes of Python bytecode, which is pretty
much guaranteed to eliminate any benefit that may be provided by CPU
caches.
In the worst case, this could cause significant cache "thrashing", as
lots of different pieces of bytecode clash and "fight" for space in the
CPU cache.
Welsh[1] states the problem like this: "In a thread-per-task system, the
instruction cache tends to take many misses as the thread's control
passes through many unrelated code modules to process the task. In
addition, whenever a context switch occurs (due to thread preemption or
blocking I/O call, say), other threads will invariably push the waiting
thread's state out of the cache. When the original thread resumes
execution, it will need to take many cache misses in order to bring its
code and state back into the cache. In this situation, all of the
threads in the system are competing for limited cache space."
The solution to this problem is for middleware components to return
only references to passive data, and never to return iterators that
cause the execution of Python code.
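In code terms, something like the following sketch: the worker thread
runs the iterable to exhaustion, so that only passive data ever
crosses over to the I/O thread:

    def materialize(iterable):
        # Run the generator to completion here, in the worker thread,
        # so that all descriptor, database and formatting code
        # executes in one place. The result is a plain list of
        # strings: iterating it in the I/O thread executes no
        # application code at all.
        return list(iterable)

    # In the worker thread:
    #     chunks = materialize(application(environ, start_response))
    # In the I/O thread:
    #     for chunk in chunks:
    #         client_socket.send(chunk)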
I notice that Phillip has included a statement about this in PEP-0333,
in the section "Buffering and Streaming":
"""
Generally speaking, applications will achieve the best throughput by
buffering their (modestly-sized) output and sending it all at once. When
this is the case, applications should simply return a single-element
iterable containing their entire output as a single string.
[snip]
For large files, however, or for specialized uses of HTTP streaming
(such as multipart "server push"), an application may need to provide
output in smaller blocks (e.g. to avoid loading a large file into
memory). It's also sometimes the case that part of a response may be
time-consuming to produce, but it would be useful to send ahead the
portion of the response that precedes it.
"""
Phillip, when you wrote about "performance" here, did you have CPU
caches in mind?
Regards,
Alan.
1. A Design Framework for Highly Concurrent Systems
http://www.eecs.harvard.edu/~mdw/papers/events.pdf