[Web-SIG] CPU cache locality.
Alan Kennedy
py-web-sig at xhaus.com
Thu Sep 9 13:20:51 CEST 2004
[Alan Kennedy]
>> 3. Welsh's architecture discusses isolation of multiple IO subsystems
>> into different thread groups. For example, there could be a thread
>> group holding a pool of (blocking) database connections, which would
>> be the appropriate "width" to process as many requests as can be
>> concurrently supported by the RDBMS. Since there are blocking
>> sockets/pipes/fifos between the application and the database, such
>> database operations also count as a form of IO, which has to be
>> managed. It could potentially be managed in an asynchronous fashion.
>> Do any of the cpython frameworks support an asynchronous database API?
[Jp Calderone]
> Yes, http://twistedmatrix.com/documents/current/howto/enterprise
Thanks for the reply Jp.
I've been thinking further about multi-threading, CPU cache locality and
iterators. Although I started out thinking about it in relation to
Twisted Enterprise, it's really an issue that applies to WSGI as well.
But let's take Twisted Enterprise as an example. I'm not intimately
familiar with Twisted, so please forgive me if I get something wrong.
So Twisted has a pool of threads which carries out synchronous database
operations on behalf of clients, but in an asynchronous manner from the
client's perspective. This is done by receiving the "database requests"
from a queue, processing each synchronously using blocking DB-API calls,
and then returning the result to the client asynchronously, either
through a callback function or by sending the results back on a queue.
Is this how Twisted's "deferred"s work?
So, for the sake of argument, let's say that a similar structure is in
place in a WSGI framework. Further, let's say that database "results",
e.g. strings, ints and blobs from database columns, will be yielded as
iterable data by some middleware component. These values will be
processed further down the middleware stack by some other component
which, for example, is generating HTML pages containing the data.
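To make the scenario concrete, here is a sketch of the kind of lazy
pipeline I have in mind (all of the names are invented for
illustration):

    def fetch_rows():
        # Stand-in for a DB-API cursor: in the real case, each step
        # of this generator might run database code to produce the
        # next value.
        for name in ("alice", "bob"):
            yield name

    def db_component(environ, start_response):
        # Yields raw database values one at a time, on demand.
        start_response("200 OK", [("Content-Type", "text/html")])
        for name in fetch_rows():
            yield name

    def html_middleware(app):
        # Wraps another component; note that no work happens here.
        # The generator body only runs when someone advances the
        # final iterable.
        def wrapped(environ, start_response):
            for value in app(environ, start_response):
                yield "<li>%s</li>" % value
        return wrapped

    application = html_middleware(db_component)

    # Whichever thread iterates this is the thread that executes both
    # the HTML formatting code and the database code:
    for chunk in application({}, lambda status, headers: None):
        print(chunk)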
Let's assume that there is a single I/O thread which is responsible for
communicating final results back to the user, i.e. through the client
socket. Due to the on-demand nature of the iterator which middleware
uses to return values, it is possible that the I/O thread could end up
executing database code. For example, say that the database data is
accessed through a Python descriptor: merely reading an attribute may
then execute Python code in whatever Python object retrieved the data
from the database. This will be detrimental to CPU cache locality.
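A toy illustration of the descriptor point (the class names are
invented):

    class LazyColumn(object):
        # A descriptor: reading row.name below executes this __get__
        # method, which could in turn run arbitrary database code.
        def __init__(self, column):
            self.column = column
        def __get__(self, obj, objtype=None):
            return obj._fetch(self.column)

    class Row(object):
        name = LazyColumn("name")
        def _fetch(self, column):
            # Stand-in for a blocking database fetch.
            return "value-of-%s" % column

    row = Row()
    # This innocent-looking attribute access executes Python code; if
    # the I/O thread performs it while iterating the response, the
    # I/O thread is now effectively running database code.
    print(row.name)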
Because the I/O thread will potentially execute code from every
component in the middleware stack, its thread of execution could
meander all over several megabytes of Python bytecode, which is pretty
much guaranteed to eliminate any benefit that may be provided by CPU
caches.
In the worst case, this could cause significant cache "thrashing", as
lots of different pieces of bytecode clash and "fight" for space in the
CPU cache.
Welsh[1] states the problem like this: "In a thread-per-task system, the
instruction cache tends to take many misses as the thread's control
passes through many unrelated code modules to process the task. In
addition, whenever a context switch occurs (due to thread preemption or
blocking I/O call, say), other threads will invariably push the waiting
thread's state out of the cache. When the original thread resumes
execution, it will need to take many cache misses in order to bring its
code and state back into the cache. In this situation, all of the
threads in the system are competing for limited cache space."
The solution to this problem is for middleware components to return
only references to passive data, and never to return iterators that
cause the execution of Python code.
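In code terms, something like the following sketch: the worker thread
runs the iterable to exhaustion, so that only passive data ever
crosses over to the I/O thread:

    def materialize(iterable):
        # Run the generator to completion here, in the worker thread,
        # so that all descriptor, database and formatting code
        # executes in one place. The result is a plain list of
        # strings: iterating it in the I/O thread executes no
        # application code at all.
        return list(iterable)

    # In the worker thread:
    #     chunks = materialize(application(environ, start_response))
    # In the I/O thread:
    #     for chunk in chunks:
    #         client_socket.send(chunk)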
I notice that Phillip has included a statement about this in PEP-0333,
in the section "Buffering and Streaming":
"""
Generally speaking, applications will achieve the best throughput by
buffering their (modestly-sized) output and sending it all at once. When
this is the case, applications should simply return a single-element
iterable containing their entire output as a single string.
[snip]
For large files, however, or for specialized uses of HTTP streaming
(such as multipart "server push"), an application may need to provide
output in smaller blocks (e.g. to avoid loading a large file into
memory). It's also sometimes the case that part of a response may be
time-consuming to produce, but it would be useful to send ahead the
portion of the response that precedes it.
"""
Phillip, when you wrote about "performance" here, did you have CPU
caches in mind?
Regards,
Alan.
1. A Design Framework for Highly Concurrent Systems
http://www.eecs.harvard.edu/~mdw/papers/events.pdf