[Python-Dev] Improved thread switching

Stefan Ring s.r at visotech.at
Tue Mar 18 08:29:10 CET 2008


The company I work for has over the last couple of years created an
application server for use in most of our customer projects. It embeds Python
and most project code is written in Python by now. It is quite resource-hungry
(several GB of RAM, MySQL databases of 50-100GB). And of course it is
multi-threaded; at least originally, we hoped to make it utilize multiple
processor cores, which, as we all know, doesn't sit very well with Python. Our
application runs heavy background calculations (in Python) most of the time
and has to service a handful of GUI clients at the same time, also in
Python. The problem was that a single background thread would increase the
response time of the client threads by a factor of 10 or (usually) more.

This led me to add a dirty hack to the Python core to make it switch threads
more frequently. While this hack greatly improved response time for the GUI
clients, it also slowed down the background threads quite a bit. top would
often show significantly less CPU usage -- 80% instead of the more usual 100%.

The problem with thread switching in Python is that the global semaphore used
for the GIL is regularly released and immediately reacquired. Unfortunately,
most of the time this leads to the very same thread winning the race on the
semaphore again and thus more wait time for the other threads. This is where
my dirty patch intervened and just did a nanosleep() for a short amount of
time (I used 1000 nsecs).
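
To make the hack concrete, here is a minimal sketch of the idea (the helper
name is made up; the real change would of course live inside the interpreter's
periodic GIL release rather than in a separate function):

    #include <Python.h>
    #include <time.h>

    /* Sketch of the "dirty hack": after releasing the GIL, nap briefly
     * before reacquiring it, so that a waiting thread has a realistic
     * chance of winning the race on the semaphore. */
    static void
    yield_gil_with_nap(void)
    {
        struct timespec nap = { 0, 1000 };           /* 1000 nsec */
        PyThreadState *tstate = PyEval_SaveThread(); /* release the GIL */
        nanosleep(&nap, NULL);                       /* let another thread grab it */
        PyEval_RestoreThread(tstate);                /* reacquire the GIL */
    }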

I then created a better scheduling scheme and wrote a small test program
that mimics what Python does, in order to gather some statistics. I call the
scheduling algorithm the round-robin semaphore because threads can now run in
a more or less round-robin fashion. Actually, it's just a semaphore with FIFO
semantics.
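
Just to make the idea concrete, here is a minimal ticket-based sketch of a
lock with FIFO semantics (the names rr_sem/rr_enter/rr_leave and the
ticket-counter approach are mine for illustration only; the implementation in
the attached p.cpp is structured differently and needs the per-thread state
described below):

    #include <pthread.h>

    /* A lock with FIFO ("round-robin") hand-off: every acquirer draws a
     * ticket and may only proceed when its ticket is being served, so
     * waiters get the lock strictly in arrival order instead of racing
     * for it. */
    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t  cond;
        unsigned long   next_ticket;   /* next ticket to hand out         */
        unsigned long   now_serving;   /* ticket currently allowed to run */
    } rr_sem;

    /* e.g.: static rr_sem s =
     *           { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }; */

    static void rr_enter(rr_sem *s)
    {
        unsigned long ticket;
        pthread_mutex_lock(&s->mtx);
        ticket = s->next_ticket++;
        while (ticket != s->now_serving)
            pthread_cond_wait(&s->cond, &s->mtx);
        pthread_mutex_unlock(&s->mtx);
    }

    static void rr_leave(rr_sem *s)
    {
        pthread_mutex_lock(&s->mtx);
        s->now_serving++;
        /* Broadcast is the simplest (not the cheapest) way to wake
         * whichever thread holds the next ticket. */
        pthread_cond_broadcast(&s->cond);
        pthread_mutex_unlock(&s->mtx);
    }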

The implementation problem with the round-robin semaphore is the __thread
variable I had to use because I did not want to change the signatures of the
Enter() and Leave() methods. For CPython, I have replaced this thread-local
allocation with an additional field in the PyThreadState structure. Because of
that, the CPython patch I have already created is a bit more involved than the
simple nanosleep() hack. Consequently, it is not very polished yet and not
nearly as portable as the rest of the Python core.
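
Roughly, the two storage variants look like this (WaitNode, my_wait_node and
rr_wait_node are names invented for this sketch, not the identifiers from my
patch, and what exactly the per-thread state holds is simplified here):

    /* Hypothetical per-thread record used by the round-robin semaphore. */
    typedef struct WaitNode WaitNode;

    /* Test program variant: keep the per-thread state in thread-local
     * storage (GCC's __thread extension), so Enter() and Leave() keep
     * their signatures. */
    static __thread WaitNode *my_wait_node;

    /* CPython variant: move the same state into an additional field of
     * the thread state instead, reached through the current thread's
     * PyThreadState, e.g.:
     *
     *     PyThreadState *ts = PyThreadState_GET();
     *     ts->rr_wait_node = node;
     */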

Attached are the results from the test program, which compares all three
scheduling mechanisms -- standard Python, my dirty hack and the new
round-robin semaphore. Also attached is the test program itself, containing
the three implementations nicely encapsulated.

The program was run on a quad-core Xeon 1.86 GHz on Fedora 5 x86_64. The first
three lines of the output (including the name of the algorithm) should be
self-explanatory. The fourth and fifth lines show the distribution of wait
times for the individual threads. The ideal distribution would have everything
at the number of running threads (2 in this case) and zero everywhere else. As
you can see, the round-robin semaphore comes pretty close to that. Also,
because of the higher thread switching frequency, we could raise Python's
checkinterval (default 100) -- the jury is still out on the actual value,
likely something between 1000 and 10000.

I can post my Python patch if there is enough interest.

Thanks for your attention.


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: results.txt
Url: http://mail.python.org/pipermail/python-dev/attachments/20080318/4b3d5d53/attachment.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: p.cpp
Url: http://mail.python.org/pipermail/python-dev/attachments/20080318/4b3d5d53/attachment-0001.txt 

