Improved thread switching
The company I work for has, over the last couple of years, created an
application server for use in most of our customer projects. It embeds Python,
and most project code is written in Python by now. It is quite resource-hungry
(several GB of RAM, MySQL databases of 50-100 GB). And of course it is
multi-threaded; at least originally, we hoped to make it utilize multiple
processor cores, which, as we all know, doesn't sit very well with Python. Our
application runs heavy background calculations most of the time (in Python)
and has to service a few GUI clients at the same time, also using Python. The
problem was that a single background thread would increase the response time
of the client threads by a factor of 10 or (usually) more.
This led me to add a dirty hack to the Python core to make it switch threads
more frequently. While this hack greatly improved response time for the GUI
clients, it also slowed down the background threads quite a bit. top would
often show significantly less CPU usage -- 80% instead of the more usual 100%.
The problem with thread switching in Python is that the global semaphore used
for the GIL is regularly released and immediately reacquired. Unfortunately,
most of the time this leads to the very same thread winning the race on the
semaphore again and thus more wait time for the other threads. This is where
my dirty patch intervened and just did a nanosleep() for a short amount of
time (I used 1000 nsecs).
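For illustration, the hack boils down to something like the following sketch
(a plain pthread mutex stands in for the GIL, and the function name is made
up; this is not the actual patch):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;  // stand-in for the GIL

// Called at every checkinterval: release the lock so that other threads
// can run, then take it back. Without the sleep, the releasing thread
// almost always wins the reacquisition race on a multi-core machine.
void check_interval_switch(void)
{
    pthread_mutex_unlock(&gil);

    struct timespec ts = { 0, 1000 };   // 1000 nsecs, the value I used
    nanosleep(&ts, NULL);               // the dirty hack: step aside briefly

    pthread_mutex_lock(&gil);
}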
I have since created a better scheduling scheme and written a small test
program that mimics what Python does, in order to gather some statistics. I
call the scheduling algorithm the round-robin semaphore because threads now
run in a more or less round-robin fashion. Actually, it's just a semaphore
with FIFO semantics.
The implementation problem with the round-robin semaphore is the __thread
variable I had to use because I did not want to change the signature of the
Enter() and Leave() methods. For CPython, I have replaced this thread-local
allocation with an additional field in the PyThreadState. Because of that, the
patch for CPython I have already created is a bit more involved than the
simple nanosleep() hack. Consequently, it's not very polished yet and nowhere
near as portable as the rest of the Python core. A sketch of the idea follows.
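To make the FIFO semantics concrete, here is a minimal sketch (illustrative
only -- the names and details differ from the actual patch; note the __thread
wait node mentioned above):

#include <pthread.h>
#include <stddef.h>

// One wait node per thread, found via __thread so that Enter() and
// Leave() can keep their parameterless signatures. In the CPython patch,
// this state lives in PyThreadState instead.
struct WaitNode {
    pthread_cond_t cond;
    WaitNode      *next;
    bool           signalled;
};

static __thread WaitNode tls_node = { PTHREAD_COND_INITIALIZER, NULL, false };

class RoundRobinSemaphore {
    pthread_mutex_t mtx;
    bool            held;
    WaitNode       *head, *tail;        // FIFO queue of waiting threads
public:
    RoundRobinSemaphore() : held(false), head(NULL), tail(NULL) {
        pthread_mutex_init(&mtx, NULL);
    }

    void Enter() {
        pthread_mutex_lock(&mtx);
        if (held) {
            WaitNode *n = &tls_node;    // enqueue ourselves at the tail
            n->next = NULL;
            n->signalled = false;
            if (tail) tail->next = n; else head = n;
            tail = n;
            while (!n->signalled)       // sleep until it is our turn
                pthread_cond_wait(&n->cond, &mtx);
        }
        held = true;
        pthread_mutex_unlock(&mtx);
    }

    void Leave() {
        pthread_mutex_lock(&mtx);
        if (head) {                     // hand ownership to the oldest waiter
            WaitNode *n = head;
            head = n->next;
            if (!head) tail = NULL;
            n->signalled = true;
            pthread_cond_signal(&n->cond);
        } else {
            held = false;
        }
        pthread_mutex_unlock(&mtx);
    }
};

Because ownership is handed directly to the oldest waiter, a thread that has
just called Leave() goes to the back of the queue on its next Enter() and
cannot immediately win the lock back while other threads are waiting.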
Below are the results from the test program, which compares all three
scheduling mechanisms -- the standard Python lock, my dirty hack, and the new
round-robin semaphore. After the results comes the test program itself, with
the three implementations nicely encapsulated.
The program was run on a quad-core 1.86 GHz Xeon under Fedora 5 x86_64. The
first three lines of each output block (including the name of the algorithm)
should be self-explanatory. The fourth and fifth lines show the distribution
of wait times for the individual threads. The ideal distribution would have
everything in the column matching the number of threads (2 in this case) and
zeros everywhere else. As you can see, the round-robin semaphore comes pretty
close to that. Also, because of the high thread switching frequency, Python's
checkinterval could be raised -- the jury is still out on the actual value,
but likely something between 1000 and 10000.
I can post my Python patch if there is enough interest.
Thanks for your attention.
Synch: Python lock
iteration count: 24443
thread switches: 10
    1  2  3  4  5  6  7  8  9 10 -10 -50 -100 -1k more
24433  0  0  0  0  0  0  0  0  0   0   1    1   6    0
Synch: Dirty lock
iteration count: 25390
thread switches: 991
    1  2  3  4  5  6  7  8  9 10 -10 -50 -100 -1k more
24399 10  0  0  0  0  1  0  1  0 975   1    1   0    0
Synch: round-robin semaphore
iteration count: 23023
thread switches: 22987
   1     2  3  4  5  6  7  8  9 10 -10 -50 -100 -1k more
  36 22984  0  0  0  0  0  0  0  0   1   0    0   0    0
// compile with g++ -g -O0 -pthread -Wall p.cpp
[the remainder of the test program listing is truncated in the archive]
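Since the listing got cut off, here is a rough, hypothetical reconstruction of
what the measuring loop amounts to (names, constants and the output format are
guesses; the real program buckets the wait times as in the tables above):

#include <pthread.h>
#include <stdio.h>
#include <map>

// Stand-in for the "Python lock" scheme; swap in the RoundRobinSemaphore
// from the sketch above to compare the schedulers.
class PlainLock {
    pthread_mutex_t m;
public:
    PlainLock() { pthread_mutex_init(&m, NULL); }
    void Enter() { pthread_mutex_lock(&m); }
    void Leave() { pthread_mutex_unlock(&m); }
};

static PlainLock sem;                      // scheme under test
static const long TOTAL_ITERS = 25000;     // roughly matches the runs above
static long iterations = 0;                // global turn counter
static long switches = 0;                  // ownership changes
static int  last_owner = -1;
static long last_seen[2] = { 0, 0 };       // turn at which each thread last ran
static std::map<long, long> histo;         // wait time -> count

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (;;) {
        sem.Enter();                       // all counters are touched under the lock
        if (iterations >= TOTAL_ITERS) { sem.Leave(); break; }
        ++iterations;
        if (last_owner != id) { ++switches; last_owner = id; }
        ++histo[iterations - last_seen[id]];  // turns since this thread last ran
        last_seen[id] = iterations;
        sem.Leave();
    }
    return NULL;
}

int main()
{
    pthread_t t[2];
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);

    printf("iteration count: %ld\nthread switches: %ld\n", iterations, switches);
    for (std::map<long, long>::iterator it = histo.begin(); it != histo.end(); ++it)
        printf("waited %ld turns: %ld times\n", it->first, it->second);
    return 0;
}

With the plain lock, nearly all waits land on 1 (the same thread keeps going);
with the round-robin semaphore and two threads they land on 2, which is the
spike you see in the distributions above.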
Adam Olsen writes:
Can you try with a call to sched_yield(), rather than nanosleep()? It should have the same benefit but without as much of a performance hit. If it works but is still too much of a hit, try tuning the checkinterval to see if you can find an acceptable throughput/responsiveness balance.
-- Adam Olsen, aka Rhamphoryncus
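For concreteness, this variant would replace the sleep in the earlier sketch
with a yield, roughly like this (again a sketch, not the actual patch):

#include <pthread.h>
#include <sched.h>

extern pthread_mutex_t gil;        // the stand-in GIL from the sketch above

void check_interval_switch_yield(void)
{
    pthread_mutex_unlock(&gil);
    sched_yield();                 // offer the CPU to any runnable thread
    pthread_mutex_lock(&gil);
}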
Stefan Ring writes:
I tried that, and it had no effect whatsoever. I suppose it would have an effect on a single CPU or an otherwise heavily loaded SMP system, but that's not the scenario we care about.
Adam Olsen writes:
So you've got a lightly loaded SMP system? Multiple threads all blocked on the GIL, multiple CPUs to run them, but only one CPU active? In that case I can imagine how sched_yield() might finish before the other CPUs wake up a thread. A FIFO scheduler would be the right thing here, but it's only a short-term solution. Care for a long-term solution? ;)
http://code.google.com/p/python-safethread/
-- Adam Olsen, aka Rhamphoryncus
Stefan Ring writes:
I've already seen that, but it would not help us in our current situation. The performance penalty really is too heavy; our system is slow enough already ;). And it would be very difficult, bordering on impossible, to parallelize. Plus, I can imagine that all extension modules (and our own code) would have to be adapted. The FIFO scheduler is perfect for us because the load is typically quite low. It's mostly at those times when someone runs a lengthy calculation that all other users suffer greatly increased response times.