overkill context switches

Hello,
when a page that requires _several_ sql queries (through psycopg2) is loaded, an overkill amount of context switches happens:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy  id wa
 0  0      0  18772 614380 227500    0    0     0    20 1037     55  0  0  99  1
 0  0      0  18772 614380 227500    0    0     0     0 1035     48  0  0 100  0
 1  0      0  18772 614380 227500    0    0     0     0 1032  24613 17 31  53  0
 0  0      0  18764 614380 227500    0    0     0     0 1086   6766 20  6  75  0
 0  0      0  18764 614380 227500    0    0     0     0 1133     64  1  0  99  0
These must be thread switches; otherwise I doubt my slow server system would be able to schedule 24k times per second and still provide decent performance.
However, those useless context switches are certainly wasting quite some cpu, so I'd like to fix this.
There's a huge number of futex(FUTEX_WAKE) calls in the strace; python calls futex_wake even when you simply invoke it with `python` from the shell (it makes no sense to call the futex syscall before the first pthread_create, so this is certainly suboptimal, but I'm unsure whether it's related to the context switches: a flood of futex calls would just waste tons of cpu on kernel entries/exits without necessarily switching tasks). The whole point of futex is to avoid entering the kernel in the fast path, so this code makes no sense.
I could reproduce the overkill context switches even with linuxthreads instead of NPTL, so this is probably not an NPTL bug.
It's also probably not a psycopg2 bug because psycopg has no notion of threading.
This should be a python or twisted-thread-pool bug. I'm not sure about the best way to track it down, especially because so far I can reproduce it only on the server, which is a bit slower and which I cannot use for debugging.
Can people run `vmstat 1` on their systems and then reload some complex page, to see if you also get a flood of context switches for no good reason?
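If vmstat isn't handy, a quick hack like the following (Linux-specific, just sampling the ctxt counter in /proc/stat) shows the same context-switch rate:
----
# Quick hack: print context switches per second by sampling the
# "ctxt" counter in /proc/stat once a second (Linux-specific).
import time

def ctxt():
    for line in open("/proc/stat"):
        if line.startswith("ctxt"):
            return int(line.split()[1])

prev = ctxt()
while True:
    time.sleep(1)
    cur = ctxt()
    print cur - prev, "context switches/sec"
    prev = cur
----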
I wonder if perhaps a select call in one of the threads is buggy and keeps yielding the cpu to the other tasks for some reason. This doesn't seem to be a starvation issue; it just looks like some thread, instead of waiting, keeps looping and retrying. So it's noticeable only as a slowdown; there is no actual malfunction.
If you can reproduce it, let me know; otherwise I'll try to debug it in the next few days (it's not really urgent since, as said, it's only a performance issue, and if the system is under load the caching patches make it fast anyway). OTOH, look at the stats: 31% of the cpu is spent in system time, and those are the context switches for sure, while only 17% of the cpu was spent in userland rendering the page, so the page would be rendered at least twice as fast if this were fixed.
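A loosely related knob, untested here and not necessarily connected to the futex flood at startup: CPython 2.x releases and re-acquires the GIL every sys.getcheckinterval() bytecodes (100 by default), so raising it makes busy Python threads hand the cpu to each other less often:
----
# Untested aside: raise CPython 2.x's check interval so a running
# thread offers to release the GIL less frequently.  This only
# affects GIL handoffs between Python threads, nothing else.
import sys

print "old check interval:", sys.getcheckinterval()
sys.setcheckinterval(1000)
----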

On Mon, Sep 26, 2005 at 05:50:24PM +0200, Andrea Arcangeli wrote: [...]
This should be a python or twisted-thread-pool bug. I'm not sure about the best way to track it down, especially because so far I can reproduce it only on the server, which is a bit slower and which I cannot use for debugging.
Twisted's threadpools are pretty straightforward wrappers around Python's threading module. See twisted.python.threadpool.
I'd expect you could reproduce this with a simple python script that starts two threads, with one putting items on a Queue.Queue, and the other reading them off, as a simple approximation of the threadpool arrangement.
In fact, here's a script that does this:
----
import threading
import Queue

q = Queue.Queue()

def producer():
    # Count from 0 to 99, over and over.
    i = 0
    while True:
        q.put(i)
        i += 1
        i %= 100

def consumer():
    while True:
        i = q.get()
        i * 2

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.setDaemon(True)
c.setDaemon(True)
p.start()
c.start()

import time; time.sleep(10)
----
vmstat reports lines like:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa
 1  1 350740  18828  38260 204880    0    0  2696     0 1099 100315 83 17  0  0
 3  2 350740  16852  38336 206760    0    0  1956    20 1133 100174 86 14  0  0
 1  1 350740  15116  38444 208352    0    0  1684   592 1106  99901 85 15  0  0
 1  1 350740  12760  38520 210640    0    0  2364     0 1119 100183 87 13  0  0
 1  1 350740  11024  38612 212336    0    0  1792     0 1123 100255 85 15  0  0
 1  1 350740   9048  38696 214216    0    0  1964     0 1129 100057 85 15  0  0
(it also appears that the producer adds to the queue faster than the consumer gets from it. Doesn't really matter, aside from the memory cost.)
I don't see it with a simple script that has two threads running in parallel with no inter-thread communication at all, however. I'd say the place to look is Python's implementation of thread synchronisation primitives.
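To make that concrete, here's a simplified sketch (not the actual Queue.Queue source, just the same condition-variable arrangement) of where the wakeups come from: every put() notifies a possibly sleeping consumer, so a tight producer/consumer loop does a futex wake, and often a context switch, per item:
----
# Simplified sketch of a condition-variable queue (an approximation
# of what Queue.Queue does, not its real source).
import threading
from collections import deque

class TinyQueue:
    def __init__(self):
        self.items = deque()
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        self.not_empty.acquire()
        try:
            self.items.append(item)
            self.not_empty.notify()      # wakes a blocked get(), if any
        finally:
            self.not_empty.release()

    def get(self):
        self.not_empty.acquire()
        try:
            while not self.items:
                self.not_empty.wait()    # sleeps until a put() notifies
            return self.items.popleft()
        finally:
            self.not_empty.release()
----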
-Andrew.

On Tue, Sep 27, 2005 at 10:25:01AM +1000, Andrew Bennetts wrote:
[...]
I don't see it with a simple script that has two threads running in parallel with no inter-thread communication at all, however. I'd say the place to look is Python's implementation of thread synchronisation primitives.
Thanks for the example, but that looks quite normal and expected: it passes a token at a fast rate from one thread to the other. In my workload there are, say, 5 or 6 queries, two through pb and the others through psycopg2, so it should schedule a dozen times, not 20k times.
If I were passing 10k separate tokens (as in the above testcase), then 20k schedules would be normal.
I even tried pgasync and it also schedules >20k times. I narrowed it down to executing multiple sql queries simultaneously. So now I wonder if it's postgres scheduling that fast and not twisted... Unfortunately I have no kernel knob installed that would tell me who is scheduling so fast, and strace alters the runtime as well.
I'll try to serialize the sql queries, so that no more than one query runs at once, and I guess that'll fix it just fine. (working on making this change right now)
It still looks to me like something could be optimized, because on such a basic workload that many schedules are not needed (the futex syscalls make no sense either, especially with pgasync), but perhaps it's not the twisted side that generates the overscheduling.
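For reference, the serialization could look like the following untested sketch (it assumes a Twisted recent enough to ship defer.DeferredLock; the database and table names are made up):
----
# Untested sketch: serialize the db work so only one query is in
# flight at a time.  "mydb", "foo" and "bar" are made-up names.
from twisted.enterprise import adbapi
from twisted.internet import defer, reactor

dbpool = adbapi.ConnectionPool("psycopg2", database="mydb")
query_lock = defer.DeferredLock()

def run_serialized(query, *args):
    # DeferredLock.run() acquires the lock, calls the function, and
    # releases the lock when the returned deferred fires, so at most
    # one runQuery is active at any time.
    return query_lock.run(dbpool.runQuery, query, *args)

def show(rows):
    print rows

d1 = run_serialized("SELECT count(*) FROM foo").addCallback(show)
d2 = run_serialized("SELECT count(*) FROM bar").addCallback(show)
defer.DeferredList([d1, d2]).addCallback(lambda _: reactor.stop())
reactor.run()
----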

On Tue, Sep 27, 2005 at 12:07:02PM +0200, Andrea Arcangeli wrote:
I'll try to serialize the sql queries, so that no more than one query runs at once, and I guess that'll fix it just fine. (working on making this change right now)
that fixed it:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy  id wa
 0  0      0  21776 300076 488504    0    0     0     0 1037     53  0  0 100  0
 0  0      0  21776 300076 488504    0    0     0     0 1032     49  0  0 100  0
 1  0      0  21712 300084 488496    0    0     0     4 1040    243 11 26  64  0
 0  0      0  21704 300084 488496    0    0     0     0 1054     81 18 25  58  0
 0  0      0  21704 300084 488496    0    0     0     0 1103     62  1  0  99  0
 0  0      0  21704 300084 488496    0    0     0     0 1088     61  0  0 100  0
It's not scheduling at a fast rate anymore. Apparently, invoking multiple queries against the db at the same time through different threads, each one returning to a different deferred, triggers the overscheduling. Perhaps it's not a bug, but it certainly isn't a feature either ;)

On Sep 27, 2005, at 3:07 AM, Andrea Arcangeli wrote:
So now I wonder if it's postgres scheduling that fast and not twisted... Unfortunately I have no kernel knob installed that would tell me who is scheduling so fast, and strace alters the runtime as well.
The non-kernel developers among us might just try running PostgreSQL on a different machine. ;)

On Tue, Sep 27, 2005 at 03:00:45PM -0700, Scott Lamb wrote:
The non-kernel developers among us might just try running PostgreSQL on a different machine. ;)
eheh ;)
If I could have reproduced it locally I would have tried that, but doing so on the remote server behind firewalls would have been tricky, and the extremely different timings and huge latency of connecting to the sql server over the internet would probably have invalidated the results.
Anyway, it's fixed now by making serial requests to the db, so I tend to think it was postgres and not twisted, and I apologise for the noise.
A full vacuum with analyse (vacuumdb -z -f) also halved the time it takes nevow to render the page.