When a page that requires _several_ SQL queries (through psycopg2) is
loaded, `vmstat 1` shows an excessive number of context switches:
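procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0  18772 614380 227500    0    0     0    20 1037    55  0  0 99  1
 0  0      0  18772 614380 227500    0    0     0     0 1035    48  0  0 100  0
 1  0      0  18772 614380 227500    0    0     0     0 1032 24613 17 31 53  0
 0  0      0  18764 614380 227500    0    0     0     0 1086  6766 20  6 75  0
 0  0      0  18764 614380 227500    0    0     0     0 1133    64  1  0 99  0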
These must be thread switches, otherwise I doubt my slow server would be
able to schedule 24k times per second and still provide decent performance.
However, those useless context switches are certainly wasting quite some CPU,
so I'd like to fix this.
There's a huge number of futex(FUTEX_WAKE) calls in the strace. Python calls
futex_wake even when you simply invoke it with `python` from the shell. It
makes no sense to call the futex syscall before the first pthread_create,
so this is certainly suboptimal, but I'm unsure whether it's related to the
context switches: a flood of futex calls would just waste tons of CPU
entering and exiting the kernel, without necessarily switching tasks. The
point of futex is exactly to avoid entering the kernel in the fast path, so
this code makes no sense.
I could reproduce the excessive context switches even with LinuxThreads
instead of NPTL, so this is probably not an NPTL bug.
It's also probably not a psycopg2 bug because psycopg has no notion of threading.
This is most likely a Python or Twisted thread-pool bug. I'm not sure about
the best way to track it down, especially because so far I can reproduce it
only on the server, which is a bit slower and which I cannot use for debugging.
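
One way to check this away from the server might be a small standalone test
with no Twisted or psycopg2 involved. The following is only a sketch under
my own assumptions: it is written for a current Python 3 on a Unix system,
the names `context_switches`, `ping_pong` and `measure` are made up for
illustration, and a plain `queue.Queue` handoff only approximates what a
thread pool does. It reads the process-wide counters from
`resource.getrusage` before and after bouncing a token between two threads.

```python
import queue
import resource
import threading


def context_switches():
    """Return (voluntary, involuntary) context switches of this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def ping_pong(rounds):
    """Hand a token back and forth between the main thread and a worker."""
    requests, replies = queue.Queue(), queue.Queue()

    def worker():
        for _ in range(rounds):
            replies.put(requests.get())

    thread = threading.Thread(target=worker)
    thread.start()
    for i in range(rounds):
        requests.put(i)   # wake the worker
        replies.get()     # block until it answers
    thread.join()


def measure(rounds=1000):
    """Return the (voluntary, involuntary) switches caused by ping_pong."""
    before = context_switches()
    ping_pong(rounds)
    after = context_switches()
    return after[0] - before[0], after[1] - before[1]


if __name__ == "__main__":
    rounds = 1000
    voluntary, involuntary = measure(rounds)
    print(f"{rounds} handoffs: {voluntary} voluntary, "
          f"{involuntary} involuntary context switches")
```

If the number of switches per handoff comes out much larger than a handful,
that would point at the interpreter's thread handoff rather than at the
application code, but the counts will vary between machines and runs.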
Can people run `vmstat 1` on their systems, reload some complex page, and
see whether they also get a flood of context switches for no good reason?
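
To make it easier to spot the flood in a long capture, a small filter along
these lines could help. It is a sketch only: `flood_samples` and the 5000
switches/second threshold are my own arbitrary choices, and it assumes the
usual `vmstat` column order with `cs`, `us` and `sy` as the 12th, 13th and
14th fields.

```python
def flood_samples(vmstat_text, threshold=5000):
    """Return (cs, us, sy) for each vmstat sample whose cs exceeds threshold."""
    hits = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a numeric sample row.
        if len(fields) < 16 or not all(f.isdigit() for f in fields[:16]):
            continue
        cs, us, sy = int(fields[11]), int(fields[12]), int(fields[13])
        if cs > threshold:
            hits.append((cs, us, sy))
    return hits


# The five samples quoted at the top of this message.
SAMPLE = """\
0 0 0 18772 614380 227500 0 0 0 20 1037 55 0 0 99 1
0 0 0 18772 614380 227500 0 0 0 0 1035 48 0 0 100 0
1 0 0 18772 614380 227500 0 0 0 0 1032 24613 17 31 53 0
0 0 0 18764 614380 227500 0 0 0 0 1086 6766 20 6 75 0
0 0 0 18764 614380 227500 0 0 0 0 1133 64 1 0 99 0
"""

print(flood_samples(SAMPLE))  # [(24613, 17, 31), (6766, 20, 6)]
```

Saving the `vmstat 1` output to a file while reloading the page and passing
its contents to the function would give the same kind of list.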
I wonder if perhaps a select call in the thread is buggy and keeps
yielding the CPU to the other tasks for some reason. This doesn't look like
a starvation issue; it looks like some thread keeps looping and retrying
instead of waiting. So this is noticeable only as a slowdown; there is no
actual malfunction.
If you can reproduce it, let me know; otherwise I'll try to debug it in the
next few days (it's not really urgent since, as said, it's only a performance
issue, and if the system is under load the caching patches make it fast
anyway). OTOH, look at the stats: 31% of the CPU is spent in system time,
and that is surely the context switches, so the page would be rendered
at least twice as fast if this were fixed. Only 17% of the CPU was spent
in userland rendering the page.
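
The arithmetic behind that estimate can be written out as below. This is a
rough upper bound rather than a measurement: it assumes the page is CPU
bound, that all of the system time in the peak sample is context-switch
overhead, and that the userland time would stay the same. `max_speedup` is
just a name I picked for the illustration.

```python
def max_speedup(user_pct, system_pct):
    """Upper bound on the speedup if all system time disappeared."""
    busy = user_pct + system_pct   # CPU actually consumed during the sample
    return busy / user_pct         # same work done with user time only


speedup = max_speedup(17, 31)      # us and sy from the 24613 cs/s sample
print(f"{speedup:.2f}x")           # 48 / 17, prints 2.82x
```

Since `vmstat` reports whole-machine averages over one second, the real gain
for a single page could be smaller than this bound.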