[Python-Dev] Fixing the GIL (with a BFS scheduler)

Sun May 16 22:07:06 CEST 2010

Hi all,

Here is a second (last?) attempt at getting traction on fixing the GIL (is
it broken?) with BFS (WTF?).
So don't be shy (don't be too rude either) since ignoring counts as down
voting.

Relevant Python issue: http://bugs.python.org/issue7946

*Bottom line first*

I submitted an implementation of BFS (
http://ck.kolivas.org/patches/bfs/sched-BFS.txt) as a patch to the GIL,
which to the extent I have tested it, behaves nicely on Windows XP, Windows
7, GNU/Linux with either CFS or O(1) schedulers, 1/2/4 cores, laptop,
desktop and VirtualBox VM guest (some data below).

The patch is still work in progress and requires work in terms of style,
moving code where it belongs, test code, etc... nevertheless, Python core
developers recommended I already (re)post to python-dev for discussion.

*So is the GIL broken?*

There seems to be some disagreement on that question among Python core
developers (unless you all agree it is not broken :) ). Some developers
maintain the effects described by David Beazley do not affect real world
systems. Even I took the role of a devil's advocate in a previous
discussion, but in fact I think that Python, being a general purpose
language, is similar to the OS in that regard. It is used across many
application domains, platforms, and development paradigms, just as OS
schedulers
are, and therefore accepting thread scheduling with such properties as a
fact of life is not a good idea.

I was first bitten by the original GIL last year while testing a system, and
found David's research while looking for answers, and later had to work
around that problem in another system. Here are other real world cases:

1) Zope people hit this back in 2002 and documented the problem with
interesting insight:
http://www.zope.org/Members/glpb/solaris/multiproc
"I have directly observed a 30% penalty under MP constraints when the
sys.setcheckinterval value was too low (and there was too much GIL
thrashing)."

http://www.zope.org/Members/glpb/solaris/report_ps
"A machine that's going full-throttle isn't as bad, curiously enough --
because the other CPU's are busy doing real work, the GIL doesn't have as
much opportunity to get shuffled between CPUs.  On a MP box it's very
important to set sys.setcheckinterval() up to a fairly large number, I
recommend pystones / 50 or so."

2) Python mailing list - 2005
http://mail.python.org/pipermail/python-list/2005-August/336286.html
"The app suffers from serious performance degradation (compared to
pure c/C++) and high context switches that I suspect the GIL unlocking
may be aggravating ?"

3) Python mailing list - 2008
http://mail.python.org/pipermail/python-list/2008-June/1143217.html
"When I start the server, it sometimes eats up 100% of the CPU for a good
minute or so... though none of the  threads are CPU-intensive"

4) Twisted
http://twistedmatrix.com/pipermail/twisted-python/2005-July/011048.html
"When I run a CPU intensive method via threads.deferToThread it takes all
the CPU away and renders the twisted process unresponsive."

Admittedly, it is not easy to dig reports up in Google.

Finally, I think David explained the relevance of this problem quite nicely:
http://mail.python.org/pipermail/python-dev/2010-March/098416.html

*What about the new GIL?*

There is no real world experience with the new GIL since it is under
development. What we have is David's analysis and a few benchmarks from the
bug report.

*Evolving the GIL into a scheduler*

The problem addressed by the GIL has always been *scheduling* threads to the
interpreter, not just controlling access to it. The patches by Antoine and
David essentially evolve the GIL into a scheduler, however both cause thread
starvation or high rate of context switching in some scenarios (see data
below).

*BFS*

Enter BFS, a new scheduler designed by Con Kolivas, a Linux kernel hacker
who is an expert in this field:
http://ck.kolivas.org/patches/bfs/sched-BFS.txt
"The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is
to completely do away with the complex designs of the past for the cpu
process scheduler and instead implement one that is very simple in basic
design. The main focus of BFS is to achieve excellent desktop interactivity
and responsiveness without heuristics and tuning knobs that are difficult
to understand, impossible to model and predict the effect of, and when tuned
to one workload cause massive detriment to another."

I submitted an implementation of BFS (bfs.patch) which on my machines gives
comparable performance to gilinter2.patch (Antoine's) and seems to schedule
threads more fairly, predictably, and with lower rate of context switching
(see data below).

There are however, some issues in bfs.patch:

1) It works on top of the OS scheduler, which means (for all GIL patches!):
    a) It does not control and is not aware of information such as OS thread
preemption, CPU core to run on, etc...
    b) There may be hard to predict interaction between BFS and the
underlying OS scheduler, which needs to be tested on each target platform.

2) It works best when TSC (http://en.wikipedia.org/wiki/Time_Stamp_Counter)
is available and otherwise falls back to gettimeofday(). I expect the
scheduler to misbehave to some degree or affect performance when TSC is not
available and either of the following is true:
    a) if gettimeofday() is very expensive to read (impacts release/acquire
overhead).
    b) if gettimeofday() has very low precision ~10ms.

By design of BFS, once CPU load crosses a given threshold (about 8 CPU bound
tasks which need the CPU at once), the scheduler falls back to FIFO behavior
and latency goes up sharply.

I have no data on how bfs.patch behaves on ARM, AMD, old CPU models, OSX,
FreeBSD, Solaris, or mobile. The patch may require some tuning to work
properly on those systems, so data is welcome (make sure TSC code in
Include/cycle.h works on those systems before benching).

All that said, to the extent I have tested it, bfs.patch behaves nicely on
Windows XP, Windows 7, GNU/Linux with either CFS or O(1) schedulers, 1/2/4
cores, laptop, desktop and VirtualBox VM guest.

*Data*

Comparison of proposed patches running ccbench on Windows XP:
http://bugs.python.org/issue7946#msg104899

Comparison of proposed patches running Florent's write.py test on Ubuntu
Karmic:
http://bugs.python.org/issue7946#msg105687

Comparison of old GIL, new GIL and BFS running ccbench on Ubuntu Karmic:
http://bugs.python.org/issue7946#msg105874

Last comparison includes a run of old GIL with sys.setcheckinterval(2500) as
Zope people do. IO latency shoots up to ~1000ms as result.

*What can be done with it?*

Here are some options:
1) Abandon it - no one is interested, yawn.
 2) Take ideas and workarounds from its code and apply to other patches.
3) Include it in the interpreter as an auxiliary (turn on with a runtime
switch) scheduler.
4) Adopt it as the Python scheduler.

*Opinion?*

Your opinion is needed (however, please submit code review comments which
are not likely to interest other people, e.g. "why did you use volatile for
X?", at the issue page: http://bugs.python.org/issue7946).

Thanks,
Nir
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20100516/845a35b2/attachment.html>