[Python-Dev] Reworking the GIL

Sun Oct 25 21:22:20 CET 2009

Hello there,

The last couple of days I've been working on an experimental rewrite of
the GIL. Since the work has been turning out rather successful (or, at
least, not totally useless and crashing!) I thought I'd announce it
here.

First I want to stress this is not about removing the GIL. There still
is a Global Interpreter Lock which serializes access to most parts of
the interpreter. These protected parts haven't changed either, so Python
doesn't become really better at extracting computational parallelism out
of several cores.

Goals
-----

The new GIL (which is also the name of the sandbox area I've committed
it in, "newgil") addresses the following issues :

1) Switching by opcode counting. Counting opcodes is a very crude way of
estimating times, since the time spent executing a single opcode can
very wildly. Litterally, an opcode can be as short as a handful of
nanoseconds (think something like "... is not None") or as long as a
fraction of second, or even longer (think calling a heavy non-GIL
releasing C function, such as re.search()). Therefore, releasing the GIL
every 100 opcodes, regardless of their length, is a very poor policy.

The new GIL does away with this by ditching _Py_Ticker entirely and
instead using a fixed interval (by default 5 milliseconds, but settable)
after which we ask the main thread to release the GIL and let another
thread be scheduled.

2) GIL overhead and efficiency in contended situations. Apparently, some
OSes (OS X mainly) have problems with lock performance when the lock is
already taken: the system calls are heavy. This is the "Dave Beazley
effect", where he took a very trivial loop, therefore made of very short
opcodes and therefore releasing the GIL very often (probably 100000
times a second), and runs it in one or two threads on an OS with poor
lock performance (OS X). He sees a 50% increase in runtime when using
two threads rather than one, in what is admittedly a pathological case.

Even on better platforms such as Linux, eliminating the overhead of many
GIL acquires and releases (since the new GIL is released on a fixed time
basis rather than on an opcode counting basis) yields slightly better
performance (read: a smaller performance degradation :-)) when there are
several pure Python computation threads running.

3) Thread switching latency. The traditional scheme merely releases the
GIL for a couple of CPU cycles, and reacquires it immediately.
Unfortunately, this doesn't mean the OS will automatically switch to
another, GIL-awaiting thread. In many situations, the same thread will
continue running. This, with the opcode counting scheme, is the reason
why some people have been complaining about latency problems when an I/O
thread competes with a computational thread (the I/O thread wouldn't be
scheduled right away when e.g. a packet arrives; or rather, it would be
scheduled by the OS, but unscheduled immediately when trying to acquire
the GIL, and it would be scheduled again only much later).

The new GIL improves on this by combinating two mechanisms:
- forced thread switching, which means that when the switching interval
is terminated (mentioned in 1) and the GIL is released, we will force
any of the threads waiting on the GIL to be scheduled instead of the
formerly GIL-holding thread. Which thread exactly is an OS decision,
however: the goal here is not to have our own scheduler (this could be
discussed but I wanted the design to remain simple :-) After all,
man-years of work have been invested in scheduling algorithms by kernel
programming teams).
- priority requests, which is an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (rather than any other
threads). This is meant to be used by GIL-releasing methods such as
read() on files and sockets. The scheme, again, is very simple: when a
priority request is done by a thread, the GIL is released as soon as
possible by the thread holding it (including in the eval loop), and then
the thread making the priority request is forcibly scheduled (by making
all other GIL-awaiting threads wait in the meantime).

Implementation
--------------

The new GIL is implemented using a couple of mutexes and condition
variables. A {mutex, condition} pair is used to protect the GIL itself,
which is a mere variable named `gil_locked` (there are a couple of other
variables for bookkeeping). Another {mutex, condition} pair is used for
forced thread switching (described above). Finally, a separate mutex is
used for priority requests (described above).

The code is in the sandbox:
http://svn.python.org/view/sandbox/trunk/newgil/

The file of interest is Python/ceval_gil.h. Changes in other files are
very minimal, except for priority requests which have been added at
strategic places (some methods of I/O modules). Also, the code remains
rather short, while of course being less trivial than the old one.

NB : this is a branch of py3k. There should be no real difficulty
porting it back to trunk, provided someone wants to do the job.

Platforms
---------

I've implemented the new GIL for POSIX and Windows (tested under Linux
and Windows XP (running in a VM)). Judging by what I can read in the
online MSDN docs, the Windows support should include everything from
Windows 2000, and probably recent versions of Windows CE.

Other platforms aren't implemented, because I don't have access to the
necessary hardware. Besides, I must admit I'm not very motivated in
working on niche/obsolete systems. I've e-mailed Andrew MacIntyre in
private to ask him if he'd like to do the OS/2 support.

Supporting a new platform is not very difficult: it's a matter of
writing the 50-or-so lines of necessary platform-specific macros at the
beginning of Python/ceval_gil.h.

The reason I couldn't use the existing thread support
(Python/thread_*.h) is that these abstractions are too poor. Mainly,
they don't provide:
- events, conditions or an equivalent thereof
- the ability to acquire a resource with a timeout

Measurements
------------

Before starting this work, I wrote ccbench (*), a little benchmark
script ("ccbench" being a shorthand for "concurrency benchmark") which
measures two things:
- computation throughput with one or several concurrent threads
- latency to external events (I use an UDP socket) when there is zero,
one, or several background computation threads running

(*) http://svn.python.org/view/sandbox/trunk/ccbench/

The benchmark involves several computation workloads with different GIL
characteristics. By default there are 3 of them:
A- one pure Python workload (computation of a number of digits of pi):
that is, something which spends its time in the eval loop
B- one mostly C workload where the C implementation doesn't release the
GIL (regular expression matching)
C- one mostly C workload where the implementation does release the GIL
(bz2 compression)

In the ccbench directory you will find benchmark results, under Linux,
for two different systems I have here. The new GIL shows roughly similar
but slightly better throughput results than the old one. And it is much
better in the latency tests, especially in workload B (going down from
almost a second of average latency with the old GIL, to a couple of
milliseconds with the new GIL). This is the combined result of using a
time-based scheme (rather than opcode-based) and of forced thread
switching (rather than relying on the OS to actually switch threads when
we speculatively release the GIL).

As a sidenote, I might mention that single-threaded performance is not
degraded at all. It is, actually, theoretically a bit better because the
old ticker check in the eval loop becomes simpler; however, this goes
mostly unnoticed.

Now what remains to be done?

Having other people test it would be fine. Even better if you have an
actual multi-threaded py3k application. But ccbench results for other
OSes would be nice too :-)
(I get good results under the Windows XP VM but I feel that a VM is not
an ideal setup for a concurrency benchmark)

Of course, studying and reviewing the code is welcome. As for
integrating it into the mainline py3k branch, I guess we have to answer
these questions:
- is the approach interesting? (we could decide that it's just not worth
it, and that a good GIL can only be a dead (removed) GIL)
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows
support should cover most bases, but the fate of OS/2 support depends on
Andrew)?

Regards

Antoine.