
Hello there,
The last couple of days I've been working on an experimental rewrite of the GIL. Since the work has been turning out rather successful (or, at least, not totally useless and crashing!) I thought I'd announce it here.
First I want to stress this is not about removing the GIL. There still is a Global Interpreter Lock which serializes access to most parts of the interpreter. These protected parts haven't changed either, so Python doesn't become really better at extracting computational parallelism out of several cores.
Goals
-----
The new GIL (which is also the name of the sandbox area I've committed it in, "newgil") addresses the following issues:
1) Switching by opcode counting. Counting opcodes is a very crude way of estimating times, since the time spent executing a single opcode can vary wildly. Literally, an opcode can be as short as a handful of nanoseconds (think something like "... is not None") or as long as a fraction of a second, or even longer (think calling a heavy non-GIL releasing C function, such as re.search()). Therefore, releasing the GIL every 100 opcodes, regardless of their length, is a very poor policy.
The new GIL does away with this by ditching _Py_Ticker entirely and instead using a fixed interval (by default 5 milliseconds, but settable) after which we ask the GIL-holding thread to release the GIL and let another thread be scheduled.
2) GIL overhead and efficiency in contended situations. Apparently, some OSes (OS X mainly) have problems with lock performance when the lock is already taken: the system calls are heavy. This is the "Dave Beazley effect", where he took a very trivial loop, therefore made of very short opcodes and therefore releasing the GIL very often (probably 100000 times a second), and ran it in one or two threads on an OS with poor lock performance (OS X). He saw a 50% increase in runtime when using two threads rather than one, in what is admittedly a pathological case.
Even on better platforms such as Linux, eliminating the overhead of many GIL acquires and releases (since the new GIL is released on a fixed time basis rather than on an opcode counting basis) yields slightly better performance (read: a smaller performance degradation :-)) when there are several pure Python computation threads running.
3) Thread switching latency. The traditional scheme merely releases the GIL for a couple of CPU cycles, and reacquires it immediately. Unfortunately, this doesn't mean the OS will automatically switch to another, GIL-awaiting thread. In many situations, the same thread will continue running. This, with the opcode counting scheme, is the reason why some people have been complaining about latency problems when an I/O thread competes with a computational thread (the I/O thread wouldn't be scheduled right away when e.g. a packet arrives; or rather, it would be scheduled by the OS, but unscheduled immediately when trying to acquire the GIL, and it would be scheduled again only much later).
The new GIL improves on this by combining two mechanisms:
- forced thread switching, which means that when the switching interval expires (mentioned in 1) and the GIL is released, we force one of the threads waiting on the GIL to be scheduled instead of the formerly GIL-holding thread. Which thread exactly is an OS decision, however: the goal here is not to have our own scheduler (this could be discussed, but I wanted the design to remain simple :-) After all, man-years of work have been invested in scheduling algorithms by kernel programming teams).
- priority requests, an option for a thread requesting the GIL to be scheduled as soon as possible, and forcibly (ahead of any other threads). This is meant to be used by GIL-releasing methods such as read() on files and sockets. The scheme, again, is very simple: when a priority request is made by a thread, the GIL is released as soon as possible by the thread holding it (including in the eval loop), and then the thread making the priority request is forcibly scheduled (by making all other GIL-awaiting threads wait in the meantime).
Implementation
--------------
The new GIL is implemented using a couple of mutexes and condition variables. A {mutex, condition} pair is used to protect the GIL itself, which is a mere variable named `gil_locked` (there are a couple of other variables for bookkeeping). Another {mutex, condition} pair is used for forced thread switching (described above). Finally, a separate mutex is used for priority requests (described above).
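To give an idea of the general shape, here is a heavily simplified POSIX sketch of the pattern (illustrative only -- this is not the actual ceval_gil.h code, which has more bookkeeping, a separate {mutex, condition} pair for forced switching and a Windows version; the names below merely mirror the description above):

    #include <errno.h>
    #include <pthread.h>
    #include <time.h>

    #define SWITCH_INTERVAL_NS (5 * 1000 * 1000)    /* default 5 ms interval */

    static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gil_cond  = PTHREAD_COND_INITIALIZER;
    static int gil_locked = 0;                      /* the GIL itself */
    static volatile int gil_drop_request = 0;       /* set by a waiting thread */

    static void take_gil(void)
    {
        pthread_mutex_lock(&gil_mutex);
        while (gil_locked) {
            struct timespec deadline;
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_nsec += SWITCH_INTERVAL_NS;
            if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec += 1;
                deadline.tv_nsec -= 1000000000L;
            }
            /* Wait for the holder to release the GIL; if the interval expires
               first, ask the holder to drop it at its next eval loop check. */
            if (pthread_cond_timedwait(&gil_cond, &gil_mutex, &deadline) == ETIMEDOUT
                && gil_locked)
                gil_drop_request = 1;
        }
        gil_locked = 1;
        gil_drop_request = 0;
        pthread_mutex_unlock(&gil_mutex);
    }

    static void drop_gil(void)
    {
        pthread_mutex_lock(&gil_mutex);
        gil_locked = 0;
        pthread_cond_signal(&gil_cond);             /* wake one waiting thread */
        pthread_mutex_unlock(&gil_mutex);
        /* The real code additionally waits, on the second {mutex, condition}
           pair, until another thread has actually taken the GIL ("forced
           switching"), so the former holder cannot immediately re-acquire it. */
    }

A priority request would add a third path on top of this, where the requesting thread sets a flag that makes the holder release the GIL as soon as possible while the other waiters stand back.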
The code is in the sandbox: http://svn.python.org/view/sandbox/trunk/newgil/
The file of interest is Python/ceval_gil.h. Changes in other files are very minimal, except for priority requests which have been added at strategic places (some methods of I/O modules). Also, the code remains rather short, while of course being less trivial than the old one.
NB: this is a branch of py3k. There should be no real difficulty porting it back to trunk, provided someone wants to do the job.
Platforms
---------
I've implemented the new GIL for POSIX and Windows (tested under Linux and Windows XP (running in a VM)). Judging by what I can read in the online MSDN docs, the Windows support should include everything from Windows 2000, and probably recent versions of Windows CE.
Other platforms aren't implemented, because I don't have access to the necessary hardware. Besides, I must admit I'm not very motivated to work on niche/obsolete systems. I've e-mailed Andrew MacIntyre in private to ask him if he'd like to do the OS/2 support.
Supporting a new platform is not very difficult: it's a matter of writing the 50-or-so lines of necessary platform-specific macros at the beginning of Python/ceval_gil.h.
The reason I couldn't use the existing thread support (Python/thread_*.h) is that these abstractions are too poor. Mainly, they don't provide:
- events, conditions or an equivalent thereof
- the ability to acquire a resource with a timeout
Measurements
------------
Before starting this work, I wrote ccbench (*), a little benchmark script ("ccbench" being a shorthand for "concurrency benchmark") which measures two things:
- computation throughput with one or several concurrent threads
- latency to external events (I use a UDP socket) when there is zero, one, or several background computation threads running
(*) http://svn.python.org/view/sandbox/trunk/ccbench/
The benchmark involves several computation workloads with different GIL characteristics. By default there are 3 of them:
A- one pure Python workload (computation of a number of digits of pi): that is, something which spends its time in the eval loop
B- one mostly C workload where the C implementation doesn't release the GIL (regular expression matching)
C- one mostly C workload where the implementation does release the GIL (bz2 compression)
In the ccbench directory you will find benchmark results, under Linux, for two different systems I have here. The new GIL shows roughly similar but slightly better throughput results than the old one. And it is much better in the latency tests, especially in workload B (going down from almost a second of average latency with the old GIL, to a couple of milliseconds with the new GIL). This is the combined result of using a time-based scheme (rather than opcode-based) and of forced thread switching (rather than relying on the OS to actually switch threads when we speculatively release the GIL).
As a sidenote, I might mention that single-threaded performance is not degraded at all. It is, actually, theoretically a bit better because the old ticker check in the eval loop becomes simpler; however, this goes mostly unnoticed.
Now what remains to be done?
Having other people test it would be fine. Even better if you have an actual multi-threaded py3k application. But ccbench results for other OSes would be nice too :-) (I get good results under the Windows XP VM but I feel that a VM is not an ideal setup for a concurrency benchmark)
Of course, studying and reviewing the code is welcome. As for integrating it into the mainline py3k branch, I guess we have to answer these questions:
- is the approach interesting? (we could decide that it's just not worth it, and that a good GIL can only be a dead (removed) GIL)
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows support should cover most bases, but the fate of OS/2 support depends on Andrew)?
Regards
Antoine.

[SNIP - a lot of detail on what sounds like a good design]
Now what remains to be done?
Having other people test it would be fine. Even better if you have an actual multi-threaded py3k application. But ccbench results for other OSes would be nice too :-) (I get good results under the Windows XP VM but I feel that a VM is not an ideal setup for a concurrency benchmark)
Of course, studying and reviewing the code is welcome. As for integrating it into the mainline py3k branch, I guess we have to answer these questions:
- is the approach interesting? (we could decide that it's just not worth it, and that a good GIL can only be a dead (removed) GIL)
I think it's worth it. Removal of the GIL is a totally open-ended problem with no solution in sight. This, on the other hand, is a performance benefit now. I say move forward with this. If it happens to be short-lived because someone actually figures out how to remove the GIL then great, but is that really going to happen between now and Python 3.2? I doubt it.
- is the patch good, mature and debugged enough?
- how do we deal with the unsupported platforms (POSIX and Windows support should cover most bases, but the fate of OS/2 support depends on Andrew)?
It's up to Andrew to get the support in. While I have faith he will, this is why we have been scaling back the support for alternative OSs for a while and will continue to do so. I suspect the day Andrew stops keeping up will be the day we push to have OS/2 be externally maintained.
-Brett

Brett Cannon wrote:
It's up to Andrew to get the support in. While I have faith he will, this is why we have been scaling back the support for alternative OSs for a while and will continue to do so. I suspect the day Andrew stops keeping up will be the day we push to have OS/2 be externally maintained.
Notwithstanding my desire to keep OS/2 supported in the Python tree, keeping up has been more difficult of late:
- OS/2 is unquestionably a "legacy" environment, with system APIs different in flavour and semantics from the current mainstream (though surprisingly capable in many ways despite its age).
- The EMX runtime my OS/2 port currently relies on to abstract the system API to a Posix-ish API is itself a legacy package, essentially unmaintained for some years :-( This has been a source of increasing pain as Python has moved with the mainstream... with regard to Unicode support and threads in conjunction with multi-processing, in particular.
Real Life hasn't been favourably disposed either...
I have refrained from applying the extensive patches required to make the port feature complete for 2.6 and later while I investigate an alternate Posix emulating runtime (derived from FreeBSD's C library, and which is used by Mozilla on OS/2), which would allow me to dispense with most of these patches. But it has an issue or two of its own...
The cost in effort has been compounded by effectively having to try and maintain two ports - 2.x and 3.x. And the 3.x port has suffered more as its demands are higher.
So while I asked to keep the OS/2 thread support alive, if a decision were to be taken to remove OS/2 support from the Python 3.x sources I could live with that. A completed migration to Mercurial might well make future port maintenance easier for me.
Regards, Andrew.

Hello again,
Brett Cannon <brett <at> python.org> writes:
I think it's worth it. Removal of the GIL is a totally open-ended problem with no solution in sight. This, on the other hand, is a performance benefit now. I say move forward with this. If it happens to be short-lived because someone actually figures out how to remove the GIL then great, but is that really going to happen between now and Python 3.2? I doubt it.
Based on this whole discussion, I think I am going to merge the new GIL work into the py3k branch, with priority requests disabled.
If you think this is premature or uncalled for, or if you just want to review the changes before making a judgement, please voice up :)
Regards
Antoine.

On Sun, Nov 1, 2009 at 03:33, Antoine Pitrou solipsis@pitrou.net wrote:
Hello again,
Brett Cannon <brett <at> python.org> writes:
I think it's worth it. Removal of the GIL is a totally open-ended problem with no solution in sight. This, on the other hand, is a performance benefit now. I say move forward with this. If it happens to be short-lived because someone actually figures out how to remove the GIL then great, but is that really going to happen between now and Python 3.2? I doubt it.
Based on this whole discussion, I think I am going to merge the new GIL work into the py3k branch, with priority requests disabled.
This will be a nice Py3K carrot!
If you think this is premature or uncalled for, or if you just want to review the changes before making a judgement, please voice up :)
I know I personally trust you to not mess it up, Antoine, but that might also come from mental exhaustion and laziness. =)
-Brett

Antoine Pitrou wrote:
Based on this whole discussion, I think I am going to merge the new GIL work into the py3k branch, with priority requests disabled.
If you think this is premature or uncalled for, or if you just want to review the changes before making a judgement, please voice up :)
+1 from me. I trust you like Brett does.
How much work would it cost to make your patch optional at compile time? For what it's worth we could compare your work on different machines and on different platforms before it gets enabled by default. Can you imagine scenarios where your implementation might be slower than the current GIL implementation?

Christian Heimes <lists <at> cheimes.de> writes:
+1 from me. I trust you like Brett does.
How much work would it cost to make your patch optional at compile time?
Quite a bit, because it changes the logic for processing asynchronous pending calls (signals) and asynchronous exceptions in the eval loop. The #defines would get quite convoluted, I think; I'd prefer not to do that.
For what it's worth we could compare your work on different machines and on different platforms before it gets enabled by default. Can you imagine scenarios where your implementation might be slower than the current GIL implementation?
I don't really think so. The GIL is taken and released much more predictably than it was before. The thing that might be worth checking is a workload with many threads (say 50 or 100). Does anyone have that?
Regards
Antoine.

Antoine Pitrou wrote:
Christian Heimes <lists <at> cheimes.de> writes:
+1 from me. I trust you like Brett does.
How much work would it cost to make your patch optional at compile time?
Quite a bit, because it changes the logic for processing asynchronous pending calls (signals) and asynchronous exceptions in the eval loop. The #defines would get quite convoluted, I think; I'd prefer not to do that.
Based on the new piece of information I totally agree.
I don't really think so. The GIL is taken and released much more predictably than it was before. The thing that might be worth checking is a workload with many threads (say 50 or 100). Does anyone have that?
I don't have an application that works on Python 3 and uses that many threads, sorry.
Christian

On Sun, Nov 1, 2009 at 3:33 AM, Antoine Pitrou solipsis@pitrou.net wrote:
Hello again,
Brett Cannon <brett <at> python.org> writes:
I think it's worth it. Removal of the GIL is a totally open-ended problem with no solution in sight. This, on the other hand, is a performance benefit now. I say move forward with this. If it happens to be short-lived because someone actually figures out how to remove the GIL then great, but is that really going to happen between now and Python 3.2? I doubt it.
Based on this whole discussion, I think I am going to merge the new GIL work into the py3k branch, with priority requests disabled.
If you think this is premature or uncalled for, or if you just want to review the changes before making a judgement, please voice up :)
+1 Good idea. That's the best way to make sure this work gets anywhere. It can be iterated on from there if anyone has objections.

Hi Antoine, I was finally able to compile py3k and run the benchmark (my compilation issue was caused by checking out on Windows and compiling on Unix. Some Makefile templates are missing correct EOL properties in SVN I think).
The benchmark results can be obtained from: http://gaiacrtn.free.fr/py/benchmark-newgil/benchmark-newgil.tar.bz2 and viewed from: http://gaiacrtn.free.fr/py/benchmark-newgil/
I ran the benchmark on two platforms:
- Solaris X86, 16 cores: some Python extensions are likely missing (see config.log)
- Windows XP SP3, 4 cores: all Python extensions but TCL (I didn't bother checking why it failed as it is not used in the benchmark). It is a release build.
The results look promising, but I'll let you share your conclusions (some latency results seem a bit strange from my understanding).
Side-note: PCBuild requires nasmw.exe but it no longer exists in the latest version. I had to rename nasm.exe to nasmw.exe. Would be nice to add this to the readme to avoid confusion...
Baptiste.
2009/11/1 Antoine Pitrou solipsis@pitrou.net
Hello again,
Brett Cannon <brett <at> python.org> writes:
I think it's worth it. Removal of the GIL is a totally open-ended problem with no solution in sight. This, on the other hand, is a performance benefit now. I say move forward with this. If it happens to be short-lived because someone actually figures out how to remove the GIL then great, but is that really going to happen between now and Python 3.2? I doubt it.
Based on this whole discussion, I think I am going to merge the new GIL work into the py3k branch, with priority requests disabled.

Hello,
- Solaris X86, 16 cores: some Python extensions are likely missing (see config.log)
- Windows XP SP3, 4 cores: all Python extensions but TCL (I didn't bother checking why it failed as it is not used in the benchmark). It is a release build.
The results look promising, but I'll let you share your conclusions (some latency results seem a bit strange from my understanding).
Thank you! The latency results don't look that strange to me.
If you're surprised that py3k shows better latency than newgil on the "pi calculation" workload, it's easy to understand why: py3k speculatively releases the GIL so often on that workload (every 100 opcodes) that the latencies are indeed very good, but if you look at the corresponding throughput numbers they are dismal (your Solaris box shows it falling to less than 20% with two threads running compared to the baseline number for single-thread execution, and on your Windows box the number is hardly better at 45%).
So, to sum it up, the way the current GIL manages to have good latencies is by issuing an unreasonable number of system calls on a contended lock, and potentially killing throughput performance (this depends on the OS too, because numbers under Linux are not so catastrophic).
The new GIL, on the other hand, is much more balanced in that it achieves rather predictable latencies (especially when you don't overcommit the OS by issuing more computational threads than you have CPU cores) while preserving throughput performance.
Side-note: PCBuild requires nasmw.exe but it no longer exists in the latest version. I had to rename nasm.exe to nasmw.exe. Would be nice to add this to the readme to avoid confusion...
You should file a bug on http://bugs.python.org
Regards
Antoine.

Antoine,
How close are you to merging this into the Py3k branch? It looks like a solid piece of work, that can only get better in the period between now and the release of 3.2. But I don't want to rush you, and I only have had a brief look at your code. (I whipped up a small Dave Beazley example and was impressed by the performance of your code compared to the original py3k branch -- from 150% to 100% CPU usage according to top with only 2 threads.)
My only suggestion so far: the description could use more explicit documentation on the various variables and macros and how they combine.
I also expect that priority requests aren't important; it somehow seems strange that if multiple threads are all doing I/O each new thread whose I/O completes would get to preempt whoever else is active immediately. (Also the choice of *not* making a priority request when a read returns no bytes seems strange -- if I read the code correctly.)
Anyway, thanks for this work!

Hello Guido,
How close are you to merging this into the Py3k branch? It looks like a solid piece of work, that can only get better in the period between now and the release of 3.2. But I don't want to rush you, and I only have had a brief look at your code.
The code is ready. Priority requests are already disabled, I'm just wondering whether to remove them from the code, or leave them there in case someone thinks they're useful. I suppose removing them is ok.
My only suggestion so far: the description could use more explicit documentation on the various variables and macros and how they combine.
Is it before or after http://mail.python.org/pipermail/python-checkins/2009-November/087482.html ?
Regards
Antoine.

On Sat, Nov 7, 2009 at 9:01 AM, Antoine Pitrou solipsis@pitrou.net wrote:
Hello Guido,
How close are you to merging this into the Py3k branch? It looks like a solid piece of work, that can only get better in the period between now and the release of 3.2. But I don't want to rush you, and I only have had a brief look at your code.
The code is ready. Priority requests are already disabled, I'm just wondering whether to remove them from the code, or leave them there in case someone thinks they're useful. I suppose removing them is ok.
I would remove them -- if and when we find there's a need for something like them I suspect it won't look like what you currently have, so it's better not to complexificate your current patch with them. (I would remove all traces, including the conditions passed into the end macros.)
My only suggestion so far: the description could use more explicit documentation on the various variables and macros and how they combine.
Is it before or after http://mail.python.org/pipermail/python-checkins/2009-November/087482.html ?
After. While that is already really helpful, not all the code is easily linked back to paragraphs in that comment block, and some variables are not mentioned by name in the block.

Hello again,
I've now removed priority requests, tried to improve the internal doc a bit, and merged the changes into py3k.
Afterwards, the new Windows 7 buildbot has hung in test_multiprocessing, but I don't know whether it's related.
Regards
Antoine.
Guido van Rossum <guido <at> python.org> writes:
I would remove them -- if and when we find there's a need for something like them I suspect it won't look like what you currently have, so it's better not to complexificate your current patch with them. (I would remove all traces, including the conditions passed into the end macros.)
My only suggestion so far: the description could use more explicit documentation on the various variables and macros and how they combine.
Is it before or after http://mail.python.org/pipermail/python-checkins/2009-November/087482.html ?
After. While that is already really helpful, not all the code is easily linked back to paragraphs in that comment block, and some variables are not mentioned by name in the block.

2009/11/7 Antoine Pitrou solipsis@pitrou.net
[...] So, to sum it up, the way the current GIL manages to have good latencies is by issuing an unreasonable number of system calls on a contended lock, and potentially killing throughput performance (this depends on the OS too, because numbers under Linux are not so catastrophic).
Ah, I remember reading this in the analysis that was published now!
I made another benchmark using one of my scripts I ported to python 3 (and simplified a bit). I only test the total execution time. Tests done on Windows XP SP3. Processor is an Intel Core 2 Quad Q9300 (4 cores).
You can get the script from: http://gaiacrtn.free.fr/py/benchmark-kwd-newgil/purekeyword-py26-3k.py Script + test doc (940KB): http://gaiacrtn.free.fr/py/benchmark-kwd-newgil/benchmark-kwd-newgil.tar.bz2
The threaded loop is:

    for match in self.punctuation_pattern.finditer( document ):
        word = document[last_start_index:match.start()]
        if len(word) > 1 and len(word) < MAX_KEYWORD_LENGTH:
            words[word] = words.get(word, 0) + 1
        last_start_index = match.end()
        if word:
            word_count += 1
Here are the results:
-j0 (main thread only)
2.5.2 : 17.991s, 17.947s, 17.780s
2.6.2 : 19.071s, 19.023s, 19.054s
3.1.1 : 46.384s, 46.321s, 46.425s
newgil: 47.483s, 47.605s, 47.512s

-j4 (4 consumer threads, main thread producing/waiting)
2.5.2 : 31.105s, 30.568s
2.6.2 : 31.550s, 30.599s
3.1.1 : 85.114s, 85.185s
newgil: 48.428s, 49.217s
It shows that, on my platform for this specific benchmark:
- newgil manages to leverage a significant amount of parallelism (1.7) where python 3.1 does not (3.1 is 80% slower)
- newgil has a small impact on non multi-threaded execution (~1-2%) [may be worth investigating]
- 3.1 is more than 2 times slower than python 2.6 on this benchmark
- 2.6 is "only" 65% slower when run with multiple threads compared to the 80% slower of 3.1.
Newgil is a vast improvement as it manages to leverage the short time the GIL is released by finditer [if I understood correctly, in 3.x regexes release the GIL].
What worries me is the single-threaded performance degradation between 2.6 and 3.1 on this test. Could the added GIL release/acquire on each finditer call explain this?

Hello again,
It shows that, on my platform for this specific benchmark:
* newgil manages to leverage a significant amount of parallelism (1.7) where python 3.1 does not (3.1 is 80% slower)
I think you are mistaken:
-j0 (main thread only)
newgil: 47.483s, 47.605s, 47.512s
-j4 (4 consumer threads, main thread producing/waiting)
newgil: 48.428s, 49.217s
The runtimes are actually the same, so newgil doesn't leverage anything. However, it doesn't degrade performance like 2.x/3.1 does :-)
* newgil has a small impact on non multi-threaded execution (~1-2%) [may be worth investigating]
It goes from very slightly slower to very slightly faster and it is likely to be caused by variations in generated output from the compiler.
* 3.1 is more than 2 times slower than python 2.6 on this benchmark
That's the most worrying outcome I'd say. Are you sure the benchmark really does the same thing? Under 2.6, you should add re.UNICODE to the regular expression flags so as to match the 3.x semantics.
[if I understood correctly in 3.x regex release the GIL].
Unless I've missed something it doesn't, no. This could be a separate opportunity for optimization, if someone wants to take a look at it.
Regards
Antoine.

2009/11/7 Antoine Pitrou solipsis@pitrou.net
Hello again,
It shows that, on my platform for this specific benchmark:
* newgil manages to leverage a significant amount of parallelism (1.7) where python 3.1 does not (3.1 is 80% slower)
I think you are mistaken:
-j0 (main thread only)
newgil: 47.483s, 47.605s, 47.512s
-j4 (4 consumer threads, main thread producing/waiting)
newgil: 48.428s, 49.217s
The runtimes are actually the same, so newgil doesn't leverage anything. However, it doesn't degrade performance like 2.x/3.1 does :-)
Ooops, I was comparing to the 3.1 -j4 times, which makes no sense. One would think I wanted to see that result since I thought the GIL was released :/. This greatly reduces the interest of this benchmark...
* 3.1 is more than 2 times slower than python 2.6 on this benchmark
That's the most worrying outcome I'd say. Are you sure the benchmark really does the same thing? Under 2.6, you should add re.UNICODE to the regular expression flags so as to match the 3.x semantics.
I've tried, but there is no change in result (the regexp does not use \w & co but specifies a lot of unicode ranges). All strings are already of unicode type in 2.6.
[if I understood correctly in 3.x regex release the GIL].
Unless I've missed something it doesn't, no.
Hmmm, I was confusing it with other modules (bzip2 & hashlib?). Looking back at the result of your benchmark it's obvious. Is there a place where the list of functions releasing the GIL is available? I did not see anything in the bz2.compress documentation.

Baptiste Lepilleur <baptiste.lepilleur <at> gmail.com> writes:
I've tried, but there is no change in result (the regexp does not use \w & co but specifies a lot of unicode ranges). All strings are already of unicode type in 2.6.
No they aren't. You should add "from __future__ import unicode_literals" at the start of your script and run it again.
Hmmm, I was confusing with other modules (bzip2 & hashlib?). Looking back at the result of your benchmark it's obvious. Is there a place where the list of functions releasing the GIL is available? I did not see anything in bz2.compress documentation.
No, there isn't. You'd have to test, or read the source code. But bz2 and zlib, for example, do release the GIL.
Regards
Antoine.

Antoine Pitrou wrote:
Hello there,
The last couple of days I've been working on an experimental rewrite of the GIL. Since the work has been turning out rather successful (or, at least, not totally useless and crashing!) I thought I'd announce it here.
I am curious as to whether the entire mechanism is or can be turned off when not needed -- when there are no threads (other than the main, starting thread)?
tjr

Terry Reedy <tjreedy <at> udel.edu> writes:
I am curious as to whether the entire mechanism is or can be turned off when not needed -- when there are no threads (other than the main, starting thread)?
It is an implicit feature: when no thread is waiting on the GIL, the GIL-holding thread isn't notified and doesn't try to release it at all (in the eval loop, that is; GIL-releasing C extensions still release it).
Note that "no thread is waiting on the GIL" can mean one of two things: - either there is only one Python thread - or the other Python threads are doing things with the GIL released (zlib/bz2 compression, waiting on I/O, sleep()ing, etc.)
So, yes, it automatically "turns itself off".
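As a sketch of what that fast path looks like (illustrative only, not the actual ceval.c code; the helper names are hypothetical stand-ins):

    extern volatile int gil_drop_request;   /* only ever set by a thread waiting on the GIL */
    extern void drop_gil(void);
    extern void take_gil(void);

    static void eval_loop_periodic_check(void)
    {
        if (gil_drop_request) {
            /* Someone has waited a full switch interval: release the GIL,
               let the OS schedule a waiter (forced switching), then queue
               up again. */
            drop_gil();
            take_gil();
        }
        /* With a single Python thread, or with the other threads blocked in
           GIL-releasing C code, nobody sets gil_drop_request, so this branch
           is never taken and no release/reacquire happens at all. */
    }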
Regards
Antoine.

Antoine Pitrou skrev:
- priority requests, which is an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (rather than any other threads).
So Python threads become preemptive rather than cooperative? That would be great. :-)
time.sleep should generate a priority request to re-acquire the GIL; and so should all other blocking standard library functions with a time-out.
S.M.

Sturla Molden wrote:
time.sleep should generate a priority request to re-acquire the GIL; and so should all other blocking standard library functions with a time-out.
I don't agree. You have to be very careful with priority. time.sleep() does not promise to wake up in any timely manner, and neither do the timeout functions. Rather, the timeout is a way to prevent infinite wait.
In my experience (from stackless python) using priority wakeup for IO can result in very erratic scheduling when there is much IO going on, every IO trumping another. You should stick to round robin except for very special and carefully analysed cases. K

On Oct 26, 2009, at 10:09 AM, Kristján Valur Jónsson wrote:
Sturla Molden wrote:
time.sleep should generate a priority request to re-acquire the GIL; and so should all other blocking standard library functions with a time-out.
I don't agree. You have to be very careful with priority. time.sleep() does not promise to wake up in any timely manner, and neither do the timeout functions. Rather, the timeout is a way to prevent infinite wait.
In my experience (from stackless python) using priority wakeup for IO can result in very erratic scheduling when there is much IO going on, every IO trumping another. You should stick to round robin except for very special and carefully analysed cases.
All the IO tasks can also go in their own round robin so that CPU time is correctly shared among all waiting IO tasks.
IOW, to make sure that all IO tasks get a fair share *in relation to all other IO tasks*.
Tasks can be put into the IO round robin when they "pull the IO alarm" so to speak, so there's no need to decide before-hand which task goes in which round robin pool.
I'm not familiar with this particular code in Python, but I've used this in other systems for years to make sure that IO tasks don't starve the rest of the system and that the most "insistent" IO task doesn't starve all the others.
S

Kristján Valur Jónsson <kristjan <at> ccpgames.com> writes:
In my experience (from stackless python) using priority wakeup for IO can result in very erratic scheduling when there is much IO going on, every IO trumping another.
I whipped up a trivial multithreaded HTTP server using socketserver.ThreadingMixIn and wsgiref, and used apachebench against it with a reasonable concurrency level (10 requests at once). Enabling/disabling priority requests doesn't seem to make a difference.
Regards
Antoine.

Antoine Pitrou skrev:
- priority requests, which is an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (rather than any other threads).
Should a priority request for the GIL take a priority number?
- If two threads make a priority requests for the GIL, the one with the higher priority should get the GIL first.
- If a thread with a low priority make a priority request for the GIL, it should not be allowed to "preempt" (take the GIL away from) a higher-priority thread, in which case the priority request would be ignored.
Related issue: Should Python threads have priorities? They are after all real OS threads.
S.M.

Sturla Molden <sturla <at> molden.no> writes:
Antoine Pitrou skrev:
- priority requests, which is an option for a thread requesting the GIL
to be scheduled as soon as possible, and forcibly (rather than any other threads).
Should a priority request for the GIL take a priority number?
Er, I prefer to keep things simple. If you have lots of I/O you should probably use an event loop rather than separate threads.
Related issue: Should Python threads have priorities? They are after all real OS threads.
Well, precisely they are OS threads, and the OS already assigns them (static or dynamic) priorities. No need to replicate this.
(to answer another notion expressed in another message, there's no "round-robin" scheduling either)
Regards
Antoine.

On Mon, Oct 26, 2009 at 10:58 AM, Antoine Pitrou solipsis@pitrou.net wrote:
Er, I prefer to keep things simple. If you have lots of I/O you should probably use an event loop rather than separate threads.
On Windows, using a single-threaded event loop is sometimes impossible. WaitForMultipleObjects(), which is the Windows equivalent to select() or poll(), can handle a maximum of only 64 objects.
Do we really need priority requests at all? They seem counter to your desire for simplicity and allowing the operating system's scheduler to do its work.
That said, if a thread's time budget is merely paused during I/O rather than reset, then a thread making frequent (but short) I/O requests cannot starve the system.
-- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC http://stutzbachenterprises.com

Daniel Stutzbach <daniel <at> stutzbachenterprises.com> writes:
Do we really need priority requests at all? They seem counter to your desire for simplicity and allowing the operating system's scheduler to do its work.
No, they can be disabled (removed) if we prefer. With priority requests disabled, latency results becomes less excellent but still quite good.
Running ccbench on a dual core machine gives the following latency results, first with then without priority requests.
--- Latency --- (with prio requests)
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 0 ms. (std dev: 2 ms.)
CPU threads=3: 0 ms. (std dev: 2 ms.)
CPU threads=4: 0 ms. (std dev: 2 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 3 ms. (std dev: 2 ms.)
CPU threads=2: 3 ms. (std dev: 2 ms.)
CPU threads=3: 3 ms. (std dev: 2 ms.)
CPU threads=4: 4 ms. (std dev: 3 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 2 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 2 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)
--- Latency --- (without prio requests)
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 2 ms.)
CPU threads=1: 5 ms. (std dev: 0 ms.)
CPU threads=2: 3 ms. (std dev: 3 ms.)
CPU threads=3: 9 ms. (std dev: 7 ms.)
CPU threads=4: 22 ms. (std dev: 23 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 1 ms.)
CPU threads=1: 8 ms. (std dev: 2 ms.)
CPU threads=2: 5 ms. (std dev: 4 ms.)
CPU threads=3: 21 ms. (std dev: 32 ms.)
CPU threads=4: 19 ms. (std dev: 26 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 1 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

On 04:18 pm, daniel@stutzbachenterprises.com wrote:
On Mon, Oct 26, 2009 at 10:58 AM, Antoine Pitrou solipsis@pitrou.net wrote:
Er, I prefer to keep things simple. If you have lots of I/O you should probably use an event loop rather than separate threads.
On Windows, using a single-threaded event loop is sometimes impossible. WaitForMultipleObjects(), which is the Windows equivalent to select() or poll(), can handle a maximum of only 64 objects.
This is only partially accurate. For one thing, WaitForMultipleObjects calls are nestable. For another thing, Windows also has I/O completion ports which are not limited to 64 event sources. The situation is actually better than on a lot of POSIXes.
Do we really need priority requests at all? They seem counter to your desire for simplicity and allowing the operating system's scheduler to do its work.
Despite what I said above, however, I would also take a default position against adding any kind of more advanced scheduling system here. It would, perhaps, make sense to expose the APIs for controlling the platform scheduler, though.
Jean-Paul

On Oct 26, 2009, at 6:45 PM, exarkun@twistedmatrix.com wrote:
Despite what I said above, however, I would also take a default position against adding any kind of more advanced scheduling system here. It would, perhaps, make sense to expose the APIs for controlling the platform scheduler, though.
I would also like to have an exposed API and optional profiling (optional as in conditional compilation, not as in some sort of profiling 'flag' that slows down non-profiling runs) so that you can see what's going on well enough to use the API to tune a particular platform for a particular workload.
S

On 26Oct2009 22:45, exarkun@twistedmatrix.com wrote:
On 04:18 pm, daniel@stutzbachenterprises.com wrote:
Do we really need priority requests at all? They seem counter to your desire for simplicity and allowing the operating system's scheduler to do its work.
Despite what I said above, however, I would also take a default position against adding any kind of more advanced scheduling system here. It would, perhaps, make sense to expose the APIs for controlling the platform scheduler, though.
+1 to both sentences from me.

On Sun, Oct 25, 2009 at 1:22 PM, Antoine Pitrou solipsis@pitrou.net wrote:
Having other people test it would be fine. Even better if you have an actual multi-threaded py3k application. But ccbench results for other OSes would be nice too :-)
My results for an 2.4 GHz Intel Core 2 Duo MacBook Pro (OS X 10.5.8):
Control (py3k @ r75723)
--- Throughput ---
Pi calculation (Python)
threads=1: 633 iterations/s.
threads=2: 468 ( 74 %)
threads=3: 443 ( 70 %)
threads=4: 442 ( 69 %)

regular expression (C)
threads=1: 281 iterations/s.
threads=2: 282 ( 100 %)
threads=3: 282 ( 100 %)
threads=4: 282 ( 100 %)

bz2 compression (C)
threads=1: 379 iterations/s.
threads=2: 735 ( 193 %)
threads=3: 733 ( 193 %)
threads=4: 724 ( 190 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 1 ms. (std dev: 1 ms.)
CPU threads=2: 1 ms. (std dev: 2 ms.)
CPU threads=3: 3 ms. (std dev: 6 ms.)
CPU threads=4: 2 ms. (std dev: 3 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 975 ms. (std dev: 577 ms.)
CPU threads=2: 1035 ms. (std dev: 571 ms.)
CPU threads=3: 1098 ms. (std dev: 556 ms.)
CPU threads=4: 1195 ms. (std dev: 557 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 4 ms. (std dev: 5 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 1 ms. (std dev: 4 ms.)
Experiment (newgil branch @ r75723)
--- Throughput ---
Pi calculation (Python)
threads=1: 651 iterations/s.
threads=2: 643 ( 98 %)
threads=3: 637 ( 97 %)
threads=4: 625 ( 95 %)

regular expression (C)
threads=1: 298 iterations/s.
threads=2: 296 ( 99 %)
threads=3: 288 ( 96 %)
threads=4: 287 ( 96 %)

bz2 compression (C)
threads=1: 378 iterations/s.
threads=2: 720 ( 190 %)
threads=3: 724 ( 191 %)
threads=4: 718 ( 189 %)
--- Latency ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 1 ms. (std dev: 5 ms.)

Background CPU task: regular expression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 1 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 1 ms.)
CPU threads=3: 2 ms. (std dev: 2 ms.)
CPU threads=4: 2 ms. (std dev: 1 ms.)

Background CPU task: bz2 compression (C)
CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 3 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)
I also ran this through Unladen Swallow's threading microbenchmark, which is a straight copy of what David Beazley was experimenting with (simply iterating over 1000000 ints in pure Python) [1]. "iterative_count" is doing the loops one after the other, "threaded_count" is doing the loops in parallel using threads.
The results below are benchmarking py3k as the control, newgil as the experiment. When it says "x% faster", that is a measure of newgil's performance over py3k's.
With two threads:
iterative_count:
Min: 0.336573 -> 0.387782: 13.21% slower
# I've run this configuration multiple times and gotten the same slowdown.
Avg: 0.338473 -> 0.418559: 19.13% slower
Significant (t=-38.434785, a=0.95)

threaded_count:
Min: 0.529859 -> 0.397134: 33.42% faster
Avg: 0.581786 -> 0.429933: 35.32% faster
Significant (t=70.100445, a=0.95)
With four threads:
iterative_count:
Min: 0.766617 -> 0.734354: 4.39% faster
Avg: 0.771954 -> 0.751374: 2.74% faster
Significant (t=22.164103, a=0.95)
Stddev: 0.00262 -> 0.00891: 70.53% larger

threaded_count:
Min: 1.175750 -> 0.829181: 41.80% faster
Avg: 1.224157 -> 0.867506: 41.11% faster
Significant (t=161.715477, a=0.95)
Stddev: 0.01900 -> 0.01120: 69.65% smaller
With eight threads:
iterative_count:
Min: 1.527794 -> 1.447421: 5.55% faster
Avg: 1.536911 -> 1.479940: 3.85% faster
Significant (t=35.559595, a=0.95)
Stddev: 0.00394 -> 0.01553: 74.61% larger

threaded_count:
Min: 2.424553 -> 1.677180: 44.56% faster
Avg: 2.484922 -> 1.723093: 44.21% faster
Significant (t=184.766131, a=0.95)
Stddev: 0.02874 -> 0.02956: 2.78% larger
I'd be interested in multithreaded benchmarks with less-homogenous workloads.
Collin Winter
[1] - http://code.google.com/p/unladen-swallow/source/browse/tests/performance/bm_...

Collin Winter <collinw <at> gmail.com> writes:
My results for an 2.4 GHz Intel Core 2 Duo MacBook Pro (OS X 10.5.8):
Thanks!
[the Dave Beazley benchmark]
The results below are benchmarking py3k as the control, newgil as the experiment. When it says "x% faster", that is a measure of newgil's performance over py3k's.
With two threads:
iterative_count:
Min: 0.336573 -> 0.387782: 13.21% slower
# I've run this configuration multiple times and gotten the same slowdown.
Avg: 0.338473 -> 0.418559: 19.13% slower
Those numbers are not very in line with the other "iterative_count" results. Since iterative_count just runs the loop N times in a row, results should be proportional to the number N ("number of threads").
Besides, there's no reason for single-threaded performance to be degraded since the fast path of the eval loop actually got a bit streamlined (there is no volatile ticker to decrement).
I'd be interested in multithreaded benchmarks with less-homogenous workloads.
So would I.
Regards
Antoine.

On Mon, Oct 26, 2009 at 2:43 PM, Antoine Pitrou solipsis@pitrou.net wrote:
Collin Winter <collinw <at> gmail.com> writes: [the Dave Beazley benchmark]
The results below are benchmarking py3k as the control, newgil as the experiment. When it says "x% faster", that is a measure of newgil's performance over py3k's.
With two threads:
iterative_count:
Min: 0.336573 -> 0.387782: 13.21% slower
# I've run this configuration multiple times and gotten the same slowdown.
Avg: 0.338473 -> 0.418559: 19.13% slower
Those numbers are not very in line with the other "iterative_count" results. Since iterative_count just runs the loop N times in a row, results should be proportional to the number N ("number of threads").
Besides, there's no reason for single-threaded performance to be degraded since the fast path of the eval loop actually got a bit streamlined (there is no volatile ticker to decrement).
I agree those numbers are out of line with the others and make no sense. I've run it with two threads several times and the results are consistent on this machine. I'm digging into it a bit more.
Collin

The new GIL does away with this by ditching _Py_Ticker entirely and instead using a fixed interval (by default 5 milliseconds, but settable) after which we ask the GIL-holding thread to release the GIL and let another thread be scheduled.
I've looked at this part of the implementation, and have a few comments.
a) why is gil_interval a double? Make it an int, counting in microseconds.
b) notice that, on Windows, minimum wait resolution may be as large as 15ms (e.g. on XP, depending on the hardware). Not sure what this means for WaitForMultipleObjects; most likely, if you ask for a 5ms wait, it waits until the next clock tick. It would be bad if, on some systems, a wait of 5ms would mean that it immediately returns.
c) is the gil_drop_request handling thread-safe? If your answer is "yes", you should add a comment as to what the assumptions are of this code (ISTM that multiple threads may simultaneously attempt to set the drop request, overlapping with the holding thread actually dropping the GIL). There is also the question whether it is thread-safe to write into a "volatile int"; I keep forgetting the answer.
Regards, Martin

Martin v. Löwis skrev:
b) notice that, on Windows, minimum wait resolution may be as large as 15ms (e.g. on XP, depending on the hardware). Not sure what this means for WaitForMultipleObjects; most likely, if you ask for a 5ms wait, it waits until the next clock tick. It would be bad if, on some systems, a wait of 5ms would mean that it immediately returns.
Which is why one should use multimedia timers with QPC on Windows.
To get a wait function with much better resolution than Windows' default, do this:
1. Set a high resolution with timeBeginPeriod.
2. Loop using a time-out of 0 for WaitForMultipleObjects and put a Sleep(0) in the loop not to burn the CPU. Call QPC to get a precise timing, and break the loop when the requested time-out has been reached.
3. When you are done, call timeEndPeriod to turn the multimedia timer off.
This is how you create usleep() in Windows as well: Just loop on QPC and Sleep(0) after setting timeBeginPeriod(1).

Sturla Molden wrote:
Martin v. Löwis skrev:
b) notice that, on Windows, minimum wait resolution may be as large as 15ms (e.g. on XP, depending on the hardware). Not sure what this means for WaitForMultipleObjects; most likely, if you ask for a 5ms wait, it waits until the next clock tick. It would be bad if, on some systems, a wait of 5ms would mean that it immediately returns.
Which is why one should use multimedia timers with QPC on Windows.
Maybe you should study the code under discussion before making such a proposal.
Regards, Martin

Martin v. Löwis skrev:
Maybe you should study the code under discussion before making such a proposal.
I did, and it does nothing of what I suggested. I am sure I can make the Windows GIL in ceval_gil.h and the mutex in thread_nt.h a lot more precise and efficient.
This is the kind of code I was talking about, from ceval_gil.h:
r = WaitForMultipleObjects(2, objects, TRUE, milliseconds);
I would turn on multimedia timer (it is not on by default), and replace this call with a loop, approximately like this:
for (;;) {
    r = WaitForMultipleObjects(2, objects, TRUE, 0);
    /* blah blah blah */
    QueryPerformanceCounter(&cnt);
    if (cnt > timeout)
        break;
    Sleep(0);
}
And the timeout "milliseconds" would now be computed from querying the performance counter, instead of unreliably by the Windows NT kernel.
Sturla

Sturla Molden <sturla <at> molden.no> writes:
And the timeout "milliseconds" would now be computed from querying the performance counter, instead of unreliably by the Windows NT kernel.
Could you actually test your proposal under Windows and report what kind of concrete benefits it brings?
Thank you
Antoine.

Sturla Molden skrev:
I would turn on multimedia timer (it is not on by default), and replace this call with a loop, approximately like this:
for (;;) {
    r = WaitForMultipleObjects(2, objects, TRUE, 0);
    /* blah blah blah */
    QueryPerformanceCounter(&cnt);
    if (cnt > timeout)
        break;
    Sleep(0);
}
And just so you don't ask: There should not just be a Sleep(0) in the loop, but a sleep that gets shorter and shorter until a lower threshold is reached, where it skips to Sleep(0). That way we avoid hammering on WaitForMultipleObjects and QueryPerformanceCounter more than we need. And for all that to work better than just giving a timeout to WaitForMultipleObjects, we need the multimedia timer turned on.
Sturla

Sturla Molden skrev:
And just so you don't ask: There should not just be a Sleep(0) in the loop, but a sleep that gets shorter and shorter until a lower threshold is reached, where it skips to Sleep(0). That way we avoid hammering on WaitForMultipleObjects and QueryPerformanceCounter more than we need. And for all that to work better than just giving a timeout to WaitForMultipleObjects, we need the multimedia timer turned on.
The important thing about multimedia timer is that the granularity of Sleep() and WaitForMultipleObjects() by default is "10 ms or at most 20 ms". But if we call
timeBeginPeriod(1);
the MM timer is on and granularity becomes 1 ms or at most 2 ms. But we can get even more precise than that by hammering on Sleep(0) for timeouts less than 2 ms. We can get typical granularity in the order of 10 µs, with the occasional 100 µs now and then. I know this because I was using Windows 2000 to generate TTL signals on the LPT port some years ago, and watched them on the oscilloscope.
~ 15 ms granularity is Windows default. But that is brain dead.
By the way Antoine, if you think granularity of 1-2 ms is sufficient, i.e. no need for µs precision, then just calling timeBeginPeriod(1) will be sufficient.
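A minimal sketch of that simpler approach (untested, assuming the winmm timeBeginPeriod/timeEndPeriod functions; this is not code from the newgil branch, and the function names are made up):

    #include <windows.h>
    /* timeBeginPeriod / timeEndPeriod live in winmm; link with winmm.lib or -lwinmm */

    /* Request ~1 ms timer granularity once at interpreter startup, so a 5 ms
       timeout passed to WaitForMultipleObjects() is not rounded up to the
       default 10-15 ms tick, and restore the previous granularity at shutdown. */

    static void gil_timer_init(void)
    {
        timeBeginPeriod(1);     /* system-wide request; the lowest value wins */
    }

    static void gil_timer_fini(void)
    {
        timeEndPeriod(1);       /* must match the earlier timeBeginPeriod(1) */
    }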
Sturla

Sturla Molden <sturla <at> molden.no> writes:
By the way Antoine, if you think granularity of 1-2 ms is sufficient,
It certainly is. But once again, I'm no Windows developer and I don't have a native Windows host to test on; therefore someone else (you?) has to try.
Also, the MSDN doc (*) says timeBeginPeriod() can have a detrimental effect on system performance; I don't know how much of it is true.
(*) http://msdn.microsoft.com/en-us/library/dd757624(VS.85).aspx

Antoine Pitrou skrev:
It certainly is. But once again, I'm no Windows developer and I don't have a native Windows host to test on; therefore someone else (you?) has to try.
I'd love to try, but I don't have VC++ to build Python, I use GCC on Windows.
Anyway, the first thing to try then is to call
timeBeginPeriod(1);
once on startup, and leave the rest of the code as it is. If 2-4 ms is sufficient we can use timeBeginPeriod(2), etc. Microsoft is claiming Windows performs better with high granularity, which is why it is 10 ms by default.
Sturla

Sturla Molden wrote:
Antoine Pitrou skrev:
It certainly is. But once again, I'm no Windows developer and I don't have a native Windows host to test on; therefore someone else (you?) has to try.
I'd love to try, but I don't have VC++ to build Python, I use GCC on Windows.
Anyway, the first thing to try then is to call
timeBeginPeriod(1);
once on startup, and leave the rest of the code as it is. If 2-4 ms is sufficient we can use timeBeginPeriod(2), etc. Microsoft is claiming Windows performs better with high granularity, which is why it is 10 ms by default.
Sturla
That page claims: Windows uses the lowest value (that is, highest resolution) requested by any process.
I would posit that the chance of having some random process on your machine request a high-speed timer is high enough that the overhead for Python doing the same is probably low.
John =:->

I did, and it does nothing of what I suggested. I am sure I can make the Windows GIL in ceval_gil.h and the mutex in thread_nt.h a lot more precise and efficient.
Hmm. I'm skeptical that your code makes it more accurate, and I completely fail to see that it makes it more efficient (by what measurement of efficiency?)
Also, why would making it more accurate make it better? IIUC, accuracy is completely irrelevant here, though efficiency (low overhead) does matter.
This is the kind of code I was talking about, from ceval_gil.h:
r = WaitForMultipleObjects(2, objects, TRUE, milliseconds);
I would turn on multimedia timer (it is not on by default), and replace this call with a loop, approximately like this:
for (;;) {
    r = WaitForMultipleObjects(2, objects, TRUE, 0);
    /* blah blah blah */
    QueryPerformanceCounter(&cnt);
    if (cnt > timeout)
        break;
    Sleep(0);
}
And the timeout "milliseconds" would now be computed from querying the performance counter, instead of unreliably by the Windows NT kernel.
Hmm. This creates a busy wait loop; if you add larger sleep values, then it loses accuracy.
Why not just call timeBeginPeriod, and then rely on the higher clock rate for WaitForMultipleObjects?
Regards, Martin

On Mon, Nov 2, 2009 at 11:27 AM, "Martin v. Löwis" martin@v.loewis.de wrote:
Hmm. This creates a busy wait loop; if you add larger sleep values, then it loses accuracy.
I thought that at first, too, but then I checked the documentation for Sleep(0):
"A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run."
(this is not to say that I think the solution with Sleep is worthwhile, though...)
-- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC http://stutzbachenterprises.com

Martin v. Löwis skrev:
I did, and it does nothing of what I suggested. I am sure I can make the Windows GIL in ceval_gil.h and the mutex in thread_nt.h a lot more precise and efficient.
Hmm. I'm skeptical that your code makes it more accurate, and I completely fail to see that it makes it more efficient (by what measurement of efficiency?)
Also, why would making it more accurate make it better? IIUC, accuracy is completely irrelevant here, though efficiency (low overhead) does matter.
This is the kind of code I was talking about, from ceval_gil.h:
r = WaitForMultipleObjects(2, objects, TRUE, milliseconds);
I would turn on multimedia timer (it is not on by default), and replace this call with a loop, approximately like this:
for (;;) {
    r = WaitForMultipleObjects(2, objects, TRUE, 0);
    /* blah blah blah */
    QueryPerformanceCounter(&cnt);
    if (cnt > timeout)
        break;
    Sleep(0);
}
And the timeout "milliseconds" would now be computed from querying the performance counter, instead of unreliably by the Windows NT kernel.
Hmm. This creates a busy wait loop; if you add larger sleep values, then it loses accuracy.
Actually, a usleep looks like this, and the call to the wait function must go inside the for loop. But no, it's not a busy sleep.
#include <windows.h>

static int inited = 0;
static __int64 hz;
static double dhz;
/* Granularity of Sleep() once timeBeginPeriod(1) is in effect, in seconds. */
const double sleep_granularity = 2.0E-3;

void usleep(long us)
{
    __int64 cnt, end;
    double diff;
    if (!inited) {
        timeBeginPeriod(1);
        QueryPerformanceFrequency((LARGE_INTEGER *)&hz);
        dhz = (double)hz;
        inited = 1;
    }
    QueryPerformanceCounter((LARGE_INTEGER *)&cnt);
    /* Deadline in performance-counter ticks; us is in microseconds. */
    end = cnt + (__int64)(1.0E-6 * (double)us * dhz);
    for (;;) {
        QueryPerformanceCounter((LARGE_INTEGER *)&cnt);
        if (cnt >= end)
            break;
        diff = (double)(end - cnt) / dhz;   /* remaining time, in seconds */
        if (diff > sleep_granularity)
            Sleep((DWORD)(1000.0 * (diff - sleep_granularity)));  /* Sleep() takes ms */
        else
            Sleep(0);
    }
}
Why not just call timeBeginPeriod, and then rely on the higher clock rate for WaitForMultipleObjects?
That is what I suggested when Antoine said 1-2 ms was enough.
Sturla

Martin v. Löwis <martin <at> v.loewis.de> writes:
I've looked at this part of the implementation, and have a few comments.
a) why is gil_interval a double? Make it an int, counting in microseconds.
Ok, I'll do that.
b) notice that, on Windows, minimum wait resolution may be as large as 15ms (e.g. on XP, depending on the hardware). Not sure what this means for WaitForMultipleObjects; most likely, if you ask for a 5ms wait, it waits until the next clock tick. It would be bad if, on some systems, a wait of 5ms would mean that it immediately returns.
I'll let someone else give an authoritative answer. I did the Windows version mainly by reading the online MSDN docs, and testing that it worked fine in an XP VM.
Anyway, here is what the online docs have to say:
« The system clock "ticks" at a constant rate. If the time-out interval is less than the resolution of the system clock, the wait may time out in less than the specified length of time. If the time-out interval is greater than one tick but less than two, the wait can be anywhere between one and two ticks, and so on. »
So it seems that the timeout always happens on a Windows tick boundary, which means that it can immediately return, but only very rarely so and never more than once in a row...
c) is the gil_drop_request handling thread-safe? If your answer is "yes", you should add a comment as to what the assumptions are of this code (ISTM that multiple threads may simultaneously attempt to set the drop request, overlapping with the holding thread actually dropping the GIL).
The gil_drop_request is volatile mainly out of precaution, but it's probably not necessary. It is only written (set/cleared) when holding the gil_mutex; however, it can be read while not holding that mutex. Occasionally reading the "wrong" value would not have any adverse consequences; it would just degrade the switching quality at that instant.
Regards
Antoine.
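A condensed sketch of the pattern being described (simplified, not the actual ceval_gil.h code; gil_mutex and MUTEX_LOCK/MUTEX_UNLOCK stand in for whatever mutex primitives the port uses):

static volatile int gil_drop_request = 0;

/* A waiting thread asks the GIL holder to release; the flag is only
   ever written (set or cleared) with gil_mutex held. */
static void
set_gil_drop_request(void)
{
    MUTEX_LOCK(gil_mutex);
    gil_drop_request = 1;
    MUTEX_UNLOCK(gil_mutex);
}

/* The eval loop polls the flag without taking gil_mutex; a stale read
   merely delays the switch until the next check, it cannot corrupt
   anything. */
static int
should_drop_gil(void)
{
    return gil_drop_request;
}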

c) is the gil_drop_request handling thread-safe? If your answer is "yes", you should add a comment as to what the assumptions are of this code (ISTM that multiple threads may simultaneously attempt to set the drop request, overlapping with the holding thread actually dropping the GIL).
The gil_drop_request is volatile mainly out of precaution, but it's probably not necessary. It is only written (set/cleared) when holding the gil_mutex; however, it can be read while not holding that mutex. Occasionally reading the "wrong" value would not have any adverse consequences; it would just degrade the switching quality at that instant.
I think it then definitely needs to be volatile - otherwise, the compiler may choose to cache it in a register (assuming enough registers are available), instead of re-reading it from memory each time (which is exactly what volatile forces it to do).
Even if it is read from memory, I still wonder what might happen on systems that require explicit memory barriers to synchronize across CPUs. What if CPU 1 keeps reading a 0 value out of its cache, even though CPU 2 wrote a 1 value a long time ago?
IIUC, any (most?) pthread calls would cause synchronization in that case, which is why applications should also use locks for reading:
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_0...
Of course, on x86, you won't see any issues, because it's cache-coherent anyway.
Regards, Martin

Martin v. Löwis <martin <at> v.loewis.de> writes:
[gil_drop_request]
Even if it is read from memory, I still wonder what might happen on systems that require explicit memory barriers to synchronize across CPUs. What if CPU 1 keeps reading a 0 value out of its cache, even though CPU 2 wrote a 1 value a long time ago?
IIUC, any (most?) pthread calls would cause synchronization in that case, which is why applications should also use locks for reading:
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_0...
Of course, on x86, you won't see any issues, because it's cache-coherent anyway.
I think there are two things here:
- all machines Python runs on should AFAIK be cache-coherent: CPUs synchronize their views of memory in a rather timely fashion.
- memory ordering: writes made by a CPU can be seen in a different order by another CPU (i.e. CPU 1 writes A before B, but CPU 2 sees B written before A). I don't see how this can apply to gil_drop_request, since it's a single variable (and, moreover, only a single bit of it is significant).
(there's an explanation of memory ordering issues here: http://www.linuxjournal.com/article/8211)
As a side note, I remember Jeffrey Yasskin trying to specify an ordering model for Python code (see http://code.google.com/p/unladen-swallow/wiki/MemoryModel).
Regards
Antoine.

- all machines Python runs on should AFAIK be cache-coherent: CPUs synchronize
their views of memory in a rather timely fashion.
Ok. I thought that Itanium was an example where this assumption is actually violated (as many web pages claim such a restriction); however, it seems that on Itanium, caches are indeed synchronized using MESI.
So the claims about a lack of cache consistency on Itanium, and the need for barrier instructions, seem to stem from the Itanium feature that allows the processor to fetch memory out of order, i.e. an earlier read may see a later memory state. This is apparently used to satisfy reads as soon as the cache line is available (so that the cache line can be discarded earlier). With respect to your requirement ("rather timely fashion"), this still seems to be fine.
Still, this all needs to be documented in the code :-)
Regards, Martin

On Mon, Nov 2, 2009 at 4:15 AM, Antoine Pitrou solipsis@pitrou.net wrote:
Martin v. Löwis <martin <at> v.loewis.de> writes:
[gil_drop_request]
Even if it is read from memory, I still wonder what might happen on systems that require explicit memory barriers to synchronize across CPUs. What if CPU 1 keeps reading a 0 value out of its cache, even though CPU 2 wrote a 1 value a long time ago?
IIUC, any (most?) pthread calls would cause synchronization in that case, which is why applications should also use locks for reading:
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_0...
Of course, on x86, you won't see any issues, because it's cache-coherent anyway.
I think there are two things here:
- all machines Python runs on should AFAIK be cache-coherent: CPUs synchronize
their views of memory in a rather timely fashion.
- memory ordering: writes made by a CPU can be seen in a different order by
another CPU (i.e. CPU 1 writes A before B, but CPU 2 sees B written before A). I don't see how this can apply to gil_drop_request, since it's a single variable (and, moreover, only a single bit of it is significant).
(there's an explanation of memory ordering issues here: http://www.linuxjournal.com/article/8211)
As a side note, I remember Jeffrey Yasskin trying to specify an ordering model for Python code (see http://code.google.com/p/unladen-swallow/wiki/MemoryModel).
Note that that memory model was only for Python code; the C code implementing it is subject to the C memory model, which is weaker (and not even fully defined until the next C standard comes out).
To be really safe, we ought to have a couple of primitives implementing "atomic" rather than just "volatile" instructions, but until then a signal that's just saying "do something" rather than "here's some data you should look at" should be OK as a volatile int.
I'd like to look at the patch in detail, but I can't guarantee that I'll get to it in a timely manner. I'd say check it in and let more threading experts look at it in the tree. We've got some time before a release for people to fix problems and make further improvements. +1 to Martin's request for detailed documentation though. :)
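For illustration only, the kind of "atomic" primitive being alluded to could be written with the atomics that ended up in the next C standard (C11); this is a sketch of the idea, not anything in the patch:

#include <stdatomic.h>

static atomic_int gil_drop_request;

/* Waiting thread: publish the request with release semantics. */
static void
request_gil_drop(void)
{
    atomic_store_explicit(&gil_drop_request, 1, memory_order_release);
}

/* GIL-holding thread: polled in the eval loop with acquire semantics,
   giving well-defined cross-thread visibility even off x86. */
static int
gil_drop_requested(void)
{
    return atomic_load_explicit(&gil_drop_request, memory_order_acquire);
}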

Hello,
I built something very similar for my company last year, and it’s been running flawlessly in production at a few customer sites since, with avg. CPU usage ~50% around the clock. I even posted about it on the Python mailing list [1], where it drew almost no response at the time. I never posted code, though -- nobody seemed to be too interested.
I am well aware that your current work is a lot more far-reaching than what I’ve done, which is basically just a FIFO scheduler. I even added scheduling priorities later, which don’t work too well because the amount of time used for a "tick" can vary by several orders of magnitude, as you know.
Thought you might be interested.
Regards Stefan
[1] http://mail.python.org/pipermail/python-dev/2008-March/077814.html [2] http://www.bestinclass.dk/index.php/2009/10/python-vs-clojure-evolving/ [3] www.dabeaz.com/python/GIL.pdf
PS On a slightly different note, I came across some Python bashing [2] yesterday and somehow got from there to David Beazley’s presentation about the GIL [3]. While I don’t mind the bashing, the observations about the GIL seem quite unfair to me, because David’s measurements were made on Mac OS X with its horribly slow pthreads functions. I was not able to measure any slowdown on Linux.

Stefan Ring wrote:
[2] http://www.bestinclass.dk/index.php/2009/10/python-vs-clojure-evolving/ [3] www.dabeaz.com/python/GIL.pdf
PS On a slightly different note, I came across some Python bashing [2] yesterday and somehow got from there to David Beazley’s presentation about the GIL [3]. While I don’t mind the bashing, the observations about the GIL seem quite unfair to me, because David’s measurements were made on Mac OS X with its horribly slow pthreads functions. I was not able to measure any slowdown on Linux.
We care about Mac OS X though, so even if the contention wasn't as bad on a different OS, the Mac downsides matter.
With the GIL updates in place, it would be interesting to see that analysis redone for 2.7/3.2 though.
Regards, Nick.
P.S. As far as interest in the idea goes, the GIL is one of those areas where it takes a fairly rare combination of interest, expertise and established credibility to propose a change and get assent to it. You'll notice that even Antoine had to resort to the "if nobody objects soon, I'm checking this in" tactic to garner any responses. It's an area where even those with relevant expertise still have to put aside a fair chunk of time in order to properly review a proposed change :)

I built something very similar for my company last year, and it’s been running flawlessly in production at a few customer sites since, with avg. CPU usage ~50% around the clock. I even posted about it on the Python mailing list [1], where it drew almost no response at the time. I never posted code, though -- nobody seemed to be too interested.
I've never bothered to make this tidy and nice; in particular, the function naming (PySpecial_*) leaves something to be desired. It's not too bad, though; it just doesn't have commit-ready quality. I don't worry about this anymore, so I just post what I have. Maybe someone can make use of it.
participants (20):
- "Martin v. Löwis"
- Andrew MacIntyre
- Antoine Pitrou
- Baptiste Lepilleur
- Brett Cannon
- Cameron Simpson
- Christian Heimes
- Collin Winter
- Daniel Stutzbach
- exarkun@twistedmatrix.com
- Gregory P. Smith
- Guido van Rossum
- Jeffrey Yasskin
- John Arbash Meinel
- Kristján Valur Jónsson
- Nick Coghlan
- ssteinerX@gmail.com
- Stefan Ring
- Sturla Molden
- Terry Reedy