
Hi all,

Here is a second (last?) attempt at getting traction on fixing the GIL (is it broken?) with BFS (WTF?). So don't be shy (don't be too rude either) since ignoring counts as down voting.

Relevant Python issue: http://bugs.python.org/issue7946

*Bottom line first*

I submitted an implementation of BFS (http://ck.kolivas.org/patches/bfs/sched-BFS.txt) as a patch to the GIL, which, to the extent I have tested it, behaves nicely on Windows XP, Windows 7, and GNU/Linux with either the CFS or O(1) scheduler, on 1/2/4 cores, laptop, desktop, and a VirtualBox VM guest (some data below). The patch is still a work in progress and requires work in terms of style, moving code where it belongs, test code, etc. Nevertheless, Python core developers recommended I already (re)post to python-dev for discussion.

*So is the GIL broken?*

There seems to be some disagreement on that question among Python core developers (unless you all agree it is not broken :) ). Some developers maintain the effects described by David Beazley do not affect real-world systems. Even I took the role of a devil's advocate in a previous discussion, but in fact I think that Python, being a general-purpose language, is similar to the OS in that regard. It is used across many application domains, platforms, and development paradigms, just as OS schedulers are, and therefore accepting thread scheduling with such properties as a fact of life is not a good idea.

I was first bitten by the original GIL last year while testing a system, found David's research while looking for answers, and later had to work around the problem in another system. Here are other real-world cases:

1) Zope people hit this back in 2002 and documented the problem with interesting insight:
http://www.zope.org/Members/glpb/solaris/multiproc
"I have directly observed a 30% penalty under MP constraints when the sys.setcheckinterval value was too low (and there was too much GIL thrashing)."
http://www.zope.org/Members/glpb/solaris/report_ps
"A machine that's going full-throttle isn't as bad, curiously enough -- because the other CPU's are busy doing real work, the GIL doesn't have as much opportunity to get shuffled between CPUs. On a MP box it's very important to set sys.setcheckinterval() up to a fairly large number, I recommend pystones / 50 or so."

2) Python mailing list - 2005
http://mail.python.org/pipermail/python-list/2005-August/336286.html
"The app suffers from serious performance degradation (compared to pure c/C++) and high context switches that I suspect the GIL unlocking may be aggravating ?"

3) Python mailing list - 2008
http://mail.python.org/pipermail/python-list/2008-June/1143217.html
"When I start the server, it sometimes eats up 100% of the CPU for a good minute or so... though none of the threads are CPU-intensive"

4) Twisted
http://twistedmatrix.com/pipermail/twisted-python/2005-July/011048.html
"When I run a CPU intensive method via threads.deferToThread it takes all the CPU away and renders the twisted process unresponsive."

Admittedly, it is not easy to dig reports up in Google. Finally, I think David explained the relevance of this problem quite nicely:
http://mail.python.org/pipermail/python-dev/2010-March/098416.html

*What about the new GIL?*

There is no real-world experience with the new GIL since it is under development. What we have is David's analysis and a few benchmarks from the bug report.

*Evolving the GIL into a scheduler*

The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it. The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios (see data below).
*BFS*

Enter BFS, a new scheduler designed by Con Kolivas, a Linux kernel hacker who is an expert in this field:
http://ck.kolivas.org/patches/bfs/sched-BFS.txt

"The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to completely do away with the complex designs of the past for the cpu process scheduler and instead implement one that is very simple in basic design. The main focus of BFS is to achieve excellent desktop interactivity and responsiveness without heuristics and tuning knobs that are difficult to understand, impossible to model and predict the effect of, and when tuned to one workload cause massive detriment to another."

I submitted an implementation of BFS (bfs.patch) which on my machines gives performance comparable to gilinter2.patch (Antoine's) and seems to schedule threads more fairly and predictably, and with a lower rate of context switching (see data below).

There are, however, some issues in bfs.patch:

1) It works on top of the OS scheduler, which means (for all GIL patches!):
a) It does not control, and is not aware of, information such as OS thread preemption, which CPU core a thread runs on, etc.
b) There may be hard-to-predict interaction between BFS and the underlying OS scheduler, which needs to be tested on each target platform.

2) It works best when the TSC (http://en.wikipedia.org/wiki/Time_Stamp_Counter) is available and otherwise falls back to gettimeofday(). I expect the scheduler to misbehave to some degree, or affect performance, when the TSC is not available and either of the following is true:
a) gettimeofday() is very expensive to read (this impacts release/acquire overhead).
b) gettimeofday() has very low precision, ~10ms.

By design of BFS, once CPU load crosses a given threshold (about 8 CPU-bound tasks which need the CPU at once), the scheduler falls back to FIFO behavior and latency goes up sharply.

I have no data on how bfs.patch behaves on ARM, AMD, old CPU models, OSX, FreeBSD, Solaris, or mobile.
The patch may require some tuning to work properly on those systems, so data is welcome (make sure the TSC code in Include/cycle.h works on those systems before benching). All that said, to the extent I have tested it, bfs.patch behaves nicely on Windows XP, Windows 7, and GNU/Linux with either the CFS or O(1) scheduler, on 1/2/4 cores, laptop, desktop, and a VirtualBox VM guest.

*Data*

Comparison of proposed patches running ccbench on Windows XP:
http://bugs.python.org/issue7946#msg104899

Comparison of proposed patches running Florent's write.py test on Ubuntu Karmic:
http://bugs.python.org/issue7946#msg105687

Comparison of old GIL, new GIL and BFS running ccbench on Ubuntu Karmic:
http://bugs.python.org/issue7946#msg105874

The last comparison includes a run of the old GIL with sys.setcheckinterval(2500), as the Zope people do. IO latency shoots up to ~1000ms as a result.

*What can be done with it?*

Here are some options:
1) Abandon it - no one is interested, yawn.
2) Take ideas and workarounds from its code and apply them to other patches.
3) Include it in the interpreter as an auxiliary (turned on with a runtime switch) scheduler.
4) Adopt it as the Python scheduler.

*Opinion?*

Your opinion is needed. However, please submit code review comments which are not likely to interest other people (e.g. "why did you use volatile for X?") at the issue page: http://bugs.python.org/issue7946

Thanks, Nir
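[Point 2b above concerns timer precision. The following is a rough illustrative sketch, not part of bfs.patch, of how one might probe from Python the wall-clock granularity that a gettimeofday()-based fallback would see on a given platform; the function name and sample count are arbitrary.]

```python
# Rough probe of wall-clock granularity as seen from Python.
# This only approximates what a C-level gettimeofday() fallback
# would observe; it is an illustration, not patch code.
import time

def clock_resolution(samples=100):
    """Return the smallest observed tick of time.time()."""
    smallest = None
    for _ in range(samples):
        t0 = time.time()
        t1 = time.time()
        while t1 == t0:          # spin until the clock advances
            t1 = time.time()
        delta = t1 - t0
        if smallest is None or delta < smallest:
            smallest = delta
    return smallest

if __name__ == "__main__":
    print("observed resolution: ~%.9f s" % clock_resolution())
```

A platform whose result lands near 10ms would be one where the fallback behavior described above is a concern.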

I'm interested in having *something*, but I'm not particularly interested in the 3.x branch. I expect it to be at least 5 years before the various Python subsystems I depend on are all ported to 3.x, and I expect it to be at least that long before I can move onto a version of OS X that ships with Python 3 as the default system Python. Right now, I'm faced with the prospect of Apple's next OS including Python 2.7, and being saddled with the current Python 2.x terrible multicore behavior for the next 5 years or so. We badly need some kind of patch for this in the 2.x branch. Bill

Bill Janssen wrote:
Right now, I'm faced with the prospect of Apple's next OS including Python 2.7, and being saddled with the current Python 2.x terrible multicore behavior for the next 5 years or so. We badly need some kind of patch for this in the 2.x branch.
The matter of the GIL seems far less urgent to those of us that don't see threading as a particularly good way to exploit multiple cores. Either way, with the first 2.7 release candidate out soon, it's already too late to contemplate significant changes to the GIL for that release. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Nick Coghlan <ncoghlan@gmail.com> wrote:
Bill Janssen wrote:
Right now, I'm faced with the prospect of Apple's next OS including Python 2.7, and being saddled with the current Python 2.x terrible multicore behavior for the next 5 years or so. We badly need some kind of patch for this in the 2.x branch.
The matter of the GIL seems far less urgent to those of us that don't see threading as a particularly good way to exploit multiple cores.
Nick, this isn't about exploiting cores. This is about Python programs that used to work fine on single-core machines suddenly becoming horrible resource hogs when moved to a more modern machine with a more modern version of Python. As far as I'm concerned, just tying all of the program's threads to a single core would be fine, though I imagine others would differ.
Either way, with the first 2.7 release candidate out soon, it's already too late to contemplate significant changes to the GIL for that release.
The release schedule, and labelling things as "release candidates" or not, are all under our control. Nothing is "too late". And there's always Python 2.8 :-). But I'd consider this a bug in the threading library, not some unmotivated blue-sky change to the GIL. Bill

Bill Janssen wrote:
As far as I'm concerned, just tying all of the program's threads to a single core would be fine, though I imagine others would differ.
Which can be done through the OS tools for setting an application's CPU affinity. Fixing the Python thread scheduling is only necessary if we want to be able to exploit the extra power of those cores rather than forcing reversion to single core behaviour. Note that I'm not *opposed* to fixing it, and the discussion in the tracker issue over Nir and Dave's different approaches to the problem looks interesting.
The release schedule, and labelling things as "release candidates" or not, are all under our control. Nothing is "too late". And there's always Python 2.8 :-) . But I'd consider this a bug in the threading library, not some unmotivated blue-sky change to the GIL.
Yes, but if we never said "too late" we'd never ship anything :) And you do have a reasonable case for considering this a bug, but it wouldn't be the first time we've escalated bug fixes to "new feature" level simply because they had a relatively high impact on core parts of the code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
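[For concreteness, the OS-level affinity pinning Nick refers to can nowadays be done from Python itself on Linux; os.sched_setaffinity only arrived later, in Python 3.3, so at the time of this thread one would reach for an external tool such as taskset(1) instead. A sketch, not a recommendation:]

```python
# Pin the current process (and hence all of its threads) to CPU 0.
# os.sched_setaffinity is Linux-only; on other platforms this sketch
# simply does nothing. The 2.x-era equivalent was an external tool,
# e.g.: taskset -c 0 python myapp.py
import os

if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})            # pid 0 means "this process"
    print("now bound to CPUs:", sorted(os.sched_getaffinity(0)))
else:
    print("no affinity API on this platform; use an external tool")
```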

Nick Coghlan <ncoghlan@gmail.com> wrote:
Bill Janssen wrote:
As far as I'm concerned, just tying all of the program's threads to a single core would be fine, though I imagine others would differ.
Which can be done through the OS tools for setting an application's CPU affinity.
Yes, as I say, if the initialization of the threading module called those tools appropriately and automatically, I'd be happy. Well, less unhappy :-). I have to admit, I don't know how to do this on OS X. What's the tool and the process, if we're not getting too far afield? Bill

Bill Janssen wrote:
Nick Coghlan <ncoghlan@gmail.com> wrote:
Bill Janssen wrote:
As far as I'm concerned, just tying all of the program's threads to a single core would be fine, though I imagine others would differ. Which can be done through the OS tools for setting an application's CPU affinity.
Yes, as I say, if the initialization of the threading module called those tools appropriately and automatically, I'd be happy. Well, less unhappy :-).
I have to admit, I don't know how to do this on OS X. What's the tool and the process, if we're not getting too far afield?
OSX doesn't really support thread affinity. The affinity API that they have is described on http://developer.apple.com/mac/library/releasenotes/Performance/RN-AffinityA... You can't bind a thread to a specific core with it, but you can request that multiple threads all run on the same core (leaving the choice of which core to the system). IIUC, an affinity preference does not survive exec(2), so you can't write a tool that binds all threads of its child processes to a single core (such a tool is available on many unices, though). Regards, Martin

Martin v. Löwis <martin@v.loewis.de> wrote:
OSX doesn't really support thread affinity. The affinity API that they have is described on
http://developer.apple.com/mac/library/releasenotes/Performance/RN-AffinityA...
You can't bind a thread to a specific core with it, but you can request that multiple threads all run on the same core (leaving the choice of which core to the system).
I believe that would be sufficient to fix the problem with Python, though I wonder about the effect on JCC-generated modules like pylucene, where the threads are really Java threads as well as Python threads. So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set. Presumably there would also be a way to clear that binding, for folks who know what they're doing. Bill

On Sun, 16 May 2010 15:13:44 PDT Bill Janssen <janssen@parc.com> wrote:
So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set.
This is not really a good idea. There's some code which releases the GIL, precisely so that you can run several threads (computations) at once. If you aren't too concerned with interactivity, you can increase sys.setcheckinterval() to alleviate the problem on 2.x. Regards Antoine.
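[Antoine's workaround looks like this in practice; on 2.x the check interval is counted in bytecode instructions. The 3.2+ branch of this sketch shows the new GIL's time-based knob purely for contrast and is not part of his suggestion.]

```python
import sys

if hasattr(sys, "setcheckinterval"):
    # Python 2.x: consider a thread switch (and GIL re-acquisition)
    # only every N bytecode instructions instead of the default 100.
    # Larger values mean less GIL thrashing but worse I/O latency.
    sys.setcheckinterval(2500)
else:
    # Python 3.9+ removed setcheckinterval; the new GIL's knob is
    # time-based instead (default 0.005 seconds).
    sys.setswitchinterval(0.005)
```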

Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 16 May 2010 15:13:44 PDT Bill Janssen <janssen@parc.com> wrote:
So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set.
This is not really a good idea.
Yes, fixing the GIL for multicore in 2.7 would be a better idea, IMO. But that might be too large a change.
There's some code which releases the GIL, precisely so that you can run several threads (computations) at once.
If they can get hold of the GIL in the first place! Yes, you'd want to be able to "unbind" threads if you knew what you were doing, so that they could run on other cores, and you'd want a switch to disable the affinity mechanism entirely. But I'd be happy to have things in the naive case run as well as they do on single-core machines, and let experts do optimizations.
If you aren't too concerned with interactivity, you can increase sys.setcheckinterval() to alleviate the problem on 2.x.
Unfortunately, many use cases might well be concerned with interactivity. Things like event-processing loops. Bill

Le lundi 17 mai 2010 à 09:05 -0700, Bill Janssen a écrit :
Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 16 May 2010 15:13:44 PDT Bill Janssen <janssen@parc.com> wrote:
So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set.
This is not really a good idea.
Yes, fixing the GIL for multicore in 2.7 would be a better idea, IMO. But that might be too large a change.
Well, 2.7rc1 is scheduled in less than three weeks now. IMO any patch changing fundamental threading properties is a no-no (even the processor affinity proposal). Someone had tried to backport the new GIL to 2.7 (there's a tracker issue for it, I'm too lazy to search right now), but it was concluded that compatibility requirements for Python 2.x (compatibility with various legacy threading libraries) made things too complicated and tedious.
There's some code which releases the GIL, precisely so that you can run several threads (computations) at once.
If they can get hold of the GIL in the first place! Yes, you'd want to be able to "unbind" threads if you knew what you were doing, so that they could run on other cores, and you'd want a switch to disable the affinity mechanism entirely. But I'd be happy to have things in the naive case run as well as they do on single-core machines, and let experts do optimizations.
"Letting experts do optimizations" is a regression, though, because right now you don't have to be an expert to take advantage of such optimizations (just run a separate thread doing e.g. some zlib compression). Regards Antoine.

Antoine Pitrou <solipsis@pitrou.net> wrote:
Well, 2.7rc1 is scheduled in less than three weeks now. IMO any patch changing fundamental threading properties is a no-no (even the processor affinity proposal).
Unfortunately, our "fundamental threading properties" are broken for multicore machines. And multicore seems to be the wave of the future. Not fixing this is a big problem. It relegates the only Python which will be usable by many many people for many years, 2.x, to a toy language status on modern machines. It will have threads, but the use of them, either directly or indirectly by modules you may import, may cause unpredictable performance problems. I'd hate to let a fundamental flaw like this go through simply because someone somewhere somewhen set a completely synthetic deadline.
Someone had tried to backport the new GIL to 2.7 (there's a tracker issue for it, I'm too lazy to search right now), but it was concluded that compatibility requirements for Python 2.x (compatibility with various legacy threading libraries) made things too complicated and tedious.
http://bugs.python.org/issue7753, from January. I've read through that issue (several times), and I have to say that I wind up gnashing my teeth each time. Here's a fix that's rejected because it "only" supports NT and POSIX threads. What percentage of Python use cases do those two threading systems cover? Do we know? If by "compatibility requirements" you mean that new releases of Python should run on antique systems just as the older releases did, that's satisfied by the issue's current state, you're right. On the other hand, to me that seems an odd goal to prioritize. After all, the older releases are still there, for that antique system. Nor do I see an answer to Russ' final question: ``What if, as you proposed earlier, the patch were to leave the old behavior if the threading model on the given platform were not supported?''
There's some code which releases the GIL, precisely so that you can run several threads (computations) at once.
If they can get hold of the GIL in the first place! Yes, you'd want to be able to "unbind" threads if you knew what you were doing, so that they could run on other cores, and you'd want a switch to disable the affinity mechanism entirely. But I'd be happy to have things in the naive case run as well as they do on single-core machines, and let experts do optimizations.
"Letting experts do optimizations" is a regression, though, because right now you don't have to be an expert to take advantage of such optimizations (just run a separate thread doing e.g. some zlib compression).
If threading performance wasn't broken on multicore, I'd agree with you. But right now, *everyone* has to be an expert just to use Python 2.x effectively on modern multicore hardware -- you have to find the right patch in issue 7753, apply it to the sources, build a custom python, and use it. Whether or not use explicitly know you are using threads (because some other package may be using them under the covers). Bill

On Mon, 17 May 2010 10:11:03 PDT Bill Janssen <janssen@parc.com> wrote:
I'd hate to let a fundamental flaw like this go through simply because someone somewhere somewhen set a completely synthetic deadline.
[...]
I've read through that issue (several times), and I have to say that I wind up gnashing my teeth each time. Here's a fix that's rejected because it "only" supports NT and POSIX threads. What percentage of Python use cases do those two threading systems cover? Do we know?
Well, if instead of gnashing your teeth, you had contributed to the issue, perhaps a patch would have been committed by now (or perhaps not, but who knows?). If you stay silent, you cannot assume that someone else will stand up for *your* opinion (and the fact that nobody did could indicate that not many people care about the issue, actually).
But right now, *everyone* has to be an expert just to use Python 2.x effectively on modern multicore hardware
Python works reasonably well on multicore hardware, especially if you don't run spinning loops and if you're not on Mac OS X. It may lose *at most* 10-20% performance compared to a single-core run but that's hardly the end of the world. And some workloads won't suffer any degradation. Besides, today's multicore CPUs have far better single-threaded performance than yesteryear's single-core CPUs, which makes the performance regression issue more theoretical than practical. In real life, you have very little risk of witnessing a performance regression when switching your Python from a single-core to a multicore machine. Regards Antoine.

Antoine Pitrou <solipsis@pitrou.net> wrote:
Well, if instead of gnashing your teeth, you had contributed to the issue, perhaps a patch would have been committed by now (or perhaps not, but who knows?). If you stay silent, you cannot assume that someone else will stand up for *your* opinion (and the fact that nobody did could indicate that not many people care about the issue, actually).
Unfortunately, I somehow did not even *know* about the issue until February, after the issue had been closed. What I did know was that some of our big complicated Python multi-threaded daemons had shown puzzling resource hogging when moved from small Macs to large 8-core machines with hardware RAID and lots of memory. But, simpleton that I am, I'd presumed that threading in Python wasn't broken, and was looking elsewhere for the cause.
Python works reasonably well on multicore hardware, especially if you don't run spinning loops and if you're not on Mac OS X.
I'm not sure what you mean by "spinning loops". But I *am* on Mac OS X, along with an increasing percentage of the world. And I'm dismayed that there's no momentum to fix this problem. Not a good sign. Bill

On Mon, 17 May 2010 11:15:49 PDT Bill Janssen <janssen@parc.com> wrote:
What I did know was that some of our big complicated Python multi-threaded daemons had shown puzzling resource hogging when moved from small Macs to large 8-core machines with hardware RAID and lots of memory.
Could you give detailed information about this? Since you're talking about a "big complicated Python multi-threaded daemon", I presume you can't port it to Python 3 very quickly, but it would be nice to know if the problem disappears with 3.2.
I'm not sure what you mean by "spinning loops".
It was an allusion to Dave Beazley's first benchmarks, which merely ran a spinning loop over several threads, and showed catastrophic degradation under OS X.
But I *am* on Mac OS X, along with an increasing percentage of the world. And I'm dismayed that there's no momentum to fix this problem.
There /has/ been momentum in fixing it. In py3k. Regards Antoine.
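[For readers who have not seen them, Beazley's "spinning loop" benchmarks boil down to something like the following sketch; the workload size here is arbitrary. On a multicore box with the old GIL, the threaded run can come out dramatically slower than the sequential one, which is the degradation being discussed.]

```python
import threading
import time

def countdown(n):
    # Pure-Python CPU-bound spin; never releases the GIL voluntarily.
    while n > 0:
        n -= 1

N = 2000000  # arbitrary workload size for illustration

start = time.time()
countdown(N)
countdown(N)
sequential = time.time() - start

start = time.time()
threads = [threading.Thread(target=countdown, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

print("sequential: %.2fs  threaded: %.2fs" % (sequential, threaded))
```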

Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 May 2010 11:15:49 PDT Bill Janssen <janssen@parc.com> wrote:
What I did know was that some of our big complicated Python multi-threaded daemons had shown puzzling resource hogging when moved from small Macs to large 8-core machines with hardware RAID and lots of memory.
Could you give detailed information about this?
Probably not detailed enough. IP issues. It's a version of UpLib.
Since you're talking about a "big complicated Python multi-threaded daemon", I presume you can't port it to Python 3 very quickly, but it would be nice to know if the problem disappears with 3.2.
Yes, it would. As soon as I have working 3.x versions of BeautifulSoup, PIL, ReportLab, JCC, pylucene, pyglet, nltk, email, epydoc, feedparser, dictclient, docutils, hachoir, mutagen, medusa, python-dateutil, and vobject, I'll let you know. :-)
There /has/ been momentum in fixing it. In py3k.
Yes, I specifically meant in the 2.x branch. I'm guessing I'll have to stay on 2.x for at least 5 more years, due to the other package dependencies. Bill

On 5/17/2010 2:59 PM, Bill Janssen wrote:
Yes, it would. As soon as I have working 3.x versions of BeautifulSoup, PIL, ReportLab, JCC, pylucene, pyglet, nltk, email, epydoc, feedparser, dictclient, docutils, hachoir, mutagen, medusa, python-dateutil, and vobject, I'll let you know. :-)
There /has/ been momentum in fixing it. In py3k.
Yes, I specifically meant in the 2.x branch. I'm guessing I'll have to stay on 2.x for at least 5 more years, due to the other package dependencies.
I suspect it will be sooner than that, especially if users like you ask/beg/plead with the maintainers of libraries like those you listed to make them work with 3.2, given your particular reason: that Python 3 will work increasingly well with multicore machines. I am sure a couple of such maintainers have posted that they see no reason to upgrade until users start requesting it. Terry Jan Reedy

Terry Reedy <tjreedy@udel.edu> wrote:
On 5/17/2010 2:59 PM, Bill Janssen wrote:
Yes, it would. As soon as I have working 3.x versions of BeautifulSoup, PIL, ReportLab, JCC, pylucene, pyglet, nltk, email, epydoc, feedparser, dictclient, docutils, hachoir, mutagen, medusa, python-dateutil, and vobject, I'll let you know. :-)
There /has/ been momentum in fixing it. In py3k.
Yes, I specifically meant in the 2.x branch. I'm guessing I'll have to stay on 2.x for at least 5 more years, due to the other package dependencies.
I suspect it will be sooner than that, especially if users like you ask/beg/plead with the maintainers of libraries like those you listed to make them work with 3.2.
Oh, that's the way I like to spend my day (and, as you can tell from this conversation, I'm really good at it :-). Though I will of course do that. But some of these, like JCC+pylucene, nltk, and vobject, were developed with idiosyncratic funding resources which no longer exist. Others, like pyglet, were developed for a completely different purpose, and I doubt the developers care what I want. So, realistically, I doubt it will be less than five years. Bill

But some of these, like JCC+pylucene, nltk, and vobject, were developed with idiosyncratic funding resources which no longer exist. Others, like pyglet, were developed for a completely different purpose, and I doubt the developers care what I want. So, realistically, I doubt it will be less than five years.
Make it a GSoC project next summer, and you have them ported next fall. Regards, Martin

Bill Janssen <janssen@parc.com> wrote:
use it. Whether or not use explicitly know you are using threads (because some other package may be using them under the covers).
Of course, I meant to say, "Whether or not *youse* explicitly know you are using threads (because some other package may be using them under the covers)." :-). Bill

Not fixing this is a big problem. It relegates the only Python which will be usable by many many people for many years, 2.x, to a toy language status on modern machines. It will have threads, but the use of them, either directly or indirectly by modules you may import, may cause unpredictable performance problems.
People may disagree with this characterization, but if we take that for granted, then, yes, we are willing to accept that as the state of things for the coming years. People running into these problems will have a number of choices available to them: switch operating systems (i.e. drop OSX for something that actually works), switch programming languages (i.e. drop Python for something that actually works), switch application architectures (i.e. drop threads for something that actually works), switch to 3.x, or just accept the problem, and hope that the system will find something else to do while switching Python threads.
I'd hate to let a fundamental flaw like this go through simply because someone somewhere somewhen set a completely synthetic deadline.
No, it's not like that. We set the deadline so that we are able to cancel discussions like this one. It would be possible to change the schedule, if we would agree that it was for a good cause - which we don't.
If threading performance wasn't broken on multicore, I'd agree with you. But right now, *everyone* has to be an expert just to use Python 2.x effectively on modern multicore hardware
Not at all. Just use the multiprocessing module instead, and be done. It's really easy to use if you already managed to understand threads. Regards, Martin
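[Martin's suggestion in sketch form; `square` is a hypothetical stand-in for a real CPU-bound task. Unlike threads, each worker below is a separate process with its own interpreter and its own GIL, so CPU-bound work genuinely runs in parallel on multiple cores.]

```python
from multiprocessing import Pool

def square(n):
    # Stand-in for a real CPU-bound task.
    return n * n

if __name__ == "__main__":
    pool = Pool(processes=2)        # 2.6-era API: no context manager yet
    try:
        print(pool.map(square, range(10)))
    finally:
        pool.close()
        pool.join()
```

The API deliberately mirrors the threading idioms (map a function over work items, join when done), which is what makes it "really easy to use if you already managed to understand threads".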

Martin v. Löwis <martin@v.loewis.de> wrote:
I'd hate to let a fundamental flaw like this go through simply because someone somewhere somewhen set a completely synthetic deadline.
No, it's not like that. We set the deadline so that we are able to cancel discussions like this one. It would be possible to change the schedule, if we would agree that it was for a good cause - which we don't.
I do appreciate that, and also what you and Antoine are saying.
If threading performance wasn't broken on multicore, I'd agree with you. But right now, *everyone* has to be an expert just to use Python 2.x effectively on modern multicore hardware
Not at all. Just use the multiprocessing module instead, and be done. It's really easy to use if you already managed to understand threads.
But that's just the problem. Most folks don't use "threads", they use a higher-level abstraction like the nltk library. Does it use threads? Has its owner ported it to py3k? Has its owner ported it to the multiprocessing module? I have to be an expert to know. I'll stop talking about this now... At least, here. Apparently we only need to fix this for OS X. Bill

Bill Janssen, 17.05.2010 23:09:
Most folks don't use "threads"
Seems like a somewhat reasonable assumption to me.
they use a higher-level abstraction like the nltk library.
I have my doubts that this applies to "most folks" - likely not even to most of those who use threads. Stefan

Martin v. Löwis wrote:
People running into these problems will have a number of choices available to them: switch operating systems (i.e. drop OSX for something that actually works), switch programming languages (i.e. drop Python for something that actually works), switch application architectures (i.e. drop threads for something that actually works), switch to 3.x, or just accept the problem, and hope that the system will find something else to do while switching Python threads.
There's even another option: if the new-GIL backport patch in http://bugs.python.org/issue7753 works for you, apply it and run with it (and advocate for a Python 2.8 release to make it more widely available, possibly even contributing the fallback behaviour discussed in the issue to make that situation more likely). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Antoine Pitrou wrote:
On Sun, 16 May 2010 15:13:44 PDT Bill Janssen <janssen@parc.com> wrote:
So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set.
This is not really a good idea. There's some code which releases the GIL, precisely so that you can run several threads (computations) at once.
Somewhat irrelevant given the rest of this thread, but you could potentially deal with that by playing CPU affinity games in the BEGIN/END_ALLOW_THREADS macros. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 16 May 2010 15:13:44 PDT Bill Janssen <janssen@parc.com> wrote:
So the patch to the threading code would presumably, for those OSs where the capability exists, try to put all created threads in the same affinity set.
This is not really a good idea. There's some code which releases the GIL, precisely so that you can run several threads (computations) at once.
Could the macro that releases the GIL also release the thread affinity? And the macro that acquires it again set the affinity tag back again? Bill

Le lundi 17 mai 2010 à 14:36 -0700, Bill Janssen a écrit :
Could the macro that releases the GIL also release the thread affinity?
We release the GIL in a variety of situations which don't necessarily involve heavy computations (such as: waiting for IO or sleeping). Furthermore, having several threads taking / dropping the GIL would make the affinity setting / unsetting pattern quite chaotic. Really, I think the processor affinity solution can only be application-specific. There doesn't seem to be an easy, generic way of guessing whether some kind of processor affinity should be requested or not. Regards Antoine.
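[Editorial note: the affinity idea discussed above can be illustrated with a minimal sketch. This uses os.sched_setaffinity, which is Linux-only and was only added to the stdlib in Python 3.3 — the 2010 discussion was about doing the equivalent in C inside the BEGIN/END_ALLOW_THREADS macros, which is not shown here. Pinning the whole process to one core serializes all its threads at the OS level, which is exactly why Antoine notes it can only be an application-specific decision.]

```python
import os

# Hypothetical illustration: pin the whole process to a single core so that
# all Python threads share one CPU and the GIL never bounces between cores.
# os.sched_setaffinity is Linux-only (Python 3.3+); on other platforms this
# sketch simply does nothing.

def pin_to_one_core():
    if not hasattr(os, "sched_setaffinity"):
        return None                      # unsupported platform (e.g. Windows, OS X)
    cpus = os.sched_getaffinity(0)       # CPUs currently allowed for this process
    one = {min(cpus)}                    # pick an arbitrary single core
    os.sched_setaffinity(0, one)         # restrict the process (and its threads) to it
    return one

original = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else None
pinned = pin_to_one_core()
print("pinned to:", pinned)
if original is not None:
    os.sched_setaffinity(0, original)    # restore, so the demo is side-effect free
```

Note the trade-off Antoine points out: this also prevents C extensions that release the GIL from running computations in parallel.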

Hi, Le dimanche 16 mai 2010 22:07:06, Nir Aides a écrit :
*Evolving the GIL into a scheduler*
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it. The patches by Antoine and David essentially evolve the GIL into a scheduler, however both cause thread starvation or high rate of context switching in some scenarios (see data below).
I haven't followed the latest developments around the GIL. Can you explain why Python should have its own scheduler when each OS already has one? The OS has useful information to help with scheduling that Python doesn't have. Linux and FreeBSD schedulers have been getting faster every year for... 5 years?, especially with multiple CPUs/cores. -- Victor Stinner http://www.haypocalc.com/

On Sun, May 16, 2010 at 22:52, Victor Stinner <victor.stinner@haypocalc.com> wrote:
I haven't followed the latest developments around the GIL. Can you explain why Python should have its own scheduler when each OS already has one?
Because the GIL locks and unlocks threads, in practice it already has one. But the scheduler is so simplistic it ends up fighting with the OS scheduler, and a large amount of CPU time is used up switching instead of executing. Having a proper scheduler fixes this. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
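[Editorial note: the effect Lennart describes can be sketched in a few lines, in the spirit of Dave Beazley's demos (not a faithful reproduction of them). A pure-Python CPU-bound thread holds the GIL between switch points, so a thread that just wants to wake from a short sleep may observe far more latency than it asked for. Numbers vary wildly by machine and interpreter version.]

```python
import threading
import time

# One CPU-bound thread spinning in pure Python; the main thread asks for a
# 1 ms sleep and measures how long it actually waited. Under the old GIL,
# the wakeup could be delayed by whole check intervals of bytecode.

stop = False

def spin():
    n = 0
    while not stop:
        n += 1          # pure-Python work: holds the GIL between switch points

t = threading.Thread(target=spin)
t.start()

start = time.time()
time.sleep(0.001)       # ask for 1 ms
elapsed = time.time() - start

stop = True
t.join()
print("requested 1 ms, slept for %.1f ms" % (elapsed * 1000))
```

On a multi-core machine the observed latency also depends on whether the OS scheduler keeps migrating the threads between cores, which is the "fighting" referred to above.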

On Mon, May 17, 2010 at 14:12, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 May 2010 08:28:08 +0200 Lennart Regebro <regebro@gmail.com> wrote:
But the scheduler is so simplistic it ends up fighting with the OS scheduler, and a large amount of CPU time is used up switching instead of executing.
This is already fixed with py3k.
Are you referring to the "New GIL"? -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64

On Mon, 17 May 2010 14:47:25 +0200 Lennart Regebro <regebro@gmail.com> wrote:
On Mon, May 17, 2010 at 14:12, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 May 2010 08:28:08 +0200 Lennart Regebro <regebro@gmail.com> wrote:
But the scheduler is so simplistic it ends up fighting with the OS scheduler, and a large amount of CPU time is used up switching instead of executing.
This is already fixed with py3k.
Are you referring to the "New GIL"?
Yes.

On Mon, May 17, 2010 at 15:05, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 May 2010 14:47:25 +0200 Lennart Regebro <regebro@gmail.com> wrote:
On Mon, May 17, 2010 at 14:12, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 17 May 2010 08:28:08 +0200 Lennart Regebro <regebro@gmail.com> wrote:
But the scheduler is so simplistic it ends up fighting with the OS scheduler, and a large amount of CPU time is used up switching instead of executing.
This is already fixed with py3k.
Are you referring to the "New GIL"?
Yes.
As has been shown, it also in certain cases will race with the OS scheduler, so this is not already fixed, although apparently improved, if I understand correctly. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64

On Tue, 18 May 2010 08:45:41 +0200 Lennart Regebro <regebro@gmail.com> wrote:
Are you referring to the "New GIL"?
Yes.
As has been shown, it also in certain cases will race with the OS scheduler, so this is not already fixed, although apparently improved, if I understand correctly.
"Race" is a strange term here and I'm not sure what you mean. The issue found out by Dave Beazley can't be reasonably described by this word, I think. Please read and understand the issue report mentioned by Nir before trying to make statements based on rumours heard here and there. Antoine.

On Tue, May 18, 2010 at 12:53, Antoine Pitrou <solipsis@pitrou.net> wrote:
"Race" is a strange term here and I'm not sure what you mean. The issue found out by Dave Beazley can't be reasonably described by this word, I think.
OK, maybe "race" is the wrong word. But that doesn't mean the issue doesn't exist.
Please read and understand the issue report mentioned by Nir before trying to make statements based on rumours heard here and there.
Oh, so Dave Beazley's reports are a rumour now. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64

Le mardi 18 mai 2010 à 14:16 +0200, Lennart Regebro a écrit :
Please read and understand the issue report mentioned by Nir before trying to make statements based on rumours heard here and there.
Oh, so Dave Beazley's reports are a rumour now.
Your and other people's grandiloquent interpretation of them is. Now let me ask you a question: did you witness some of the effects mentioned here? Did it disturb the proper functioning of one of your applications, programs or services? If yes, please be so kind as to explain how; it will be a useful data point. Bonus points if the issue affects Python 3.2, since that's really what Nir is talking about. If not, then do you have any valuable information to contribute to this discussion?

On Tue, May 18, 2010 at 14:52, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le mardi 18 mai 2010 à 14:16 +0200, Lennart Regebro a écrit :
Please read and understand the issue report mentioned by Nir before trying to make statements based on rumours heard here and there.
Oh, so Dave Beazley's reports are a rumour now.
Your and other people's grandiloquent interpretation of them is.
Now let me ask you a question: did you witness some of the effects mentioned here? Did it disturb the proper functioning of one of your applications, programs or services? If yes, please be so kind as to explain how; it will be a useful data point. Bonus points if the issue affects Python 3.2, since that's really what Nir is talking about.
If not, then do you have any valuable information to contribute to this discussion?
I doubt anything I say can be less constructive than your rude comments. -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64

Lennart Regebro wrote:
On Tue, May 18, 2010 at 14:52, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le mardi 18 mai 2010 à 14:16 +0200, Lennart Regebro a écrit :
Please read and understand the issue report mentioned by Nir before trying to make statements based on rumours heard here and there. Oh, so Dave Beazley's reports are a rumour now. Your and other people's grandiloquent interpretation of them is.
Now let me ask you a question: did you witness some of the effects mentioned here? Did it disturb the proper functioning of one of your applications, programs or services? If yes, please be so kind as to explain how; it will be a useful data point. Bonus points if the issue affects Python 3.2, since that's really what Nir is talking about.
If not, then do you have any valuable information to contribute to this discussion?
I doubt anything I say can be less constructive than your rude comments.
I can understand why Antoine is offended: it's his implementation that you attacked. You literally said "As has been shown, it also in certain cases will race with the OS scheduler, so this is not already fixed", claiming that it is not fixed. I believe Antoine does consider it fixed, on the grounds that all counter-examples provided so far are made-up toy examples, rather than actual applications that still suffer from the original problems. So please join us in considering the issue fixed unless you can provide a real-world example that demonstrates the contrary. Regards, Martin

On Tue, May 18, 2010 at 3:43 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
So please join us in considering the issue fixed unless you can provide a real-world example that demonstrates the contrary.
The server software I maintain (openrpg) experienced this issue when I tried porting the server code to 3.2. Granted, it was only a 5 to 7% speed drop relative to single core, whereas with the old GIL (py2.5) there was a 25% to 30% speed drop (it can be running up to 300 I/O-bound threads and 30 CPU-bound threads), so a net improvement of about 20%, and I am more than happy with the new GIL. I think the new GIL should be given a year or so in the wild before you start trying to optimize theoretical issues you may run into. If in a year people come back and have some examples of where a proper scheduler would help improve speed on multi-core systems even more, then we can address the issue at that time.

I think the new GIL should be given a year or so in the wild before you start trying to optimize theoretical issues you may run into. If in a year people come back and have some examples of where a proper scheduler would help improve speed on multi-core systems even more, then we can address the issue at that time.
Exactly my feelings. Regards, Martin

On Tue, 18 May 2010 21:43:30 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
I can understand why Antoine is offended: it's his implementation that you attacked. You literally said "As has been shown, it also in certain cases will race with the OS scheduler, so this is not already fixed", claiming that it is not fixed
I believe Antoine does consider it fixed, on the grounds that all counter-examples provided so far are made-up toy examples, rather than actual applications that still suffer from the original problems.
Ok, this is a good opportunity to try to sum up, from my point of view. The main problem of the old GIL, which was evidenced in Dave's original study (not this year's, but the previous one) *is* fixed unless someone demonstrates otherwise. It should be noted that witnessing a slight performance degradation on a multi-core machine is not enough to demonstrate such a thing. The degradation could be caused by other factors, such as thread migration, bad OS behaviour, or even locking peculiarities in your own application, which are not related to the GIL. A good test is whether performance improves if you play with sys.setswitchinterval(). Dave's newer study regards another issue, which I must stress is also present in the old GIL algorithm, and therefore must have affected, if it is serious, real-world applications in 2.x. And indeed, the test I recently added to ccbench evidences the huge drop in socket I/Os per second when there's a background CPU thread; this test exercises the same situation as Dave's demos, only with a less trivial CPU workload:

== CPython 2.7b2+.0 (trunk:81274M) ==
== x86_64 Linux on 'x86_64' ==
--- I/O bandwidth ---
Background CPU task: Pi calculation (Python)
CPU threads=0: 23034.5 packets/s.
CPU threads=1: 6.4 ( 0 %)
CPU threads=2: 15.7 ( 0 %)
CPU threads=3: 13.9 ( 0 %)
CPU threads=4: 20.8 ( 0 %)

(note: I've just changed my desktop machine, so these figures are different from what I've posted weeks or months ago) Regardless of the fact that apparently no one reported it in real-world conditions, we *could* decide that the issue needs fixing. If we decide so, Nir's approach is the most rigorous one: it tries to fix the problem thoroughly, rather than graft on an additional heuristic. Nir has also tested his patch on a variety of machines, more so than Dave and I did with our own patches; he is obviously willing to go forward.
Right now, there are two problems with Nir's proposal:
- first, what Nick said: the difficulty of having reliable high-precision cross-platform time sources, which are necessary for the BFS algorithm. Ironically, timestamp counters have their own problems on multi-core machines (they can go out of sync between CPUs). gettimeofday() and clock_gettime() may be precise enough on most Unices, though.
- second, the BFS algorithm is not that well-studied, since AFAIK it was refused for inclusion in the Linux kernel; someone in the python-dev community would therefore have to make sense of, and evaluate, its heuristic.
I also don't consider my own patch a very satisfactory "solution", although it has the reassuring quality of being simple and short (and easy to revert!). That said, most of us are programmers and we love to invent ways of fixing technical issues. It sometimes leads us to consider some things issues even when they are mostly theoretical. This is why I am lukewarm on this. I think interested people should focus on real-world testing (rather than Dave's and my synthetic tests) of the new GIL, with or without the various patches, and share the results. Otherwise, Dj Gilcrease's suggestion of waiting for third-party reports is also a very good one. Regards Antoine.
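[Editorial note: Antoine's suggested diagnostic — "play with sys.setswitchinterval()" — can be sketched as follows. The idea is to time a fixed two-thread CPU-bound workload under different GIL switch intervals; if the timings vary noticeably with the interval, the slowdown is likely GIL-related rather than caused by the OS or the application's own locking. sys.setswitchinterval exists only with the new GIL (Python 3.2+); on 2.x the rough analogue is sys.setcheckinterval, which counts bytecode instructions instead of seconds.]

```python
import sys
import threading
import time

def work(n=200000):
    # Pure-Python CPU-bound loop: competes for the GIL with its sibling thread.
    x = 0
    for i in range(n):
        x += i
    return x

def timed_run(interval):
    # Run the workload in two threads under the given GIL switch interval
    # and return the wall-clock time taken.
    has_new_gil = hasattr(sys, "setswitchinterval")
    if has_new_gil:
        old = sys.getswitchinterval()
        sys.setswitchinterval(interval)
    threads = [threading.Thread(target=work) for _ in range(2)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    if has_new_gil:
        sys.setswitchinterval(old)      # restore the default
    return elapsed

for interval in (0.0001, 0.005, 0.05):
    print("switch interval %g s -> %.3f s" % (interval, timed_run(interval)))
```

A very short interval should increase context-switch overhead; a very long one should starve whichever thread doesn't hold the GIL. Stable timings across intervals suggest the bottleneck is elsewhere.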

On Tue, May 18, 2010 at 4:10 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 18 May 2010 21:43:30 +0200
<snip>
Regardless of the fact that apparently no one reported it in real-world conditions, we *could* decide that the issue needs fixing. If we decide so, Nir's approach is the most rigorous one: it tries to fix the problem thoroughly, rather than graft on an additional heuristic. Nir has also tested his patch on a variety of machines, more so than Dave and I did with our own patches; he is obviously willing to go forward.
Right now, there are two problems with Nir's proposal:
- first, what Nick said: the difficulty of having reliable high-precision cross-platform time sources, which are necessary for the BFS algorithm. Ironically, timestamp counters have their own problems on multi-core machines (they can go out of sync between CPUs). gettimeofday() and clock_gettime() may be precise enough on most Unices, though.
- second, the BFS algorithm is not that well-studied, since AFAIK it was refused for inclusion in the Linux kernel; someone in the python-dev community would therefore have to make sense of, and evaluate, its heuristic.
I don't have the expertise to do this, but I'll be playing with the patch over the next few weeks, so if there's a specific piece of data you want, let me know. Geremy Condra
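[Editorial note: the time-source problem raised above was later addressed in the stdlib itself. time.monotonic() (Python 3.3+, PEP 418 — not available at the time of this thread) wraps a monotonic OS clock such as clock_gettime(CLOCK_MONOTONIC) on Linux, which never goes backwards even if the wall clock is adjusted by NTP or the user. A scheduler deadline computed from it is therefore safe against clock jumps, which is exactly what a BFS-style deadline scheduler needs. A minimal sketch:]

```python
import time

# Compute a scheduling deadline from the monotonic clock and wait for it.
# Unlike time.time(), time.monotonic() cannot jump backwards, so the
# deadline comparison below is always well-defined.

t0 = time.monotonic()
deadline = t0 + 0.005           # e.g. a 5 ms scheduling quantum
while time.monotonic() < deadline:
    pass                        # busy-wait purely for illustration
t1 = time.monotonic()
print("waited %.1f ms" % ((t1 - t0) * 1000))
```

The cross-platform concern remains real for C code: each platform's best monotonic source (clock_gettime, mach_absolute_time, QueryPerformanceCounter) has different resolution and cost, which is part of what made the BFS patch hard to evaluate.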

2010/5/16 Nir Aides <nir@winpdb.org>
*What can be done with it?*
Here are some options: 1) Abandon it - no one is interested, yawn. 2) Take ideas and workarounds from its code and apply to other patches. 3) Include it in the interpreter as an auxiliary (turn on with a runtime switch) scheduler. 4) Adopt it as the Python scheduler.
*Opinion?*
I would like to have the possibility to "./configure --without-broken-GIL" or "./configure --with-bfs-scheduler" in Python 2.7, like we have "./configure --with-computed-gotos" in 3.2. It would give experienced users (or distribution packagers) the opportunity to enable the threading optimizations on their platform, while preserving the current behaviour by default. It would make it more likely that people test this experimental configure option and report any issues they find, so they can be fixed and improved in the next bugfix version (2.7.1). Since the 2.7 branch will have long-term support, it makes sense to support multi-core platforms. And more users of the fixed GIL means more bugfixes (even for the 3.x branch). -- Florent

Florent Xicluna wrote:
I would like to have the possibility to "./configure --without-broken-GIL" or "./configure --with-bfs-scheduler" in Python 2.7, like we have "./configure --with-computed-gotos" in 3.2.
It would give experienced users (or distribution packagers) the opportunity to enable the threading optimizations on their platform, while preserving the current behaviour by default.
It would make it more likely that people test this experimental configure option and report any issues they find, so they can be fixed and improved in the next bugfix version (2.7.1). Since the 2.7 branch will have long-term support, it makes sense to support multi-core platforms.
And more users of the fixed GIL means more bugfixes (even for the 3.x branch).
Would you suggest extending this as far as providing "new GIL" binaries for Windows and Mac OS X? (If that was the case, it begins to sound a lot like http://bugs.python.org/issue7753 with reversion to the old GIL on non-NT/POSIX platforms) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

I would like to restart this thread with 2 notes from the lively discussion: a) Issue 7946 (and this discussion?) concerns Python 3.2 b) The GIL problems are not specific to OSX. The old and new GIL misbehave on GNU/Linux and Windows too. [Putting on anti-frying-pan helmet] Nir

Nir Aides wrote:
I would like to restart this thread with 2 notes from the lively discussion:
a) Issue 7946 (and this discussion?) concerns Python 3.2 b) The GIL problems are not specific to OSX. The old and new GIL misbehave on GNU/Linux and Windows too.
I think Antoine and Bill went off on a bit of a tangent that *is* specific to Mac OS X and the old GIL (where a Python application not only fails to take advantage of additional cores, but actually runs slower than it does on a less powerful single-core machine). The convoying problem identified in issue 7946 does indeed apply to the new GIL on multiple platforms. Without reviewing either proposed patch in detail, I personally am slightly inclined to favour David's suggested solution, both because I better understand the explanation of how it works (and simplicity is a virtue from a maintainability point of view), and because the BFS approach appears to run into trouble when it comes to identifying a suitable cross-platform time reference. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

On Sun, May 16, 2010 at 1:07 PM, Nir Aides <nir@winpdb.org> wrote:
Relevant Python issue: http://bugs.python.org/issue7946
Is there any chance Antoine's "gilinter" patch from that issue might be applied to Python 2.7? I have been experiencing rare long delays in simple I/O operations in multithreaded Python applications, and I suspect that they might be related to this issue. -Mike

On Tue, 18 May 2010 14:39:43 -0700 Mike Klaas <mike.klaas@gmail.com> wrote:
On Sun, May 16, 2010 at 1:07 PM, Nir Aides <nir@winpdb.org> wrote:
Relevant Python issue: http://bugs.python.org/issue7946
Is there any chance Antoine's "gilinter" patch from that issue might be applied to python 2.7? I have been experiencing rare long delays in simple io operations in multithreaded python applications, and I suspect that they might be related to this issue.
There's no chance for this since the patch relies on the new GIL. (that's unless there's a rush to backport the new GIL in 2.7, of course) I think your "rare long delays" might be related to the old GIL's own problems, though. How long are they? Regards Antoine.

On Tue, May 18, 2010 at 2:50 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
There's no chance for this since the patch relies on the new GIL. (that's unless there's a rush to backport the new GIL in 2.7, of course)
Thanks I missed that detail.
I think your "rare long delays" might be related to the old GIL's own problems, though. How long are they?
Typically between 20 and 60s. This is the time it takes to send and receive a single small packet on an already-active TCP connection to ensure it is still alive. Most of the time it is < 1ms. I don't have strong evidence that GIL issues are causing the problem, because I can't reliably reproduce it. But the general setup is similar (one thread doing light I/O experiencing odd delays in a process with multiple threads that are often CPU-bound, on a multi-core machine) thanks, -Mike

On Tue, 18 May 2010 17:26:44 -0700 Mike Klaas <mike.klaas@gmail.com> wrote:
I think your "rare long delays" might be related to the old GIL's own problems, though. How long are they?
Typically between 20 and 60s.
You mean milliseconds I suppose? If it's the case, then you may simply be witnessing garbage collection runs. I've measured garbage collection runs of about 50 ms each on a Web application, with the full framework loaded and a bunch of objects in memory. If you really meant seconds, it looks a bit too high to be GIL-related. What kind of things are the CPU threads doing? Regards Antoine.
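[Editorial note: Antoine's garbage-collection hypothesis can be checked directly in modern Python. gc.callbacks (Python 3.3+ — at the time of this thread one would have had to infer pauses from gc.set_debug(gc.DEBUG_STATS) output) invokes registered callbacks at the start and stop of every collection, which makes it easy to time the pauses he estimates at ~50 ms for a loaded web application. A sketch:]

```python
import gc
import time

# Time every garbage collection by hooking gc.callbacks: each callback is
# called with phase "start" before a collection and "stop" after it.

pauses = []
_start = [None]

def timer(phase, info):
    if phase == "start":
        _start[0] = time.time()
    elif phase == "stop" and _start[0] is not None:
        pauses.append(time.time() - _start[0])
        _start[0] = None

gc.callbacks.append(timer)
junk = [[i] for i in range(100000)]   # allocate plenty of collectable objects
del junk
gc.collect()                          # force at least one full collection
gc.callbacks.remove(timer)

print("observed %d collection(s), longest pause %.3f ms"
      % (len(pauses), max(pauses) * 1000))
```

If an application's mystery delays line up with these pauses (and shrink when gc.freeze() or tuned thresholds are used), the GIL is not the culprit.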
participants (13)
- "Martin v. Löwis"
- Antoine Pitrou
- Bill Janssen
- Dj Gilcrease
- Florent Xicluna
- geremy condra
- Lennart Regebro
- Mike Klaas
- Nick Coghlan
- Nir Aides
- Stefan Behnel
- Terry Reedy
- Victor Stinner