
Hello,

As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite poor behaviour (the old GIL isn't very good either, but still a little better). This issue is tracked at http://bugs.python.org/issue7946

The situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound threads are also running. Each time the IO-bound thread releases the GIL, a CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (the default setting), which limits the number of individual packets which can be recv()'ed and processed per second.

I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound thread":

- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.

- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.

Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.

Any thoughts?

Regards

Antoine.
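The effect is easy to reproduce. The following is a minimal sketch of the scenario — a toy stand-in for Dave Beazley's benchmark, not his actual code — bouncing UDP packets over loopback while a pure-Python thread spins:

```python
import socket
import threading
import time

def spin(stop):
    # CPU-bound thread: runs pure-Python bytecode, so it holds the GIL
    # for whole switch intervals (5 ms by default) at a time.
    while not stop.is_set():
        sum(i * i for i in range(10000))

def measure_io_throughput(duration=0.25):
    # IO-bound thread body: bounce UDP packets off ourselves over loopback.
    # Every recvfrom() releases the GIL and must then win it back.
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(("127.0.0.1", 0))
    recv_sock.settimeout(1.0)
    addr = recv_sock.getsockname()
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    count = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        send_sock.sendto(b"x", addr)
        recv_sock.recvfrom(16)
        count += 1
    recv_sock.close()
    send_sock.close()
    return count

stop = threading.Event()
threading.Thread(target=spin, args=(stop,)).start()
packets = measure_io_throughput()
stop.set()
print(packets)  # markedly lower with the CPU thread running than without it
```

Comparing the packet count with and without the spin() thread shows the throughput collapse being discussed.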

On Sat, Mar 13, 2010 at 13:52, Benjamin Peterson <benjamin@python.org> wrote:

2010/3/13 Antoine Pitrou <solipsis@pitrou.net>:
> Any thoughts?

The latter solution seems best to me as it would help any 3rd party IO libraries and require less code modification.

--
Regards,
Benjamin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev

Plus the interactiveness approach has been tested by OS thread schedulers for years and is shown to work. So I vote for the latter approach as well.

-Brett

Brett Cannon wrote:
On Sat, Mar 13, 2010 at 13:52, Benjamin Peterson <benjamin@python.org> wrote:

2010/3/13 Antoine Pitrou <solipsis@pitrou.net>:
> Any thoughts?

The latter solution seems best to me as it would help any 3rd party IO libraries and require less code modification.
Plus the interactiveness approach has been tested by OS thread schedulers for years and is shown to work. So I vote for the latter approach as well.
+1. Cheap to calculate, and it means developers don't need to guess whether they should be using the "I am CPU bound" or the "I am IO bound" GIL acquisition macro.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
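The "cheap to calculate" bookkeeping could be as simple as an exponential moving average per thread. This is a guess at the shape of the heuristic, not the actual code from the patch on issue 7946; the threshold and smoothing factor are assumptions:

```python
# A guess at the shape of the interactiveness heuristic, NOT the actual
# code from http://bugs.python.org/issue7946: keep an exponential moving
# average per thread of how often it gave the GIL up voluntarily.
class ThreadStats:
    def __init__(self):
        self.interactiveness = 0.0  # 0.0 = CPU-bound, 1.0 = fully interactive

    def on_gil_release(self, held_for, switch_interval=0.005):
        # Releasing well before the forced 5 ms switch suggests the thread
        # blocked on IO; count that as an "interactive" release.
        voluntary = 1.0 if held_for < switch_interval else 0.0
        alpha = 0.1  # smoothing factor (assumed, not from the patch)
        self.interactiveness += alpha * (voluntary - self.interactiveness)

    def may_steal_gil(self, threshold=0.9):
        # Interactive threads get to jump the queue for the GIL.
        return self.interactiveness > threshold

fast = ThreadStats()
for _ in range(30):
    fast.on_gil_release(held_for=0.0001)  # always blocks early
print(round(fast.interactiveness, 2), fast.may_steal_gil())  # → 0.96 True
```

A thread that always runs its full interval never accumulates a score, so it can never pre-empt; only threads that habitually block get the fast path.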

On 13Mar2010 23:03, Martin v. Löwis <martin@v.loewis.de> wrote:
| > Any thoughts?
|
| My initial reaction to this report, and still my current opinion is:
| This issue is not a problem.
| It's a boundary case, so it's not clear whether Python should be able to
| deal with it gracefully. I doubt it's a realistic case. How was it tripped over?

Speaking for myself, I have an app with a daemon mode which I expect will sometimes behave as described; it answers requests and thus has I/O bound threads waiting for requests and dispatching replies, and threads doing data handling, which make constant use of the zlib library. On the client side the same app is often throughput bound by a data examination process that is compute bound; I can easily see it having compute bound threads and I/O bound threads talking to daemon instances.

Currently all the above is somewhat "batchy"; the client end looks like an archiving command line tool, but it's intended to have a FUSE mode to present the archive as a filesystem; that I can imagine tripping over this issue as a user uses the filesystem for "stuff".

Still, it is unusual and I suspect it will self-limit to an extent anyway; I can't fairly claim to have had it cause me trouble yet. (Indeed, I've not used the new GIL at all - it's mostly using python 2.6.)

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

I very strongly suggest that you periodically place ice packs over the abused areas. - Steve Garnier

Speaking for myself, I have an app with a daemon mode which I expect will sometimes behave as described; it answers requests and thus has I/O bound threads waiting for requests and dispatching replies, and threads doing data handling, which make constant use of the zlib library.
This is then already different from the scenario described. Every call into zlib will release the GIL, for the period of the zlib computation, allowing other threads to run.
On the client side the same app is often throughput bound by a data examination process that is compute bound; I can easily see it having compute bound threads and I/O bound threads talking to daemon instances.
I can't see that. I would expect that typically (and specifically including your application), the compute-bound threads will synchronize with the IO-bound ones, asking for more requests to perform. That's the whole point of using the "bound" adjective: execution time is *bound* by the computation time. It can't get faster than what the computation can process.

Regards,
Martin

On 14Mar2010 19:32, "Martin v. Löwis" <martin@v.loewis.de> wrote:
| > Speaking for myself, I have an app with a daemon mode which I expect
| > will sometimes behave as described; it answers requests and thus has I/O
| > bound threads waiting for requests and dispatching replies, and threads
| > doing data handling, which make constant use of the zlib library.
|
| This is then already different from the scenario described. Every call
| into zlib will release the GIL, for the period of the zlib computation,
| allowing other threads to run.
|
| > On the client side the same app is often throughput bound by a data
| > examination process that is compute bound; I can easily see it having
| > compute bound threads and I/O bound threads talking to daemon instances.
|
| I can't see that. I would expect that typically (and specifically
| including your application), the compute bound threads will synchronize
| with the IO bound ones, asking for more requests to perform.

In the single threaded case, sure. The usual command line "archive this" mode fits this. But...

| That's the whole point of using the "bound" adjective: execution
| time is *bound* by the computation time. It can't get faster than what
| the computation can process.

If it's lurking behind a filesystem interface or in its daemon mode (remote archive store), multiple client processes can be using it at once, and it will be processing multiple tasks somewhat in parallel. Here one can get a compute bound thread answering one request, impacting quick response to other parallel-and-cheap requests.

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

Here was a man who not only had a ready mind and a quick wit, but could also sing. - _Rope_

If it's lurking behind a filesystem interface or in its daemon mode (remote archive store), multiple client processes can be using it at once, and it will be processing multiple tasks somewhat in parallel. Here one can get a compute bound thread answering one request, impacting quick response to other parallel-and-cheap requests.
However, "impacting" will always be the case. *Of course* any thread that performs computation impacts everything else on the same processor. There is nothing to prevent that, unless you schedule the long-running activity at a point in time where you know the system will be idle for the entire run of that task. Regards, Martin

On 16Mar2010 09:02, "Martin v. Löwis" <martin@v.loewis.de> wrote:
| > If it's lurking behind a filesystem interface or in its daemon mode
| > (remote archive store), multiple client processes can be using it at once,
| > and it will be processing multiple tasks somewhat in parallel. Here one
| > can get a compute bound thread answering one request, impacting quick
| > response to other parallel-and-cheap requests.
|
| However, "impacting" will always be the case. *Of course* any thread
| that performs computation impacts everything else on the same processor.
| There is nothing to prevent that, unless you schedule the long-running
| activity at a point in time where you know the system will be idle for
| the entire run of that task.

Sure, but one has only to dip one's toe in the long-running Linux interactivity scheduling wars to know that there's a _lot_ of attraction in a scheduling approach that favours the usually-blocked task over the CPU-bound task; various approaches, from the early low-latency patches to the later in-the-scheduler changes, were greeted with much rejoicing.

Certainly there's only so much CPU to go around, but if the tasks blocked waiting for something to happen get to run more immediately when an event does occur, you tend to get better "responsiveness".

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

Be smart, be safe, be paranoid. - Ryan Cousineau, courier@compdyn.com DoD#863, KotRB, KotKWaWCRH

Martin v. Löwis <martin@v.loewis.de> wrote:
If it's lurking behind a filesystem interface or in its daemon mode (remote archive store), multiple client processes can be using it at once, and it will be processing multiple tasks somewhat in parallel. Here one can get a compute bound thread answering one request, impacting quick response to other parallel-and-cheap requests.
However, "impacting" will always be the case. *Of course* any thread that performs computation impacts everything else on the same processor.
That's not the problem, Martin. The problem is that it also impacts other threads on other cores, because of the interlock between the OS thread scheduler and the internal Pythonic struggle to obtain the GIL. And it shouldn't. Bill

Cameron Simpson <cs@zip.com.au> wrote:
Currently all the above is somewhat "batchy"; the client end looks like an archiving command line tool, but it's intended to have a FUSE mode to present the archive as a filesystem; that I can imagine tripping over this issue as a user uses the filesystem for "stuff".
Me, too :-). Though I also do a lot of processing in pure Python...
Still, it is unusual and I suspect it will self limit to an extent anyway; I can't fairly claim to have had it cause me trouble yet. (Indeed, I've not used the new GIL at all - it's mostly using python 2.6).
The old GIL is causing me a great deal of grief. My server is heavily threaded, and when I get a number of Python 2.6 threads doing layout analysis, my 4 and 8 core machines turn into slow thrashers. I'm dismayed that the backport of the new GIL work to the 2.x line has been stopped (see issue 7753); I see a future of OS X 10.7 with presumably 64-bit Python 2.7 on, say, 32-core machines running even more slowly than it currently does. That's a pretty hard-core backwards compatibility decision. Maybe the Mac-SIG should adopt that patch... Bill

Le Sun, 14 Mar 2010 12:37:48 PDT, Bill Janssen <janssen@parc.com> a écrit :
The old GIL is causing me a great deal of grief. My server is heavily threaded, and when I get a number of Python 2.6 threads doing layout analysis, my 4 and 8 core machines turn into slow thrashers.
Have you checked whether the 2.x patch improved your use case? It would be nice to get real-world feedback on which situations the new GIL helps (or doesn't help) improve.

Thanks,
Antoine.

Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Sun, 14 Mar 2010 12:37:48 PDT, Bill Janssen <janssen@parc.com> a écrit :
The old GIL is causing me a great deal of grief. My server is heavily threaded, and when I get a number of Python 2.6 threads doing layout analysis, my 4 and 8 core machines turn into slow thrashers.
Have you checked whether the 2.x patch improved your use case? It would be nice to get real-world feedback of which situations the new GIL helps (or doesn't help) improve.
Yes, good thought. I'll try it and see if I can get some numbers. I do have a mix of I/O and CPU bound threads, and we are seeing core usage by Python similar to that David originally discovered. Bill

On Sat, Mar 13, 2010 at 3:46 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
I like the second approach as well, assuming "interactiveness" can be computed cheaply. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

If the 5 ms interval is too long, would it be possible to adaptively reduce it in this situation? How would you detect it? I assume this would be akin to your interactiveness computation.

Skip
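For reference, the new GIL's switch interval is already runtime-tunable on the py3k branch, even if not adaptively; shrinking it globally is the blunt version of the idea, trading throughput for latency:

```python
import sys

# With the new GIL (py3k), the switch interval is runtime-tunable; this
# is the blunt, global version of the adaptive reduction discussed above.
default = sys.getswitchinterval()     # 0.005 on a stock build
sys.setswitchinterval(0.0005)         # 0.5 ms: faster handoff, more overhead
shortened = sys.getswitchinterval()
sys.setswitchinterval(default)        # restore the default
print(default, shortened)
```

Lowering it helps IO threads get the GIL back sooner, but it penalizes all CPU-bound code uniformly, which is what the adaptive proposals try to avoid.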

Le Sat, 13 Mar 2010 17:11:41 -0600, skip@pobox.com a écrit :
If the 5ms interval is too long would it be possible to adaptively reduce it in this situation? How would you detect it? I assume this would be akin to your interactiveness computation.
I think modulating the interval is what Dave Beazley's own patch proposal does. It has a full-blown priority system. It is more complicated than either of the two approaches I am proposing, though. (Also, per Dave's own admission, his patch is a proof-of-concept rather than a finished piece of work.)

There are two possible problems with Dave's benchmark:

1) On my system, setting the TCP_NODELAY option on the accepted server socket changes results dramatically.

2) What category of socket servers is Dave's spin() function intended to simulate? In a server which involves CPU intensive work in response to a socket request, the behavior may be significantly different. In such a system, high CPU load will significantly reduce socket responsiveness, which in turn will reduce CPU load and increase socket responsiveness. Testing with a modified server that reflects the above indicates the new GIL behaves just fine in terms of throughput. So a change to the GIL may not be required at all.

There is still the question of latency - a single request which takes a long time to process will affect the latency of other "small" requests. However, it can be argued whether such a scenario is practical, or whether modifying the GIL is the solution.

If a change is still required, then I vote for the simpler approach - that of having a special macro for socket code. I remember there was reluctance in the past to repeat the OS scheduling functionality, and for a good reason.

Nir

On Sat, Mar 13, 2010 at 11:46 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite a poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound threads":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.
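For anyone wanting to reproduce Nir's point (1): disabling Nagle's algorithm on the accepted connection is a two-line change. The server skeleton below is a hypothetical minimal repro, not Dave's actual benchmark code:

```python
import socket

# Hypothetical minimal repro of Nir's point (1): set TCP_NODELAY on the
# accepted connection to disable Nagle's algorithm, which batches small
# writes and can dominate the timing of request/response benchmarks.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.create_connection(server.getsockname())
conn, addr = server.accept()
conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = conn.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(bool(nodelay))
for s in (conn, client, server):
    s.close()
```

With Nagle enabled, small request/response exchanges can stall on the TCP stack's delayed-ACK timer rather than on the GIL, which is exactly why the benchmark's results shift so dramatically.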

On 3/14/10 7:31 AM, "Nir Aides" <nir@winpdb.org> wrote:
There are two possible problems with Dave's benchmark:
1) On my system setting TCP_NODELAY option on the accepted server socket changes results dramatically.

Could you document what you saw and explain how you think TCP_NODELAY makes a difference, including what kind of system you ran your tests on and what the application was that demonstrated those dramatic results?

2) What category of socket servers is Dave's spin() function intended to simulate?

What is the problem you are trying to get at with this question? Does Dave's spin() function have to have a category? Or perhaps the question is, for these solutions, what category of application do they hurt? Perhaps we can look at the solutions as general, but consider their impact in various categories.

In a server which involves CPU intensive work in response to a socket request the behavior may be significantly different. In such a system, high CPU load will significantly reduce socket responsiveness which in turn will reduce CPU load and increase socket responsiveness.

Not sure I follow how high CPU load will in turn reduce CPU load. :) Can you explain more about what you are trying to say here?

Testing with a modified server that reflects the above indicates the new GIL behaves just fine in terms of throughput. So a change to the GIL may not be required at all.

Are you saying that a change to the new GIL may not be required at all? Did your modified server run any worse with any of the proposed changes? Could you document what you saw?

There is still the question of latency - a single request which takes a long time to process will affect the latency of other "small" requests. However, it can be argued whether such a scenario is practical, or whether modifying the GIL is the solution.

Perhaps Dave already documented this effect in his visualizations, no?

If a change is still required, then I vote for the simpler approach - that of having a special macro for socket code.

What is that simpler approach? How would that special macro work?

I remember there was reluctance in the past to repeat the OS scheduling functionality and for a good reason.

Folks, do we believe that every other application that has contention for one resource leaves the solution of that problem to the OS scheduler? In what ways do we consider the CPython interpreter to be different from another application that has multiple threads and contention for one resource? Perhaps we have a unique problem among all other user space applications. Perhaps we don't.

Do we see Antoine's or Dave's proposed solutions as being solutions to a niche problem? Or are they algorithms for handling one or more of the behaviors the CPython interpreter will encounter when running the many Python applications across OS/CPU environments?

As for the behavior of the GIL, how are the proposed solutions repeating OS scheduling functionality? Could we instead look at these solutions as repeating what other applications have done that have contention for one resource? Can anyone show that Antoine's proposed solution hurts performance?

-peter

Replying inline. On Sun, Mar 14, 2010 at 3:54 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
On 3/14/10 7:31 AM, "Nir Aides" <nir@winpdb.org> wrote:
There are two possible problems with Dave's benchmark:
1) On my system setting TCP_NODELAY option on the accepted server socket changes results dramatically.
Could you document what you saw and explain how you think TCP_NODELAY makes a difference, including what kind of system you ran your tests and what the application was that demonstrates those dramatic results?
I first disabled the call to spin() but client running time remained around 30 seconds. I then added TCP_NODELAY and running time dropped to a few dozen milliseconds for the entire no-spin run. The system is Ubuntu Karmic 64-bit with the latest revision of Python 3.2.

2) What category of socket servers is Dave's spin() function intended to simulate?

What is the problem you are trying to get at with this question? Does Dave's spin() function have to have a category? Or perhaps the question is, for these solutions, what category of application do they hurt? Perhaps we can look at the solutions as general, but consider their impact in various categories.
In Dave's code sample, spin() is loading the CPU regardless of requests. This may demonstrate how short IO-bound requests will behave while the server is processing a long Python-algorithmic CPU-intensive request, or an ongoing CPU-intensive task unrelated to incoming requests. However, is this typical for socket servers? If you change the sample server code to start a CPU-bound thread with work X for each incoming request, you will see different behavior.

There is still the question of latency - a single request which takes a long time to process will affect the latency of other "small" requests. However, it can be argued whether such a scenario is practical, or whether modifying the GIL is the solution.
Perhaps Dave already documented this effect in his visualizations, no?
Naturally. I did not imply otherwise. His analysis is brilliant.
If a change is still required, then I vote for the simpler approach - that of having a special macro for socket code.

What is that simpler approach? How would that special macro work?

The special macro for socket code is one of the alternatives proposed by Antoine above. However, thinking about it again, with this approach, as soon as the new incoming request tries to read a file, query the DB, decompress some data or do anything else which releases the GIL, it goes back to square one, no?

I remember there was reluctance in the past to repeat the OS scheduling functionality and for a good reason.
In what ways do we consider the CPython interpreter to be different from another application that has multiple threads and contention for one resource? Perhaps we have a unique problem among all other user space applications. Perhaps we don't.

I think a rule of thumb is to protect a resource (typically a data structure) as tightly as possible, avoiding for example locking across function calls, etc, if possible. In contrast, CPython is "locked" the entire time.

As for the behavior of the GIL, how are the proposed solutions repeating OS scheduling functionality?

Dave discussed this in his analysis.

Nir
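Nir's rule of thumb above — protect the shared structure as tightly as possible, never across whole calls — sketched in Python; the names here are illustrative, not from any real codebase:

```python
import threading

# Sketch of the "lock as tightly as possible" rule of thumb: hold the lock
# only around the shared data-structure mutation, never across the
# computation. Contrast with the GIL, which is held across everything.
lock = threading.Lock()
shared = []

def process(item):
    result = item * item      # compute outside the lock
    with lock:                # lock held only for the append
        shared.append(result)

threads = [threading.Thread(target=process, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(shared))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

In CPython the compute step still serializes on the GIL, of course; the sketch only illustrates the locking discipline Nir is contrasting it with.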

Le Sun, 14 Mar 2010 23:11:44 +0200, Nir Aides <nir@winpdb.org> a écrit :
I first disabled the call to spin() but client running time remained around 30 seconds. I then added TCP_NODELAY and running time dropped to a few dozen milliseconds for the entire no-spin run.
You don't want the benchmark to be dependent on the TCP stack's implementation specifics. Therefore, you should use the UDP version of Dave Beazley's benchmark. I have posted it on http://bugs.python.org/issue7946, and included a variant of it in ccbench.
The special macro for socket code is one of the alternatives proposed by Antoine above.
However, thinking about it again, with this approach as soon as the new incoming request tries to read a file, query the DB, decompress some data or do anything which releases the GIL, it goes back to square one. no?
Indeed. This approach involves using the macro in every place where releasing the GIL is meant to cover OS- or IO-dependent latencies. Even then, though, there may be cases (such as doing a very small amount of zlib compression) where the background CPU-bound thread will get "too much time" compared to the "mostly IO" thread. But the other approach isn't perfect either. These are heuristics anyway.
I remember there was reluctance in the past to repeat the OS scheduling functionality and for a good reason.
I agree with this. Hence my posting on this list, about whether we should go with an additional amount of complexity to solve what may be considered an annoying performance problem. Regards Antoine.

On Sun, Mar 14, 2010 at 4:31 AM, Nir Aides <nir@winpdb.org> wrote:
There are two possible problems with Dave's benchmark: 1) On my system setting TCP_NODELAY option on the accepted server socket changes results dramatically. 2) What category of socket servers is dave's spin() function intended to simulate? In a server which involves CPU intensive work in response to a socket request the behavior may be significantly different. In such a system, high CPU load will significantly reduce socket responsiveness which in turn will reduce CPU load and increase socket responsiveness. Testing with a modified server that reflects the above indicates the new GIL behaves just fine in terms of throughput. So a change to the GIL may not be required at all. There is still the question of latency - a single request which takes long time to process will affect the latency of other "small" requests. However, it can be argued if such a scenario is practical, or if modifying the GIL is the solution.
Such a scenario is practical and is an area that we really should not fall flat on our face in. Python should not regress to performing worse when an application adds a cpu intensive thread that isn't tempered by the activity of its IO threads.

As for the argument that an application with cpu intensive work being driven by the IO itself will work itself out... No it won't. It can get into beat patterns where it is handling requests quite rapidly up until one that causes a long computation to start comes in. At that point it'll stop performing well on other requests for as long (it could be a significant amount of time) as the cpu intensive request threads are running. That is not a graceful degradation in serving capacity / latency as one would normally expect. It is a sudden drop off, followed by a sudden ramp up, and a sudden drop off and a sudden ramp up, and drop off... etc. Not pretty. Not predictable. Not easy to deploy and manage as a service.

If we don't fix this issue, we need to document as an official part of the Python threading module docs that people should not mix computation with IO in threaded applications when using CPython if they care about IO performance. This will blow people's minds.
If a change is still required, then I vote for the simpler approach - that of having a special macro for socket code. I remember there was reluctance in the past to repeat the OS scheduling functionality and for a good reason.
I don't consider needing to modify code specifically for it to indicate if it is IO bound or CPU bound to be the simpler approach. So +1 on the interactiveness calculation for me. +0.5 on annotating the IO lock release/acquisitions.

Basically: I want to see this fixed. Even if it's not my preferred approach, I still think it is important.

-gps

As for the argument that an application with cpu intensive work being driven by the IO itself will work itself out... No it won't, it can get into beat patterns where it is handling requests quite rapidly up until one that causes a long computation to start comes in. At that point it'll stop performing well on other requests for as long (it could be a significant amount of time) as the cpu intensive request threads are running. That is not a graceful degration in serving capacity / latency as one would normally expect. It is a sudden drop off.
Why do you say that? The other threads continue to be served - and Python couldn't use more than one CPU, anyway. Can you demonstrate that in an example? Regards, Martin

On 15Mar2010 09:28, Martin v. Löwis <martin@v.loewis.de> wrote:
| > As for the argument that an application with cpu intensive work being
| > driven by the IO itself will work itself out... No it won't, it can
| > get into beat patterns where it is handling requests quite rapidly up
| > until one that causes a long computation to start comes in. At that
| > point it'll stop performing well on other requests for as long (it
| > could be a significant amount of time) as the cpu intensive request
| > threads are running. That is not a graceful degradation in serving
| > capacity / latency as one would normally expect. It is a sudden drop
| > off.
|
| Why do you say that? The other threads continue to be served - and
| Python couldn't use more than one CPU, anyway. Can you demonstrate that
| in an example?

Real example: I have a FuncMultiQueue class which manages a pool of worker threads. Other tasks can make requests of it by passing a callable (usually obtained via partial()) to its .call(), .bgcall() or .qbgcall() methods depending on how they want to collect the result of the callable.

The idea here is that one has a few threads receiving requests (eg a daemon watching a socket or monitoring a db queue table) which then use the FuncMultiQueue to manage how many actual requests are processed in parallel (yes, a semaphore can cover a lot of this, but not the asynchronous call modes).

So, suppose a couple of CPU-intensive callables get queued which work for a substantial time, and meanwhile a bunch of tiny tiny cheap requests arrive. Their timely response will be impacted by this issue.

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

hybrid rather than pure; compromising rather than clean; distorted rather than straightforward; ambiguous rather than articulated; both-and rather than either-or; the difficult unity of inclusion rather than the easy unity of exclusion. - Paul Barton-Davis <pauld@cs.washington.edu>
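A stripped-down sketch of the kind of worker pool Cameron describes. The real FuncMultiQueue is his; this toy version only illustrates the synchronous/asynchronous split, the method semantics are assumptions, and there is no error handling:

```python
import queue
import threading
from functools import partial

# Toy worker pool in the spirit of the FuncMultiQueue described above;
# names and semantics here are illustrative assumptions, not the real class.
class FuncQueue:
    def __init__(self, workers=4):
        self._q = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            func, box, done = self._q.get()
            box.append(func())   # run the submitted callable
            done.set()           # signal completion to the caller
            self._q.task_done()

    def call(self, func):
        # Synchronous flavour: block until the callable has run.
        box, done = [], threading.Event()
        self._q.put((func, box, done))
        done.wait()
        return box[0]

    def bgcall(self, func):
        # Asynchronous flavour: return an Event the caller may wait on.
        box, done = [], threading.Event()
        self._q.put((func, box, done))
        return done

fq = FuncQueue(workers=2)
print(fq.call(partial(pow, 2, 10)))  # → 1024
```

If one worker picks up a long CPU-bound callable, the cheap callables queued behind it feel exactly the GIL-induced latency under discussion.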

Cameron Simpson wrote:
On 15Mar2010 09:28, Martin v. Löwis <martin@v.loewis.de> wrote:
| > As for the argument that an application with cpu intensive work being
| > driven by the IO itself will work itself out... No it won't, it can
| > get into beat patterns where it is handling requests quite rapidly up
| > until one that causes a long computation to start comes in. At that
| > point it'll stop performing well on other requests for as long (it
| > could be a significant amount of time) as the cpu intensive request
| > threads are running. That is not a graceful degradation in serving
| > capacity / latency as one would normally expect. It is a sudden drop
| > off.
|
| Why do you say that? The other threads continue to be served - and
| Python couldn't use more than one CPU, anyway. Can you demonstrate that
| in an example?
Real example:
... unfortunately without a demonstration. What's the throughput under the old GIL? What's the throughput of this application under the new GIL? How can I observe the "beat pattern"?
The idea here is that one has a few threads receiving requests (eg a daemon watching a socket or monitoring a db queue table) which then use the FuncMultiQueue to manage how many actual requests are processed in parallel (yes, a semaphore can cover a lot of this, but not the asynchronous call modes).
Why do you say processing is in parallel? In Python, processing is normally never in parallel, but always sequential (potentially interleaving). Are you releasing the GIL for processing?
So, suppose a couple of CPU-intensive callables get queued which work for a substantial time, and meanwhile a bunch of tiny tiny cheap requests arrive. Their timely response will be impacted by this issue.
By how much exactly? What operating system? Regards, Martin

On 16Mar2010 08:59, "Martin v. Löwis" <martin@v.loewis.de> wrote:
| Cameron Simpson wrote:
| > [... beat-pattern scenario quoted above ...]
| > Real example:
|
| ... unfortunately without a demonstration. What's the throughput under
| the old GIL? What's the throughput of this application under the new
| GIL? How can I observe the "beat pattern"?

You can't. This isn't an example where I am personally bitten by the GIL. I may be, but there's plenty of other stuff in terms of tuning or optimisation in my particular app before I get anywhere near the GIL. My point is not that it's biting me right now, but that earlier you said the scenario was probably unrealistic, and I have an app which can probably exhibit exactly the issue under discussion.

| > The idea here is that one has a few threads receiving requests (eg a
| > daemon watching a socket or monitoring a db queue table) which then use
| > the FuncMultiQueue to manage how many actual requests are processed
| > in parallel (yes, a semaphore can cover a lot of this, but not the
| > asynchronous call modes).
|
| Why do you say processing is in parallel? In Python, processing is
| normally never in parallel, but always sequential (potentially
| interleaving). Are you releasing the GIL for processing?

I mean that I can have multiple actual requests being served, and they are _not_ handled sequentially - the multi queue is to permit more than one to be in progress at once, so that an expensive request need not inherently delay a following cheap request. And in my it's-a-filesystem mode there will often be a bottleneck in the pure-python data scanning part, and thus CPU-bound Python active while still accepting requests in I/O-blocked threads.

| > So, suppose a couple of CPU-intensive callables get queued which work for a
| > substantial time, and meanwhile a bunch of tiny tiny cheap requests arrive.
| > Their timely response will be impacted by this issue.
|
| By how much exactly? What operating system?

Can't tell you - I'm not claiming the current GIL behaviour is biting me right now; I'm saying your claim that the scenario is unrealistic is a bit unfair; my app will probably be just the kind of app to be affected, even if only subtly. And it can hardly be alone in the "daemon" world.

Cheers,
-- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
There's two kinds of climbers... smart ones, and dead ones. - Don Whillans

You can't. This isn't an example where I am personally bitten by the GIL. I may be, but there's plenty of other stuff in terms of tuning or optimisation in my particular app before I get anywhere near the GIL.
My point is not that it's biting me right now but that earlier you said the scenario was probably unrealistic, but I have an app which can probably exhibit exactly the issue under discussion.
I don't understand. First you say that there is plenty of other stuff ... before you get anywhere near the GIL, and then you say that your app can probably exhibit the issue under discussion. Which one is it? I can easily believe that there are tons of other things that make the GIL contention irrelevant; I cannot believe that it will exhibit the issue under discussion.
| > The idea here is that one has a few threads receiving requests (eg a
| > daemon watching a socket or monitoring a db queue table) which then use
| > the FuncMultiQueue to manage how many actual requests are processed
| > in parallel (yes, a semaphore can cover a lot of this, but not the
| > asynchronous call modes).
|
| Why do you say processing is in parallel? In Python, processing is
| normally never in parallel, but always sequential (potentially
| interleaving). Are you releasing the GIL for processing?
I mean that I can have multiple actual requests being served, and they are _not_ handled sequentially - the multi queue is to permit more than one to be in progress at once so that an expensive request need not inherently delay a following cheap request.
But, in Python, due to the GIL, *everything* is sequential, *nothing* is in parallel.
Can't tell you - I'm not claiming the current GIL behaviour is biting me right now; I'm saying your claim that the scenario is unrealistic is a bit unfair; my app will probably be just the kind of app to be affected, even if only subtly. And it can hardly be alone in the "daemon" world.
Unfortunately, you are not able to demonstrate this (because the effect would only be subtle). My claim is that the scenario where the effect is *visible* is unrealistic, i.e. that, in realistic scenarios, tons of other effects actually lessen the problem. Regards, Martin

Martin v. Löwis wrote:
Cameron Simpson wrote:
The idea here is that one has a few threads receiving requests (eg a daemon watching a socket or monitoring a db queue table) which then use the FuncMultiQueue to manage how many actual requests are processed in parallel (yes, a semaphore can cover a lot of this, but not the asynchronous call modes).
Why do you say processing is in parallel? In Python, processing is normally never in parallel, but always sequential (potentially interleaving). Are you releasing the GIL for processing?
We know the GIL is timeslicing - that's the whole point. The real question is *how often* it is time slicing and *which threads* are getting to run.

The key issue is that the current new GIL implementation means that, in the presence of a CPU bound GIL-holding thread, the *minimum* duration of *any* C call that releases the GIL becomes sys.getswitchinterval(). The CPU bound thread *always* wants the GIL, and once it gets it, it won't give it back until the switch interval expires. This is in contrast to the I/O bound (or non-GIL-holding CPU bound) threads, which nearly always yield the GIL early, since they're waiting for something else to happen (be it I/O or a native calculation).

Antoine's interactiveness patch resolves this by rewarding threads that regularly release the GIL early: the interpreter quickly recognises them as doing so, and they are subsequently given priority over the threads that only release the GIL when the switch interval expires. When threads that are flagged as interactive ask for the GIL back, the interpreter gives it to them immediately instead of forcing them to wait until the switch interval expires (because it trusts those threads not to hang onto the GIL for too long).

I like this significantly better than the explicit priority patch, because it means threads are given priority based on their behaviour (i.e. frequently holding the GIL for less than the switch interval) rather than relying on developers to correctly decide between using a normal GIL request and a priority request.

Handling of thread pools that mix I/O bound and CPU bound GIL-holding tasks isn't quite optimal with this approach (since the nature of any given thread will change over time), but it isn't terrible either - the interpreter doesn't need much time to reclassify a thread as interactive or non-interactive (and there are a couple of parameters that can be tuned to control how quickly these changes in status can happen).

Regards, Nick.
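[Editorial note: a toy illustration of the kind of heuristic Nick describes. The constants and the moving-average scheme here are invented for the sketch and are not taken from Antoine's patch.]

```python
SWITCH_INTERVAL = 0.005    # 5 ms, the default switching interval
INTERACTIVE_CUTOFF = 0.5   # "interactive" if avg hold < 50% of the interval
ALPHA = 0.3                # smoothing factor (one of the tunable parameters)

class ThreadStats:
    """Per-thread record of how long the thread tends to hold the GIL."""

    def __init__(self):
        # Assume non-interactive until the thread proves otherwise.
        self.avg_hold = SWITCH_INTERVAL

    def note_release(self, held_for):
        # Exponential moving average of GIL hold durations.
        self.avg_hold = ALPHA * held_for + (1 - ALPHA) * self.avg_hold

    @property
    def interactive(self):
        return self.avg_hold < INTERACTIVE_CUTOFF * SWITCH_INTERVAL
```

An I/O thread releasing after ~0.1 ms is quickly flagged interactive, while a CPU-bound thread that always exhausts the interval never is; as Nick notes, the smoothing factor controls how fast a thread is reclassified when its behaviour changes.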
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Hello all,

I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt

I submitted bfs.patch at http://bugs.python.org/issue7946. It is work in progress but is ready for some opinion.

On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, predictably, and with a lower rate of context switching. Its basic design is very simple, but it was nevertheless designed by an expert in this field - two characteristics which combine to make it attractive for this case.

The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL - a lock implemented as a simple semaphore - was the wrong solution.

The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios.

With Florent's write test (http://bugs.python.org/issue7946#msg101120):
- 2 bg threads, 2 cores set to performance, karmic, PyCon patch: context switching shoots up to 200K/s.
- 2 bg threads, 1 core set to on-demand, karmic, idle machine, gilinter patch: starves one of the bg threads.
- 4 bg threads, 4x1 core xeon, centos 5.3, gilinter patch: all bg threads starved, context switching shoots up to 250K/s.

With the UDP test (http://bugs.python.org/file16316/udp-iotest.py), adding zlib.compress(b'GIL') to the workload: both the gilinter and PyCon patches starve the IO thread.

The BFS patch currently involves more overhead, since it reads the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.

Any thoughts?

Nir

On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite a poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound threads":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.
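[Editorial note: the scenario Antoine describes can be approximated in pure Python with a sketch in the spirit of Dave Beazley's demonstration. Exact numbers vary by machine and interpreter version; this only shows the measurement idea.]

```python
import socket
import threading
import time

def cpu_bound(stop):
    # Pure-Python busy loop: holds the GIL except at switch points.
    n = 0
    while not stop.is_set():
        n += 1

def measure_io_latency(rounds=100):
    # Each recv() releases the GIL; with a CPU-bound thread running,
    # the I/O thread may wait up to the switch interval to get it back.
    a, b = socket.socketpair()
    start = time.time()
    for _ in range(rounds):
        a.send(b"x")
        b.recv(1)
    elapsed = time.time() - start
    a.close(); b.close()
    return elapsed / rounds

baseline = measure_io_latency()
stop = threading.Event()
t = threading.Thread(target=cpu_bound, args=(stop,))
t.start()
contended = measure_io_latency()
stop.set(); t.join()
print("per-recv latency: %.6fs alone, %.6fs with CPU thread"
      % (baseline, contended))
```

On an interpreter exhibiting the convoy effect, the contended figure climbs toward the switching interval, which is what caps the packets-per-second rate Antoine mentions.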

Nir, Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met. You might want to look at the following for a discussion (not sure how good it is, as I just googled it quickly) of how to implement POSIX semantics on Windows: http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
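[Editorial note: the pattern Peter describes - recheck the predicate in a loop around the wait - can be illustrated with Python's threading.Condition. This is a generic sketch, not code from the patch.]

```python
import threading

class Gate:
    """A condition wait can return spuriously or after a stale notify,
    so the predicate must be rechecked in a loop while holding the lock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._open = False

    def open(self):
        with self._cond:
            self._open = True
            self._cond.notify_all()

    def wait_open(self):
        with self._cond:
            # Correct: loop until the predicate is actually true.
            # (Wrong would be: `if not self._open: self._cond.wait()`,
            # which can fall through on a spurious wake-up.)
            while not self._open:
                self._cond.wait()
```

The same discipline applies to pthread_cond_wait()/pthread_cond_timedwait() in C, which is exactly Peter's objection to the single unlooped COND_WAIT() in bfs_yield().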
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound. -peter On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:

Hi Peter,

There is no need for a loop in bfs_yield().

On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:

Hmm, so I see in bfs_yield():

+ if (tstate != NULL && bfs_thread_switch == tstate) {
+     COND_RESET(tstate->bfs_cond);
+     COND_WAIT(tstate->bfs_cond, bfs_mutex);
+ }

So, to me, the above code says, "Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate". So the predicate is: "(tstate != NULL && bfs_thread_switch == tstate)".

If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don't need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake-ups, but you can't tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.

On Windows, before this loop in bfs_schedule():

+ COND_RESET(tstate->bfs_cond);
+ while (bfs_thread != tstate) {
+     _bfs_timed_wait(tstate, timestamp);
+     timestamp = get_timestamp();
+ }

You might want to avoid the call to reset the condition variable if the predicate is already False.

-peter

On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:

And why the for(;;) loop in bfs_schedule()? I don't see a code path that would loop there. Perhaps I am missing it...

-peter

On 4/12/10 8:37 AM, "Peter Portante" <peter.a.portante@gmail.com> wrote:
Hmm, so I see in bfs_yield():
+ if (tstate != NULL && bfs_thread_switch == tstate) { + COND_RESET(tstate->bfs_cond); + COND_WAIT(tstate->bfs_cond, bfs_mutex); + }
So, to me, the above code says, ³Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate². So the predicate is: ³(tstate != NULL && bfs_thread_switch == tstate)².
If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don¹t need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake ups, but you can¹t tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.
On Windows, before this loop in bfs_schedule():
+ COND_RESET(tstate->bfs_cond);
+ while (bfs_thread != tstate) {
+     _bfs_timed_wait(tstate, timestamp);
+     timestamp = get_timestamp();
+ }
You might want to avoid the call to reset the condition variable if the predicate is already False.
-peter
On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:
Hi Peter,
There is no need for a loop in bfs_yield().
On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:
Nir,
Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met.
You might want to look at the following for a discussion (not sure how good it is, as I just google'd it quickly) of how to implement POSIX semantics on Windows:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound.
-peter
On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:
Hello all,
I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt
I submitted bfs.patch at http://bugs.python.org/issue7946 . It is a work in progress but is ready for some opinions.
On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, predictably, and with a lower rate of context switching. Its basic design is very simple but it was nevertheless designed by an expert in this field, two characteristics which combine to make it attractive for this case.
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL, a lock implemented as a simple semaphore, was the wrong solution.
The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios:
With Florent's write test ( http://bugs.python.org/issue7946#msg101120 ): 2 bg threads, 2 cores set to performance, karmic, PyCon patch, context switching shoots up to 200K/s. 2 bg threads, 1 core set to on-demand, karmic, idle machine, gilinter patch starves one of the bg threads. 4 bg threads, 4x1 core xeon, centos 5.3, gilinter patch, all bg threads starved, context switching shoots up to 250K/s.
With the UDP test ( http://bugs.python.org/file16316/udp-iotest.py ), add zlib.compress(b'GIL') to the workload: both the gilinter and PyCon patches starve the IO thread.
The BFS patch currently involves more overhead by reading the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.
Any thoughts?
Nir
On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound thread":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.

At some point there was a loop; later it remained since I feel it is more readable than a bunch of nested if-else clauses. I should probably replace it with do {...} while(0). I was conditioned with electrical shocks in the dungeons of a corporate to always use for loops. I uploaded the patch to Rietveld for code review comments: http://codereview.appspot.com/857049 Nir On Mon, Apr 12, 2010 at 3:48 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
And why the for(;;) loop in bfs_schedule()? I don’t see a code path that would loop there. Perhaps I am missing it ...
-peter
On 4/12/10 8:37 AM, "Peter Portante" <peter.a.portante@gmail.com> wrote:
Hmm, so I see in bfs_yield():
+ if (tstate != NULL && bfs_thread_switch == tstate) {
+     COND_RESET(tstate->bfs_cond);
+     COND_WAIT(tstate->bfs_cond, bfs_mutex);
+ }
So, to me, the above code says, “Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate”. So the predicate is: “(tstate != NULL && bfs_thread_switch == tstate)”.
If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don’t need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake ups, but you can’t tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.
On Windows, before this loop in bfs_schedule():
+ COND_RESET(tstate->bfs_cond);
+ while (bfs_thread != tstate) {
+     _bfs_timed_wait(tstate, timestamp);
+     timestamp = get_timestamp();
+ }
You might want to avoid the call to reset the condition variable if the predicate is already False.
-peter
On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:
Hi Peter,
There is no need for a loop in bfs_yield().
On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:
Nir,
Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met.
You might want to look at the following for a discussion (not sure how good it is, as I just google’d it quickly) of how to implement POSIX semantics on Windows:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound.
-peter
On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:
Hello all,
I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt
I submitted bfs.patch at http://bugs.python.org/issue7946 . It is a work in progress but is ready for some opinions.
On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, predictably, and with a lower rate of context switching. Its basic design is very simple but it was nevertheless designed by an expert in this field, two characteristics which combine to make it attractive for this case.
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL, a lock implemented as a simple semaphore, was the wrong solution.
The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios:
With Florent's write test ( http://bugs.python.org/issue7946#msg101120 ): 2 bg threads, 2 cores set to performance, karmic, PyCon patch, context switching shoots up to 200K/s. 2 bg threads, 1 core set to on-demand, karmic, idle machine, gilinter patch starves one of the bg threads. 4 bg threads, 4x1 core xeon, centos 5.3, gilinter patch, all bg threads starved, context switching shoots up to 250K/s.
With the UDP test ( http://bugs.python.org/file16316/udp-iotest.py ), add zlib.compress(b'GIL') to the workload: both the gilinter and PyCon patches starve the IO thread.
The BFS patch currently involves more overhead by reading the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.
Any thoughts?
Nir
On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound thread":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.

The loop-less wait is similar to the one in the new GIL. It is used to force a switch to the next thread in a particular scenario, and the motivation is explained in a comment on another if clause a few lines up. Those two if clauses can be joined, though. On Mon, Apr 12, 2010 at 3:37 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
Hmm, so I see in bfs_yield():
+ if (tstate != NULL && bfs_thread_switch == tstate) {
+     COND_RESET(tstate->bfs_cond);
+     COND_WAIT(tstate->bfs_cond, bfs_mutex);
+ }
So, to me, the above code says, “Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate”. So the predicate is: “(tstate != NULL && bfs_thread_switch == tstate)”.
If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don’t need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake ups, but you can’t tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.
On Windows, before this loop in bfs_schedule():
+ COND_RESET(tstate->bfs_cond);
+ while (bfs_thread != tstate) {
+     _bfs_timed_wait(tstate, timestamp);
+     timestamp = get_timestamp();
+ }
You might want to avoid the call to reset the condition variable if the predicate is already False.
-peter
On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:
Hi Peter,
There is no need for a loop in bfs_yield().
On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:
Nir,
Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met.
You might want to look at the following for a discussion (not sure how good it is, as I just google’d it quickly) of how to implement POSIX semantics on Windows:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound.
-peter
On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:
Hello all,
I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt
I submitted bfs.patch at http://bugs.python.org/issue7946 . It is a work in progress but is ready for some opinions.
On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, predictably, and with a lower rate of context switching. Its basic design is very simple but it was nevertheless designed by an expert in this field, two characteristics which combine to make it attractive for this case.
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL, a lock implemented as a simple semaphore, was the wrong solution.
The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios:
With Florent's write test ( http://bugs.python.org/issue7946#msg101120 ): 2 bg threads, 2 cores set to performance, karmic, PyCon patch, context switching shoots up to 200K/s. 2 bg threads, 1 core set to on-demand, karmic, idle machine, gilinter patch starves one of the bg threads. 4 bg threads, 4x1 core xeon, centos 5.3, gilinter patch, all bg threads starved, context switching shoots up to 250K/s.
With the UDP test ( http://bugs.python.org/file16316/udp-iotest.py ), add zlib.compress(b'GIL') to the workload: both the gilinter and PyCon patches starve the IO thread.
The BFS patch currently involves more overhead by reading the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.
Any thoughts?
Nir
On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound thread":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.

Yes, but unless you loop until the predicate is False, no forced switch is guaranteed to occur. You might as well remove the code. If you want to keep the code as is, call me when you need a lifeguard to help debug mystifying behaviors. ;) -peter On 4/12/10 3:17 PM, "Nir Aides" <nir@winpdb.org> wrote:
The loop-less wait is similar to the one in the new GIL. It is used to force a switch to the next thread in a particular scenario, and the motivation is explained in a comment on another if clause a few lines up. Those two if clauses can be joined, though.
On Mon, Apr 12, 2010 at 3:37 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
Hmm, so I see in bfs_yield():
+ if (tstate != NULL && bfs_thread_switch == tstate) {
+     COND_RESET(tstate->bfs_cond);
+     COND_WAIT(tstate->bfs_cond, bfs_mutex);
+ }
So, to me, the above code says, "Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate". So the predicate is: "(tstate != NULL && bfs_thread_switch == tstate)".
If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don't need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake ups, but you can't tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.
On Windows, before this loop in bfs_schedule():
+ COND_RESET(tstate->bfs_cond);
+ while (bfs_thread != tstate) {
+     _bfs_timed_wait(tstate, timestamp);
+     timestamp = get_timestamp();
+ }
You might want to avoid the call to reset the condition variable if the predicate is already False.
-peter
On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:
Hi Peter,
There is no need for a loop in bfs_yield().
On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:
Nir,
Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met.
You might want to look at the following for a discussion (not sure how good it is, as I just google'd it quickly) of how to implement POSIX semantics on Windows:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound.
-peter
On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:
Hello all,
I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt
I submitted bfs.patch at http://bugs.python.org/issue7946 . It is a work in progress but is ready for some opinions.
On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, predictably, and with a lower rate of context switching. Its basic design is very simple but it was nevertheless designed by an expert in this field, two characteristics which combine to make it attractive for this case.
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL, a lock implemented as a simple semaphore, was the wrong solution.
The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios:
With Florent's write test ( http://bugs.python.org/issue7946#msg101120 ): 2 bg threads, 2 cores set to performance, karmic, PyCon patch, context switching shoots up to 200K/s. 2 bg threads, 1 core set to on-demand, karmic, idle machine, gilinter patch starves one of the bg threads. 4 bg threads, 4x1 core xeon, centos 5.3, gilinter patch, all bg threads starved, context switching shoots up to 250K/s.
With the UDP test ( http://bugs.python.org/file16316/udp-iotest.py ), add zlib.compress(b'GIL') to the workload: both the gilinter and PyCon patches starve the IO thread.
The BFS patch currently involves more overhead by reading the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.
Any thoughts?
Nir
On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation is when an IO-bound thread wants to process a lot of incoming packets, while one (or several) CPU-bound thread is also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (default setting), which limits the number of individual packets which can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound thread":
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.

Please describe clearly a step by step scenario in which that code will fail. On Mon, Apr 12, 2010 at 10:25 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
Yes, but unless you loop until the predicate is False, no forced switch is guaranteed to occur. You might as well remove the code. If you want to keep the code as is, call me when you need a life guard to help debug mystifying behaviors. ;) -peter
On 4/12/10 3:17 PM, "Nir Aides" <nir@winpdb.org> wrote:
The loop-less wait is similar to the one in new GIL. It is used to force a switch to next thread in particular scenario and the motivation is explained in comment to another if clause a few lines up. Those two if clauses can be joined though.
On Mon, Apr 12, 2010 at 3:37 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
Hmm, so I see in bfs_yield():
+ if (tstate != NULL && bfs_thread_switch == tstate) {
+     COND_RESET(tstate->bfs_cond);
+     COND_WAIT(tstate->bfs_cond, bfs_mutex);
+ }
So, to me, the above code says, “Wait for the condition that tstate is either NULL, or bfs_thread_switch does not equal tstate”. So the predicate is: “(tstate != NULL && bfs_thread_switch == tstate)”.
If the predicate is True before you call COND_WAIT() and True after you call COND_WAIT(), either you don’t need to call COND_WAIT() at all, or you need to loop until the predicate is False. There is no guarantee that a condition wait actually did anything at all. Yes, there will be spurious wake ups, but you can’t tell if the thread actually blocked and then woke up, or never blocked at all. If it never actually blocks, then that code path is not helpful.
On Windows, before this loop in bfs_schedule():
+    COND_RESET(tstate->bfs_cond);
+    while (bfs_thread != tstate) {
+        _bfs_timed_wait(tstate, timestamp);
+        timestamp = get_timestamp();
+    }
You might want to avoid the call to reset the condition variable if the predicate is already False.
-peter
On 4/12/10 8:12 AM, "Nir Aides" <nir@winpdb.org> wrote:
Hi Peter,
There is no need for a loop in bfs_yield().
On Mon, Apr 12, 2010 at 4:26 AM, Peter Portante <peter.a.portante@gmail.com> wrote:
Nir,
Per the POSIX standard, both pthread_cond_wait() and pthread_cond_timedwait() need to be performed in a loop. See the fourth paragraph of the description from:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html
For the Windows side, I think you have a similar problem. Condition variables are signaling mechanisms, and so they have a separate boolean predicate associated with them. If you release the mutex that protects the predicate, then after you reacquire the mutex, you have to reevaluate the predicate to ensure that the condition has actually been met.
You might want to look at the following for a discussion (not sure how good it is, as I just google’d it quickly) of how to implement POSIX semantics on Windows:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Before you can evaluate the effectiveness of any of the proposed scheduling schemes, the fundamental uses of mutexes and condition variables, and their implementations, must be sound.
-peter
On 4/11/10 6:50 PM, "Nir Aides" <nir@winpdb.org> wrote:
Hello all,
I would like to kick this discussion back to life with a simplified implementation of the BFS scheduler, designed by the Linux kernel hacker Con Kolivas: http://ck.kolivas.org/patches/bfs/sched-BFS.txt
I submitted bfs.patch at http://bugs.python.org/issue7946. It is a work in progress but is ready for some opinions.
On my machine BFS gives comparable performance to gilinter, and seems to schedule threads more fairly, more predictably, and with a lower rate of context switching. Its basic design is very simple, yet it was designed by an expert in this field; those two characteristics combine to make it attractive for this case.
The problem addressed by the GIL has always been *scheduling* threads to the interpreter, not just controlling access to it, and therefore the GIL, a lock implemented as a simple semaphore, was the wrong solution.
The patches by Antoine and David essentially evolve the GIL into a scheduler; however, both cause thread starvation or a high rate of context switching in some scenarios:
With Florent's write test ( http://bugs.python.org/issue7946#msg101120 ):
- 2 bg threads, 2 cores set to performance, karmic, PyCon patch: context switching shoots up to 200K/s.
- 2 bg threads, 1 core set to on-demand, karmic, idle machine: the gilinter patch starves one of the bg threads.
- 4 bg threads, 4x1 core Xeon, CentOS 5.3, gilinter patch: all bg threads starved, context switching shoots up to 250K/s.
With the UDP test ( http://bugs.python.org/file16316/udp-iotest.py ), add zlib.compress(b'GIL') to the workload:
both gilinter and PyCon patches starve the IO thread.
The BFS patch currently involves more overhead because it reads the time stamp on each yield and schedule operation. In addition, it still remains to address some issues related to timestamps, such as getting different time stamp readings on different cores on some (older) multi-core systems.
Any thoughts?
Nir
On Sun, Mar 14, 2010 at 12:46 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hello,
As some of you may know, Dave Beazley recently exhibited a situation where the new GIL shows quite a poor behaviour (the old GIL isn't very good either, but still a little better). This issue is followed in http://bugs.python.org/issue7946
This situation arises when an IO-bound thread wants to process a lot of incoming packets while one (or several) CPU-bound threads are also running. Each time the IO-bound thread releases the GIL, the CPU-bound thread gets it and keeps holding it for at least 5 milliseconds (the default setting), which limits the number of individual packets that can be recv()'ed and processed per second.
I have proposed two mechanisms, based on the same idea: IO-bound threads should be able to steal the GIL very quickly, rather than having to wait for the whole "thread switching interval" (again, 5 ms by default). They differ in how they detect an "IO-bound" thread:
- the first mechanism is actually the same mechanism which was embodied in the original new GIL patch before being removed. In this approach, IO methods (such as socket.read() in socketmodule.c) releasing the GIL must use a separate C macro when trying to get the GIL back again.
- the second mechanism dynamically computes the "interactiveness" of a thread and allows interactive threads to steal the GIL quickly. In this approach, IO methods don't have to be modified at all.
Both approaches show similar benchmark results (for the benchmarks that I know of) and basically fix the issue put forward by Dave Beazley.
Any thoughts?
Regards
Antoine.

That code will not switch to another thread if the condition variable either does not actually block the thread on the call (which is allowed by the standard to give implementations some flexibility for making it work correctly; read the standard's rationale for more information), or the thread is woken up without a predicate change (also allowed by the standard for similar reasons). Both of those cases are called "spurious wake-ups". You may not be able to readily get your implementation to behave that way, but in the wild we need to account for this behavior, because CPython will be run on systems where it will happen. :) -peter

On 4/12/10 3:36 PM, "Nir Aides" <nir@winpdb.org> wrote:

Please describe clearly a step by step scenario in which that code will fail.
There is no need for the "forced" switch to be guaranteed. The motivation for the wait is to increase throughput and decrease latency when the OS schedules the next thread to run on the same core as the yielding thread, but the yielding thread is about to continue running into CPU-intensive library code. An occasional spurious wakeup resulting in a missed switch will not affect that. Note the similar logic in the new GIL. I will add the loop to keep the code clean. Thanks, Nir

On Mon, Apr 12, 2010 at 10:49 PM, Peter Portante <peter.a.portante@gmail.com> wrote:
That code will not switch to another thread if the condition variable either does not actually block the thread on the call (which is allowed by the standard to give implementations some flexibility for making it work correctly – read the standard reasoning for more information), or the thread is woken up without a predicate change (also allowed by the standard for similar reasons). Both of those cases are called “spurious wake-ups”.
You may not be able to readily get your implementation to behavior that way, but in the wild, we need to account for this behavior because Cpython will be run on systems where it will happen. :)
-peter
participants (12)
- "Martin v. Löwis"
- Antoine Pitrou
- Benjamin Peterson
- Bill Janssen
- Brett Cannon
- Cameron Simpson
- Daniel Stutzbach
- Gregory P. Smith
- Nick Coghlan
- Nir Aides
- Peter Portante
- skip@pobox.com