Re: [Python-Dev] "Fixing" the new GIL

I have been following this discussion about fixing the GIL and just wanted to make a few comments. To the doubters who don't think this is a real problem worth fixing, I must respectfully disagree. Multicore systems and parallel programming issues are not going away any time in the foreseeable future. Moreover, there are plenty of programmers who are using Python to build applications involving concurrency, distributed computing, and compute-intensive processing.

While everyone agrees that it would be great if the GIL simply disappeared, the next best option is for the GIL to have well-defined and reasonably predictable behavior (i.e., behavior that can be easily explained and justified). Specifically, if multiple threads happen to be performing CPU-intensive work at the same time, it would be nice if they didn't thrash on multiple cores (the problem with the old GIL), and if I/O is mixed with CPU-bound processing, it would be nice if I/O requests didn't have their responsiveness and priority penalized (the problem with the new GIL).

So, just to be clear about my bug report: it is directly related to the problem of overlapping I/O requests with CPU-bound processing. This kind of scenario comes up in the context of many applications--especially those based on cooperating processes, multiprocessing, and message passing. Ironically, these are exactly the kinds of parallel programming techniques that people are going to use because of the GIL in the first place. So, why would you want the new GIL to behave poorly here?

In any case, if the GIL can be improved in a way that is simple and which either improves or doesn't negatively impact the performance of existing applications, why wouldn't you want to do it? Seems like a no-brainer.

Cheers, Dave

So, just to be clear about my bug report: it is directly related to the problem of overlapping I/O requests with CPU-bound processing. This kind of scenario comes up in the context of many applications--especially those based on cooperating processes, multiprocessing, and message passing.
How so? If you have cooperating processes, multiprocessing, and message passing, you *don't* have CPU-bound threads along with I/O-"bound" threads in the same process - you may not have threads at all!
In any case, if the GIL can be improved in a way that is simple and which either improves or doesn't negatively impact the performance of existing applications, why wouldn't you want to do it? Seems like a no-brainer.
Unfortunately, a simple improvement doesn't really seem to exist. Regards, Martin

How about attacking the original problem, then?

The reason they thrash on the pthreads implementation is that a pthreads mutex is assumed to be a short-held resource. Therefore it will be optimized in the following ways for multicore machines:

1) There is a certain amount of spinning done, to try to acquire it before blocking.

2) It will employ unfair tactics to avoid lock convoying, meaning that a thread coming in to acquire the mutex may get in before others that are queued. This is why "ticking" the GIL works so badly: the thread that releases the lock is usually the one that reacquires it, even though others may be waiting. See e.g. http://www.bluebytesoftware.com/blog/PermaLink,guid,e40c2675-43a3-410f-8f85-616ef7b031aa.aspx for a discussion of this (albeit on Windows).

On Windows, this isn't a problem. The reason is that the GIL on Windows is implemented using Event objects, which don't cut these corners. The Event provides you with a strict FIFO queue of threads waiting for the event.

If pthreads doesn't provide a synchronization primitive similar to that--one that doesn't thrash and has a true FIFO queue--it is possible to construct such a thing using condition variables and critical sections. Perhaps the POSIX semaphore API is more appropriate in this case.

By the way, this also shows another problem with (old) Python. There is only one core locking primitive, the PyThread_type_lock. It is being used both as a critical section in the traditional sense, and also as this sort-of-inverse lock that the GIL is. In the modern world, where the intended behaviour of these is quite different, there is no one-size-fits-all. On Windows in particular, the use of the Event-object-based lock is not ideal for uses other than the GIL.

In the new GIL, there appear to be several problems:

1) There is no FIFO queue of threads wanting the GIL, thus thread scheduling becomes non-deterministic.

2) The "ticking" of the GIL is now controlled by a condition variable timeout. There appears to be no way to prevent many such timeouts from being in progress at the same time, so you may have an unnecessarily high rate of ticking going on.

3) There isn't an immediate GIL request made when an I/O thread requests the GIL back, only after an initial timeout.

What we are trying to write here is a thread scheduler, and that is complex business.

K
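The FIFO primitive described above can indeed be built from a condition variable. Here is a minimal Python sketch of the idea (all names are illustrative; this is not code from CPython or from Windows):

    import threading
    from collections import deque

    class FIFOLock:
        """A strictly FIFO lock built on one condition variable."""
        def __init__(self):
            self._cond = threading.Condition()
            self._queue = deque()           # tickets, in arrival order
            self._held = False

        def acquire(self):
            with self._cond:
                ticket = object()           # unique marker for this waiter
                self._queue.append(ticket)
                # Proceed only when the lock is free AND we are at the head
                # of the queue, so a late arrival can never barge ahead.
                while self._held or self._queue[0] is not ticket:
                    self._cond.wait()
                self._queue.popleft()
                self._held = True

        def release(self):
            with self._cond:
                self._held = False
                # notify_all(), because the next owner is chosen by queue
                # position, not by whichever waiter the OS wakes first.
                self._cond.notify_all()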
-----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of David Beazley Sent: 15. mars 2010 03:07 To: python-dev@python.org Subject: Re: [Python-Dev] "Fixing" the new GIL

Python doesn't use a pthreads mutex for the GIL. It has always used a binary semaphore implemented with condition variables (or just a pthreads semaphore if available). The reason the performance is so bad is precisely due to the fact that it is using this implementation, and the fact that there *IS* a FIFO queue of threads (associated with the condition variable). The I/O performance problem with the new GIL gets much worse with many CPU-bound threads precisely because there is a FIFO queue involved. This has been covered in my past GIL presentations.

-Dave
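For reference, the binary-semaphore construction Dave describes looks roughly like this in Python (the real version is C inside CPython's thread support code; the class and method names here are illustrative only):

    import threading

    class BinarySemaphoreGIL:
        def __init__(self):
            self._cond = threading.Condition()
            self._locked = False

        def acquire(self):
            with self._cond:
                while self._locked:
                    self._cond.wait()   # waiters queue up on the condition
                self._locked = True

        def release(self):
            with self._cond:
                self._locked = False
                self._cond.notify()     # wake one waiter; which one runs
                                        # next is up to the wait queue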

You are right, I was assuming a mutex. (Why not, then, use a semaphore?) But the implementation isn't a simple binary semaphore. It suffers from case 2) that I explained before: greedy acquisition of the "semaphore" even if someone is waiting.

Also, I should add, a condition variable doesn't specify that it employs a FIFO. pthread_cond_signal() wakes up _a_ thread. If you want particular behaviour then you have to add code to deal with that. But it tries to be fair, and we can probably assume FIFO behaviour, so let's ignore that part. So, in the following, let's assume that the condition variable is fair and that it queues up the waiting threads in a FIFO manner.

Fixing 2) in this case is easy: make sure you wait if others are waiting. In Python pseudocode:

    # (gil is assumed to carry: gil.condition = threading.Condition(),
    #  gil.locked = False, gil.n_waiting = 0)

    def Acquire(gil):
        with gil.condition:
            if gil.locked or gil.n_waiting:
                gil.n_waiting += 1
                while gil.locked:
                    gil.condition.wait()
                gil.n_waiting -= 1
            gil.locked = True

    def Release(gil):
        with gil.condition:
            gil.locked = False
            if gil.n_waiting:
                gil.condition.notify()

This ought to fix problem 2) below. Multicore CPUs ought not to behave worse than single-core ones anymore. (Btw, the C code is erroneous; one should not call pthread_cond_signal without having the lock held, but that has been brought up before.)

Now, if you want to improve I/O performance by ensuring that I/O threads get woken up before (yielding) CPU threads, you can do something like this:

    from threading import Condition, RLock

    class GIL(object):
        def __init__(self, npri):
            self.lock = RLock()
            self.queues = [[Condition(self.lock), 0] for i in range(npri)]
            self.locked = False

        def Acquire(self, pri):
            """Acquire the lock with priority 'pri', where 0 is highest."""
            with self.lock:
                if self.locked or sum(q[1] for q in self.queues[:pri + 1]):
                    # locked, or someone with same or higher priority waiting
                    self.queues[pri][1] += 1
                    while self.locked:
                        self.queues[pri][0].wait()
                    self.queues[pri][1] -= 1
                self.locked = True

        def Release(self):
            with self.lock:
                self.locked = False
                for q in self.queues:
                    if q[1]:
                        q[0].notify()
                        break

        def Blip(self):
            """Voluntarily give up the lock, allowing higher-priority requests to run."""
            self.Release()
            self.Acquire(len(self.queues) - 1)

Typically you would have only two priority classes. I/O would then be done with:

    gil.Release()
    r = socket.recv(l)
    gil.Acquire(0)

But here we are on dangerous ground. Priority scheduling is full of pitfalls.

I should add here, maybe, that I am a strong proponent of strict FIFO scheduling with perhaps a few special cases. We have been using FIFO tasklet scheduling for all tasklets, including those that are returning with I/O, in EVE Online for three years now. Before, we had somewhat chaotic scheduling (immediate switching to tasklets) and behavior was very erratic. Switching to FIFO cured that and was very wholesome for the cluster in general. I have been toying with the idea of adding a special I/O tasklet priority class to Stackless for a while, but haven't done it yet. I am doubtful that it will on the whole change matters much.

Cheers, Kristján
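A hypothetical harness for the GIL class sketched above (not part of the original message) shows the intended call pattern; whether the I/O thread actually jumps ahead of the CPU thread depends on the condition-variable fairness assumed earlier:

    import threading, time

    gil = GIL(npri=2)   # two priority classes: 0 = I/O, 1 = CPU

    def cpu_worker():
        gil.Acquire(1)
        for _ in range(100):
            sum(range(1000))    # stand-in for executing bytecodes
            gil.Blip()          # periodic low-priority release point
        gil.Release()

    def io_worker():
        gil.Acquire(1)
        gil.Release()           # release around the "blocking" call
        time.sleep(0.01)        # stand-in for socket.recv()
        gil.Acquire(0)          # re-acquire at the highest priority
        gil.Release()

    threads = [threading.Thread(target=f) for f in (cpu_worker, io_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()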

Hello Dave,

The following documentation suggests ordering in Linux is not FIFO: http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwa...

"Threads waiting on mutexes and condition variables are selected to proceed in an order dependent upon the scheduling policy rather than in some fixed order (for example, FIFO or priority). Thus, the scheduling policy determines which thread(s) are awakened and allowed to proceed."

Here is the code: http://www.google.com/codesearch/p?hl=en#5ge3gHPB4K4/gnu/glibc/glibc-linuxthreads-2.1.1.tar.gz%7CaeB7Uqo7T9g/linuxthreads/queue.h&q=pthread_cond_timedwait&exact_package=http://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.1.1.tar.gz

If this is so, then it should affect the proposed fixes. For example, waiting with a timeout should be avoided, no?

Nir

Yes, having another thread wait with a timeout is not good: CPU cycles are wasted doing things that don't help the thread holding the GIL release it. Waiting threads should just note that they are present, and then the GIL-holding thread should be responsible for handling the rest.

-peter
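A minimal sketch of the protocol Peter describes (illustrative names only; this is not the actual C mechanism in ceval.c): waiters announce their presence once and block without a timeout, and the holder tests a cheap flag at its own checkpoints:

    import threading

    class HandoffGIL:
        def __init__(self):
            self._cond = threading.Condition()
            self._locked = False
            self._n_waiting = 0         # the "we are present" note

        def acquire(self):
            with self._cond:
                self._n_waiting += 1
                while self._locked:
                    self._cond.wait()   # no timeout: no periodic wakeups
                self._n_waiting -= 1
                self._locked = True

        def release(self):
            with self._cond:
                self._locked = False
                if self._n_waiting:
                    self._cond.notify()

        def drop_requested(self):
            # Cheap unlocked read for the holder's fast path; a stale
            # answer only delays the drop by one checkpoint.
            return self._n_waiting > 0

    def eval_checkpoint(gil):
        # Called by the holder, e.g. once per batch of bytecodes.
        if gil.drop_requested():
            gil.release()               # who runs next is then up to the
            gil.acquire()               # condition variable's wakeup policy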

The Linux kernel scheduler deals with two types of ratings and has a heuristic algorithm that rewards and penalizes on a rapid basis. To determine the next thread, it inspects the priority array to find the highest-priority task that is runnable and then selects the first task at that priority. This is a gross simplification of an extremely complex system, but the point is that it is not a simple queue.

To duplicate this exact type of scheduling in the interpreter would be a daunting task, but Dave's example illustrates that some system of reward and penalty can be beneficial.

Peter has brought up the point, both here and at the Open Space at PyCon, that before attempting a downsized version of a priority thread scheduler we should look at other programs that have had to deal with similar problems. Peter, since you have dealt with the nitty-gritty of concurrency before, can you suggest some applications that could serve as models for research?

I could also point you at EVE Online, a large-scale I/O machine based on Stackless Python. Our game proxies (front-end servers) handle 30,000 client connections each using tasklets. The problem is coarse-grained scheduling of I/O and other tasks, same as with threads in regular Python. We have found that adhering to FIFO scheduling provides good all-round performance; in particular, the variance of I/O latency is very low.

As I've mentioned in another message, I am toying with the idea of adding another priority layer to the Stackless scheduler for I/O tasks. It's not trivial though (it requires me to do some Stackless surgery) and the turnaround time to test new approaches in production for us is a bit high :)

K

I'm not sure I see any inherent problem in putting all I/O-bound threads in some kind of FIFO queue. However, there still should be some mechanism that gives those threads priority over a CPU-bound thread that never wants to give up the GIL.

A common OS solution is to simply have multiple FIFO queues, each with a different priority. Threads with low priority don't get to run unless the higher-priority queues are empty. One could probably try it with Python by simply having two queues (I/O and CPU) and separating threads into the two categories depending on whether they are forcefully timed out or not. An OS would tend to have many more levels, but maybe just two queues would be enough to solve the thread scheduling problems being discussed here.

Cheers, Dave
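A minimal sketch of that two-queue idea (an illustration under the assumptions stated in the message, not a real CPython patch): threads that were forcefully timed out re-acquire on the low-priority queue, everyone else on the high-priority one:

    import threading

    class TwoQueueGIL:
        def __init__(self):
            self._mutex = threading.Lock()
            self._io_q = threading.Condition(self._mutex)    # high priority
            self._cpu_q = threading.Condition(self._mutex)   # low priority
            self._io_waiting = 0
            self._cpu_waiting = 0
            self._locked = False

        def acquire(self, cpu_bound=False):
            with self._mutex:
                q = self._cpu_q if cpu_bound else self._io_q
                if cpu_bound:
                    self._cpu_waiting += 1
                else:
                    self._io_waiting += 1
                while self._locked:
                    q.wait()
                if cpu_bound:
                    self._cpu_waiting -= 1
                else:
                    self._io_waiting -= 1
                self._locked = True

        def release(self):
            with self._mutex:
                self._locked = False
                # Drain the I/O queue before letting any CPU thread back in.
                # (A fresh arrival can still barge past a woken waiter; a
                # real implementation would add an explicit hand-off flag.)
                if self._io_waiting:
                    self._io_q.notify()
                else:
                    self._cpu_q.notify()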

Hi,

I posted a small patch to the GIL which demonstrates the scheduling policy of pthreads condition variables. It seems that with pthreads a good policy is to allow (and help) the OS to manage scheduling of threads via the condition queue, without introducing scheduling logic. Some changes include removing the timeout from the new GIL wait, and allowing CPU-bound threads to run long enough to let the OS recognize them as such.

The patch uses a tick countdown-to-switch, but it should be changed to a time-based countdown. (If taking a look at the clock is expensive, maybe it can be done once every N ticks during the countdown.)

It remains to explore what can be done on other platforms.

Nir
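The time-based countdown with an amortized clock check could look something like this (the interval, tick count, and all names here are invented for illustration; they are not values from the patch):

    import time

    CHECK_TICKS = 100       # look at the clock only every 100 ticks
    INTERVAL = 0.005        # minimum slice a CPU-bound thread may keep

    class SwitchCountdown:
        def __init__(self):
            self._ticks = CHECK_TICKS
            self._deadline = time.monotonic() + INTERVAL

        def should_switch(self):
            # Fast path: a pure integer decrement, no clock access.
            self._ticks -= 1
            if self._ticks > 0:
                return False
            self._ticks = CHECK_TICKS
            # Slow path, every CHECK_TICKS bytecodes: consult the clock.
            now = time.monotonic()
            if now < self._deadline:
                return False
            self._deadline = now + INTERVAL
            return True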

Hi Kristján,
In the new GIL, there appear to be several problems: 1) There is no FIFO queue of threads wanting the GIL, thus thread scheduling becomes non-deterministic.
Thread scheduling has always been non-deterministic.
2) The "ticking" of the GIL is now controled by a condition variable timeout. There appears to be no way to prevent many such timeouts to be in progress at the same time, thus you may have an unnecessarily high rate of ticking going on.
Unless there's a bug, no, there isn't.
3) There isn't an immediate GIL request made when an I/O thread requests the GIL back, only after an initial timeout.
This is what we are looking to fix (perhaps).
What we are trying to write here is a thread scheduler, and that is complex business. K
I would have rephrased: "What we are trying to avoid here is a thread scheduler". regards Antoine.

-----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou Sent: 16. mars 2010 12:00 To: python-dev@python.org Subject: Re: [Python-Dev] "Fixing" the new GIL
Hi Kristján,
In the new GIL, there appear to be several problems: 1) There is no FIFO queue of threads wanting the GIL, thus thread scheduling becomes non-deterministic.
Thread scheduling has always been non-deterministic.

Yes, but not the scheduling of "which thread has the GIL", which is what we are talking about. With a GIL, you have neutered the non-deterministic OS scheduler and forced the system to _effectively_ schedule threads as you wish them to be scheduled. The key to this is implementing your GIL in such a way that you (and not the system) choose which thread runs next. On Windows it works in a nice, deterministic FIFO order because that's how the underlying Event object is supposed to work. Under pthreads, we rely on the behaviour of the condition variable, which _probably_ is well behaved in this manner too.
2) The "ticking" of the GIL is now controled by a condition variable timeout. There appears to be no way to prevent many such timeouts to be in progress at the same time, thus you may have an unnecessarily high rate of ticking going on.
Unless there's a bug, no, there isn't.
You're right, I had forgotten about "gil_switch_number". Never mind.
3) There isn't an immediate GIL request made when an I/O thread requests the GIL back, only after an initial timeout.
This is what we are looking to fix (perhaps).
What we are trying to write here is a thread scheduler, and that is complex business. K
I would have rephrased: "What we are trying to avoid here is a thread scheduler".
I can't argue with that, but I'm afraid that in the end, you have to. Python threads behave very much the same way as Stackless Python tasklets do: they own the CPU until they voluntarily give it up. You want to decide, then, which of the threads next gets its turn. You are working on a completely different level than what the OS usually uses for its scheduling (timeslices, I/O), so _you_ have to be in control.

By a happy coincidence, in the past on a single CPU, the old GIL did give us this without us being explicit about it: FIFO execution of threads. This fell out as a byproduct of the very simple way the GIL was implemented, using a single condition variable. Now that we have a different mechanism, _where threads time out_ on the condition, this behaviour no longer falls out of the implementation on its own. It may turn out, in this new, more complicated world, that you have to explicitly schedule the threads, by chaining them and giving each their own condition to wait on.

Btw, if you want to take up a private conversation with me, I'd like to point out to you some problems with the Windows version of the Condition variable :)

Cheers, Kristján
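The chaining idea Kristján floats, sketched in Python (illustrative only): give each waiter its own condition variable on a shared mutex and hand the lock over in strict arrival order, so the platform's wakeup policy cannot reorder anything:

    import threading
    from collections import deque

    class ChainedGIL:
        def __init__(self):
            self._mutex = threading.Lock()
            self._waiters = deque()     # [condition, granted] per waiter
            self._held = False

        def acquire(self):
            with self._mutex:
                if not self._held and not self._waiters:
                    self._held = True
                    return
                entry = [threading.Condition(self._mutex), False]
                self._waiters.append(entry)
                while not entry[1]:     # wait on our private condition
                    entry[0].wait()
                # release() granted us the lock directly; nothing to do.

        def release(self):
            with self._mutex:
                if self._waiters:
                    entry = self._waiters.popleft()
                    entry[1] = True     # direct hand-off: _held stays True
                    entry[0].notify()
                else:
                    self._held = False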

On Tue, 16 Mar 2010 14:10:33 +0000, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:
The key to this is implementing your GIL in such a way that you (and not the system) choose which thread runs next. On Windows it works in a nice, deterministic FIFO order because that's how the underlying Event object is supposed to work.
Well, I don't think this has ever been by design, and it's not obvious this is desirable either (see Dave Beazley's benchmark).
Btw, if you want to take up a private conversation with me, I'd like to point out to you some problems with the Windows version of the Condition variable :)
You can report a bug. cheers Antoine.

-----Original Message----- From: python-dev-bounces+kristjan=ccpgames.com@python.org [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Antoine Pitrou
The key to this is implementing your GIL in such a way that you (and not the system) choose which thread runs next. On Windows it works in a nice, deterministic FIFO order because that's how the underlying Event object is supposed to work.
Well, I don't think this has ever been by design, and it's not obvious this is desirable either (see Dave Beazley's benchmark).
Did Dave benchmark a straight FIFO system? I can tell you from experience, engineering a similar system (StacklessIO), that you _do_ want a deterministic scheduling algorithm. The most straightforward is round-robin (or FIFO), and really the only sane option if you know nothing about your threads. Upon this you can start to build, for example by adding another priority layer, but you have to have a solid foundation. We have never had high priority for I/O threads in Python (and its not-by-design round-robin scheduler on single-core CPUs), and it is unclear why that should be required now as a fix.
Btw, if you want to take up a private conversation with me, I'd like to point out to you some problems with the Windows version of the Condition variable :)
You can report a bug.

Or I could commit a fix. I just thought you might be interested, is all. I'll submit a patch.
K

Kristján Valur Jónsson wrote:
We have never had high priority for I/O threads in Python (and its not-by-design round-robin scheduler on single-core CPUs), and it is unclear why that should be required now as a fix.
David explained that in the issue tracker - 2.x typically doesn't do that much work per bytecode instruction, so the interpreter releases and reacquires the GIL far more frequently in a CPU-bound thread than it does under the new time-based check interval. The current settings mean we have less GIL overhead in the normal case, but worse worst-case I/O latency. There are two fairly straightforward ways to handle that:

- cut the check interval duration drastically (trading GIL overhead for I/O responsiveness, as in 2.x)

- add some form of I/O thread prioritisation, such as Antoine's gilinter patch (which automatically identifies and prioritises "interactive" threads at the cost of a dozen or so additional lines of C code and a small amount of overhead on GIL context switches to adjust thread interactivity counters; a sketch of the idea follows below)

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
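The second option might be sketched like so (this is not Antoine's actual gilinter patch; the threshold and bounds are invented for illustration). A thread's score rises when it gives up the GIL voluntarily, typically for I/O, and falls when it is switched out on a timeout:

    INTERACTIVE_THRESHOLD = 3   # illustrative tuning constant

    class ThreadInteractivity:
        def __init__(self):
            self.score = 0

        def voluntary_drop(self):
            # The thread released the GIL on its own (e.g. around I/O).
            self.score = min(self.score + 1, 2 * INTERACTIVE_THRESHOLD)

        def forced_switch(self):
            # The GIL was taken away on a timeout: looks CPU-bound.
            self.score = max(self.score - 1, 0)

        def is_interactive(self):
            return self.score >= INTERACTIVE_THRESHOLD

Plugged into a two-queue GIL like the one sketched earlier in the thread, an interactive thread would simply re-acquire on the high-priority queue: gil.acquire(cpu_bound=not state.is_interactive()).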

David explained that in the issue tracker - 2.x typically doesn't do that much work per bytecode instruction,
Oh, but that's wrong in general. Dave's *spinning loop* doesn't do much work per bytecode instruction, however ;)
The current settings mean we have less GIL overhead in the normal case, but worse worst-case I/O latency.
Actually, ccbench shows that worst case IO latency is much worse in 2.x (when executing bytecodes which do a lot of work, e.g. matching a regex). What happens though is that best case IO latency is better in 2.x (e.g. spinning loop, or short opcodes approaching the spinning loop case :-)). cheers Antoine.

-----Original Message----- From: Nick Coghlan [mailto:ncoghlan@gmail.com] Sent: 16. mars 2010 22:08 To: Kristján Valur Jónsson Cc: Antoine Pitrou; python-dev@python.org Subject: Re: [Python-Dev] "Fixing" the new GIL
Kristján Valur Jónsson wrote:
We have never had high priority for I/O threads in Python (and its not-by-design round-robin scheduler on single-core CPUs), and it is unclear why that should be required now as a fix.
David explained that in the issue tracker - 2.x typically doesn't do that much work per bytecode instruction, so the interpreter releases and reacquires the GIL far more frequently in a CPU-bound thread than it does under the new time-based check interval.
Okay, but my original post in this thread was "why don't we fix the original problem with the GIL as implemented in 2.x?" rather than trying to beat the new, more complicated GIL into submission. In fact, I probably should submit a patch to do that, since 2.x isn't disappearing quite yet.

Cheers, K

On Tue, 16 Mar 2010 17:02:13 +0000, Kristján Valur Jónsson wrote:
Well, I don't think this has ever been by design, and it's not obvious this is desirable either (see Dave Beazley's benchmark).
Did Dave benchmark a straight FIFO system?
Well, he presented benchmark figures with two threads, which is a FIFO system with the new GIL (because when a thread releases the GIL, it ensures that another thread takes it).
We have never had high priority for I/O threads in Python (and its not-by-design round-robin scheduler on single-core CPUs), and it is unclear why that should be required now as a fix.
I agree that Dave's situation doesn't seem new to me, although he often presents it as a deficiency of the new GIL only. Regards Antoine.
participants (8): Martin v. Löwis, Antoine Pitrou, David Beazley, Kristján Valur Jónsson, Nick Coghlan, Nir Aides, Peter Portante, Robert Hancock