[Python-Dev] "Fixing" the new GIL

Tue Mar 16 14:46:08 CET 2010

You are right, I was assuming a mutex. (why not then, use a semaphore?)
But the implementation isn't a simple binary semaphore.
It suffers from case 2) that I explained before:  Greedy aquisition of the "semaphore" even if someone is waiting.
Also, I should add, a condition variable doesn't specify that it employs a FIFO.  pythread_cond_signals() wakes up _a_ thread.  If you want particular behaviour then you have to add code to deal with that.  But it tries to be fair, and we can probably assum e FIFO behavioud, so let's ignore that part.  So, in the following, let's assume that the condition variable is fair, and that it queues up the waiting threads in a FIFO manner:

Fixing 2) in this case is easy: Make sure you wait if others are waiting.  In python pseudocode:
def Aquire(gil):
  with gil.condition:
    if gil.locked or gil.n_waiting:
      gil.n_waiting+=1
        while gil.locked:
          gil.condition.wait()
      gil.n_waiting-=1
    gil.locked = True

def Release(gil):
  with gil.condition:
    gil.locked=False
    if gil.n_waiting:
      gil.condition.notify()

This ought to fix problem 2) below.  Multicore CPUs ought to not behave worse than single core anymore.
(Btw, the C code is erroneous, one should not call pthread_cond_signal without having the lock held, but that has been brought up before.)

Now, if you want to improve IO performance, by ensuring that IO threads get woken up before (yielding) cpu threads, you can do something like this:
class GIL():
  def __init__(self, npri):
    self.lock=RLock()
    self.queues = [[Condition(self.lock), 0] for i in range(npri)]
    self.locked = False

  def Acquire(self, pri):
    """Acquire the lock with priority "pri", where 0 is highest priority"""
    with self.lock:
    if self.locked or sum([q[1] for q in self.queues[:pri+1]]) #locked or someone with same or higher priority waiting
      self.queues[pri][1]+=1
      while self.locked:
        self.queues[pri][0].wait()
      self.queues[pri][1]-=1
    self.locked = True

  def Release(self):
    with self.lock:
      self.locked = False
      for q in self.queues:
        if q[1]:
          q[0].notify()
          break

  def Blip(self):
    """volountailily give up the lock and allow higher priority requests to run"""
    self.Release()
    self.Acquire(len(self.queues)-1)

Typically you would have only two priority classes. IO would then be done with:
gil.Release()
r = socket.recv(l)
gil.Acquire(0)

But here we are on dangerous grounds. Priority scheduling is full of pitfalls.
I should add here maybe, that I am a strong proponent of strict FIFO scheduling with perhaps few special cases.  We have been using FIFO tasklet scheduling for all tasklets, including those that are returing with IO, in EVE Online for three years now.  Before, we had somewhat chaotic scheduling (immediate switching to tasklets) and behavior was very erratic.  Switching to FIFO cured that and was very wholesome for the cluster in general..  I have been toying with the idea of adding a special IO tasklet priority class to stackless for a while, but haven't done it yet.  I am doubtful that it will on the whole change matters much.

Cheers,

Kristján

> -----Original Message-----
> From: David Beazley [mailto:dave at dabeaz.com]
> Sent: 16. mars 2010 11:03
> To: Kristján Valur Jónsson
> Cc: David Beazley; python-dev at python.org
> Subject: Re: [Python-Dev] "Fixing" the new GIL
> 
> Python doesn't use a pthreads mutex for the GIL.    It has always used
> a binary semaphore implemented with condition variables (or just a
> pthreads semaphore if available).    The reason the performance is so
> bad is precisely due to the fact that it is using this implementation
> and the fact that there *IS* a FIFO queue of threads (associated with
> the condition variable).   The I/O performance problem with the new GIL
> is gets much worse with many CPU-bound threads precisely because there
> is a FIFO queue involved.   This has been covered in my past GIL
> presentations.
> 
> -Dave
> 
> 
> 
> On Mar 16, 2010, at 5:52 AM, Kristján Valur Jónsson wrote:
> 
> > How about attacking the original problem, then?
> >
> > The reason they thrash on pthreads implementation is that a pthreads
> mutex is assumed to be a short-held resource.  Therefore it will be
> optimized in the following ways for multicore machines:
> > 1) There is a certain amount of spinning done, to try to acquire it
> before blocking
> > 2) It will employ un-fair tactics to avoid lock-convoying, meaning
> that a thread coming in to acquire the mutex may get in before others
> that are queued.  This is why "ticking" the GIL works so badly:  The
> thread that releases the lock is usually the one that reaquires it even
> though others may be waiting.  See e.g.
> http://www.bluebytesoftware.com/blog/PermaLink,guid,e40c2675-43a3-410f-
> 8f85-616ef7b031aa.aspx for a discussion of this (albeit on windows.)
> >
> > On Windows, this isn't a problem.  The reason is, that the GIL on
> windows is implemented using Event objects that don't cut these
> corners.  The Event provides you with a strict FIFO queue of objects
> waiting for the event.
> >
> > If pthreads doesn't provide a synchronization primitive similar to
> that, someone that doesn't thrash and has a true FIFO queue, it is
> possible to construct such a thing using condition variables and
> critical sections.  Perhaps the posix semaphore api is more appropriate
> in this case.
> >
> > By the way, this also shows another problem with (old) python.  There
> is only one core locking primitive, the PyThread_type_lock.  It is
> being used both as a critical section in the traditional sense, and
> also as this sort-of-inverse lock that the GIL is.  In the modern
> world, where the intended behaviour of these is quite different, there
> is no one-size-fits all.  On windows in particular, the use of the
> Event object based lock is not ideal for other uses than the GIL.
> >
> >
> > In the new GIL, there appear to be several problems:
> > 1) There is no FIFO queue of threads wanting the queue, thus thread
> scheduling becomes non-deterministic
> > 2) The "ticking" of the GIL is now controled by a condition variable
> timeout.  There appears to be no way to prevent many such timeouts to
> be in progress at the same time, thus you may have an unnecessarily
> high rate of ticking going on.
> > 3) There isn't an immediate gil request made when an IO thread
> requests the gil back, only after an initial timeout.
> >
> > What we are trying to write here is a thread scheduler, and that is
> complex business.
> > K
> >
> >
> >
> >> -----Original Message-----
> >> From: python-dev-bounces+kristjan=ccpgames.com at python.org
> >> [mailto:python-dev-bounces+kristjan=ccpgames.com at python.org] On
> Behalf
> >> Of David Beazley
> >> Sent: 15. mars 2010 03:07
> >> To: python-dev at python.org
> >> Subject: Re: [Python-Dev] "Fixing" the new GIL
> >>
> >> happen to be performing CPU intensive work at the same time, it
> would
> >> be nice if they didn't thrash on multiple cores (the problem with
> the
> >> old GIL) and if I/O is
> >
>