[Python-Dev] deja-vu .. python locking
Martin Devera
devik at cdi.cz
Mon Sep 18 15:46:02 CEST 2006
Hello,
as someone wrote in the FAQ, every now and then somebody starts a thread about
finer-grained locking in Python.
Ok, here is one.
I don't want to start a flamewar. I only seek suggestions and constructive
criticism. I have some ideas which are (I believe) new in this context, and
I only want to make them public in case someone finds them interesting.
Comments are welcome.
Martin
------------
Round 1, Greg Stein's patch
The patch removes the GIL from version 1.6 and replaces the locking of
list, dict and other structures with finer-grained locking.
The major slowdown seems to be in the list and dict structures; dicts
are used for object attributes, and these are accessed quite often.
Because (IIUC) the mutex struct is quite heavy, dicts and lists are
locked via a pool of locks. When you lock such a pooled lock you
actually have to take two locks: one locks the pool itself, the other
locks the pooled lock (the second locking can be omitted in the
non-contended case because locks in the pool are kept in a locked state).
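To make the two-lock step concrete, here is a rough sketch of how I imagine the
pooled locking - the names (lock_pool, pool_guard, N_POOL_LOCKS) are mine, not the
patch's, and it leaves out the trick of keeping the pooled locks in a locked state:

#include <stdint.h>
#include <pthread.h>

#define N_POOL_LOCKS 64                     /* fixed-size pool shared by all dicts/lists */

static pthread_mutex_t pool_guard = PTHREAD_MUTEX_INITIALIZER;  /* guards the pool itself */
static pthread_mutex_t lock_pool[N_POOL_LOCKS];                 /* the pooled locks */

static void init_lock_pool(void)
{
    int i;
    for (i = 0; i < N_POOL_LOCKS; i++)
        pthread_mutex_init(&lock_pool[i], NULL);
}

/* Map an object to one of the pooled locks by hashing its address. */
static pthread_mutex_t *pooled_lock_for(void *ob)
{
    return &lock_pool[((uintptr_t)ob >> 4) % N_POOL_LOCKS];
}

static void pooled_lock(void *ob)
{
    pthread_mutex_t *m;
    pthread_mutex_lock(&pool_guard);        /* first lock: the pool itself */
    m = pooled_lock_for(ob);
    pthread_mutex_unlock(&pool_guard);
    pthread_mutex_lock(m);                  /* second lock: the pooled lock */
}

static void pooled_unlock(void *ob)
{
    pthread_mutex_unlock(pooled_lock_for(ob));
}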
One lock takes about 25 cycles on a UP P4 (mainly the pipeline flush
during the memory barrier) and can be even more expensive (hundreds
of cycles) due to cacheline movement between CPUs on an SMP machine.
The "global" pool lock is subject to cacheline ping-pong, as it will
often be reacquired by competing CPUs.
In mappinglookup, lookmapping is guarded by this locking scheme;
lookmapping itself takes about 20 cycles in the best (one hopes typical) case,
plus the compareobj cost (for string keys, say 50-100 cycles?).
Thus locking/unlocking the read costs 50-100 cycles while the operation
itself is 70-120 cycles; the locking overhead is comparable to the work itself,
so one might expect roughly a 50% slowdown in the dict read path.
RCU-like locking
The solution I have in mind is similar to RCU. In Python we have a quiescent
state - when a thread returns to the main loop of the interpreter.
Let's add an "owner_thread" field to each lockable object. It holds the id of
the last thread which called any lockable method on the object.
Each LOCK operation looks like:
while (ob->owner_thread != self_thread()) {
    unlock_mutex(thread_mutex[self_thread()])   // we are quiescent while we wait
    prev = ob->owner_thread
    // wait for the owning thread to reach a quiescent state
    lock_mutex(thread_mutex[prev])
    ob->owner_thread = self_thread()            // take ownership
    unlock_mutex(thread_mutex[prev])            // release the previous owner's mutex
    lock_mutex(thread_mutex[self_thread()])     // leave the quiescent state
}
No unlock is done - we own the object now and can use it without locking
(until we return to the interpreter loop or call LOCK on another object).
For non-shared objects the only penalty is the ob->owner_thread != self_thread()
check. I'm not sure about Windows, but on recent Linux one can use the %gs register
to hold the thread id, so the comparison costs about 3 cycles (and owner_thread
should be in the object's cacheline anyway).
In the contended case there is some cache ping-pong on the object and the mutex,
but that is expected.
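A minimal sketch of how this fast path could look - the names (py_thread_id,
lock_object_slow, MAX_THREADS) are mine, it uses GCC's __thread instead of touching
%gs directly, and it assumes small integer thread ids so they can index thread_mutex[]:

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread; held while
                                                       the thread is not quiescent */

/* Small integer id of the current thread, assigned at thread start.  With
   GCC's __thread the access goes through %fs/%gs, so reading it is a couple
   of instructions. */
static __thread int py_thread_id;

typedef struct {
    /* ... ordinary object fields ... */
    int owner_thread;               /* id of the last thread that LOCKed this object */
} lockable_object;

/* Contended path: the loop from the LOCK pseudocode above - wait for the
   current owner to reach a quiescent state, then take ownership. */
static void lock_object_slow(lockable_object *ob)
{
    while (ob->owner_thread != py_thread_id) {
        int prev;
        pthread_mutex_unlock(&thread_mutex[py_thread_id]);  /* be quiescent while waiting */
        prev = ob->owner_thread;
        pthread_mutex_lock(&thread_mutex[prev]);            /* owner is now quiescent */
        ob->owner_thread = py_thread_id;                    /* take ownership */
        pthread_mutex_unlock(&thread_mutex[prev]);
        pthread_mutex_lock(&thread_mutex[py_thread_id]);
    }
}

/* Fast path: a non-shared object is already owned by us, so LOCK is a single
   compare against the thread-local id. */
static void LOCK(lockable_object *ob)
{
    if (ob->owner_thread != py_thread_id)
        lock_object_slow(ob);
}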
Deadlocks
Our ownership of an object is long-lived - from getting it in LOCK until the next
quiescent state of the thread. Thus when two threads each want to step on the
other's object, they will deadlock. A simple solution is to extend the set of
quiescent states: a thread is also quiescent when it releases its thread_mutex in
the main loop (and immediately reacquires it). Additionally, it can release the
mutex just before it is going to wait on another thread's mutex, as in LOCK
(already in the code above). If you use LOCK correctly, then while you are LOCKing
an object you cannot be in a vulnerable part of any OTHER object, so it is safe
to let other threads take ownership of your own objects during that time.
One might also want to release one's own mutex when going to lock a mutex in the
threading package and in the other places where the GIL is released today (a rough
sketch follows at the end of this section).
However, I admit that I have not done a formal proof regarding deadlock; I plan
to do one if nobody finds another flaw in the proposal.
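To make the quiescent points concrete, here is a rough sketch (my names again,
reusing the per-thread mutex array from the sketch above) of what the main loop
and blocking calls would do:

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread, as above */
static __thread int py_thread_id;                   /* small integer id of this thread */

/* Quiescent point: briefly drop our own mutex so that any thread blocked in
   LOCK on one of our objects can take ownership, then pick it up again.
   The interpreter main loop would call this once per check interval. */
static void quiescent_point(void)
{
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);
    pthread_mutex_lock(&thread_mutex[py_thread_id]);
}

/* Blocking calls - the places where the GIL is released today, e.g. waiting
   on a lock from the threading module - would be bracketed the same way. */
static void call_blocking(void (*blocking_call)(void *), void *arg)
{
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);   /* enter quiescent state */
    blocking_call(arg);
    pthread_mutex_lock(&thread_mutex[py_thread_id]);     /* leave quiescent state */
}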
Big reader lock
While the above scheme might work well, it would impose a performance penalty
on shared dicts which are almost read-only (module.__dict__).
For these, similar locking can be used, except that the writer has to wait until
ALL other threads enter a quiescent state (by taking all of their locks), then
perform the change and unlock them all. Readers can read without any locking.
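The writer side of such a "big reader lock" would then look roughly like this
(again only a sketch with my names; n_threads and thread_mutex[] are assumptions
carried over from the sketches above):

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread, as above */
static __thread int py_thread_id;                   /* small integer id of this thread */
static int n_threads;                               /* number of live interpreter threads */

/* Writer side: wait until every other thread is in a quiescent state by taking
   all of the per-thread mutexes, apply the change, then release them again.
   Readers never take any lock. */
static void write_shared_dict(void (*apply_change)(void *), void *arg)
{
    int i;

    /* Be quiescent ourselves while we wait, as in LOCK, and take the mutexes
       in a fixed (increasing id) order. */
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);
    for (i = 0; i < n_threads; i++)
        pthread_mutex_lock(&thread_mutex[i]);

    apply_change(arg);                               /* e.g. insert into module.__dict__ */

    /* Release everything except our own mutex, which we keep as usual. */
    for (i = 0; i < n_threads; i++)
        if (i != py_thread_id)
            pthread_mutex_unlock(&thread_mutex[i]);
}

Releasing our own mutex first and taking them all in a fixed order is what keeps two
concurrent writers from deadlocking against each other, while readers pay nothing.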
Compatibility with 3rd party modules
I've read this argument on the python-dev list. Maybe I'm missing something,
but would it be so complex for Py_InitModule4 to use an extra flag in apiver, for
example? Whenever at least one non-free-threaded module is loaded, locking would
be done in the good old way...
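A minimal sketch of what I mean - PYTHON_API_FREETHREAD_FLAG is a hypothetical
constant, not something that exists today:

/* Hypothetical flag: modules built for free threading would OR it into the
   apiver they pass to Py_InitModule4. */
#define PYTHON_API_FREETHREAD_FLAG  0x10000

static int legacy_modules_loaded = 0;       /* > 0  =>  keep the old GIL-style locking */

/* The extra check Py_InitModule4 could do for every extension module. */
static void check_module_apiver(int apiver)
{
    if (!(apiver & PYTHON_API_FREETHREAD_FLAG))
        legacy_modules_loaded++;
}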