[Python-Dev] deja-vu .. python locking
Martin Devera
devik at cdi.cz
Mon Sep 18 15:46:02 CEST 2006
Hello,
as someone wrote in the FAQ, every now and then somebody starts a thread about
finer-grained locking in Python.
Ok, here is one.
I don't want to start a flamewar. I only seek suggestions and constructive
criticism. I have some ideas which are (I believe) new in this context, and
I only want to make them public in case someone finds them interesting.
Comments are welcome.
Martin
------------
Round 1, Greg Stein's patch
The patch removes the GIL from version 1.6 and replaces the locking of
list, dict and other structures with finer-grained locking.
The major slowdown seems to be in the list and dict structures; dicts
are used for object attributes, and these are accessed quite often.
Because (IIUC) the mutex struct is quite heavy, dicts and lists are
locked via a pool of locks. When you lock such a pooled lock you
actually have to take two locks: one locks the pool itself, the other
locks the pooled lock (the second locking can be omitted in the
non-contended case because locks in the pool are kept in a locked state).
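To make the two-lock step concrete, here is a rough sketch of how I imagine the
pooled locking - the names (lock_pool, pool_guard, N_POOL_LOCKS) are mine, not the
patch's, and it leaves out the trick of keeping the pooled locks in a locked state:

#include <stdint.h>
#include <pthread.h>

#define N_POOL_LOCKS 64                     /* fixed-size pool shared by all dicts/lists */

static pthread_mutex_t pool_guard = PTHREAD_MUTEX_INITIALIZER;  /* guards the pool itself */
static pthread_mutex_t lock_pool[N_POOL_LOCKS];                 /* the pooled locks */

static void init_lock_pool(void)
{
    int i;
    for (i = 0; i < N_POOL_LOCKS; i++)
        pthread_mutex_init(&lock_pool[i], NULL);
}

/* Map an object to one of the pooled locks by hashing its address. */
static pthread_mutex_t *pooled_lock_for(void *ob)
{
    return &lock_pool[((uintptr_t)ob >> 4) % N_POOL_LOCKS];
}

static void pooled_lock(void *ob)
{
    pthread_mutex_t *m;
    pthread_mutex_lock(&pool_guard);        /* first lock: the pool itself */
    m = pooled_lock_for(ob);
    pthread_mutex_unlock(&pool_guard);
    pthread_mutex_lock(m);                  /* second lock: the pooled lock */
}

static void pooled_unlock(void *ob)
{
    pthread_mutex_unlock(pooled_lock_for(ob));
}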
One lock takes about 25 cycles on a UP P4 (mainly the pipeline flush
during the memory barrier) and can be even more expensive (hundreds
of cycles) due to cacheline movement between CPUs on an SMP machine.
The "global" pool lock is subject to cacheline ping-pong, as it will
often be reacquired by competing CPUs.
In mappinglookup, lookmapping is guarded by this locking scheme;
lookmapping itself takes about 20 cycles in the best (one hopes typical) case,
plus the compareobj cost (for string keys, say 50-100 cycles?).
Thus locking/unlocking the read costs 50-100 cycles while the operation
itself is 70-120 cycles; the locking overhead is comparable to the work itself,
so one might expect roughly a 50% slowdown in the dict read path.
RCU-like locking
The solution I have in mind is similar to RCU. In Python we have a quiescent
state - when a thread returns to the main loop of the interpreter.
Let's add an "owner_thread" field to each lockable object. It holds the id of
the last thread which called any lockable method on the object.
Each LOCK operation looks like:
while (ob->owner_thread != self_thread()) {
    unlock_mutex(thread_mutex[self_thread()])   // we are quiescent while we wait
    prev = ob->owner_thread
    // wait for the owning thread to reach a quiescent state
    lock_mutex(thread_mutex[prev])
    ob->owner_thread = self_thread()            // take ownership
    unlock_mutex(thread_mutex[prev])            // release the previous owner's mutex
    lock_mutex(thread_mutex[self_thread()])     // leave the quiescent state
}
No unlock is done - we own the object now and can use it without locking
(until we return to the interpreter loop or call LOCK on another object).
For non-shared objects the only penalty is the ob->owner_thread != self_thread()
check. I'm not sure about Windows, but on recent Linux one can use the %gs register
to hold the thread id, so the comparison costs about 3 cycles (and owner_thread
should be in the object's cacheline anyway).
In the contended case there is some cache ping-pong on the object and the mutex,
but that is expected.
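A minimal sketch of how this fast path could look - the names (py_thread_id,
lock_object_slow, MAX_THREADS) are mine, it uses GCC's __thread instead of touching
%gs directly, and it assumes small integer thread ids so they can index thread_mutex[]:

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread; held while
                                                       the thread is not quiescent */

/* Small integer id of the current thread, assigned at thread start.  With
   GCC's __thread the access goes through %fs/%gs, so reading it is a couple
   of instructions. */
static __thread int py_thread_id;

typedef struct {
    /* ... ordinary object fields ... */
    int owner_thread;               /* id of the last thread that LOCKed this object */
} lockable_object;

/* Contended path: the loop from the LOCK pseudocode above - wait for the
   current owner to reach a quiescent state, then take ownership. */
static void lock_object_slow(lockable_object *ob)
{
    while (ob->owner_thread != py_thread_id) {
        int prev;
        pthread_mutex_unlock(&thread_mutex[py_thread_id]);  /* be quiescent while waiting */
        prev = ob->owner_thread;
        pthread_mutex_lock(&thread_mutex[prev]);            /* owner is now quiescent */
        ob->owner_thread = py_thread_id;                    /* take ownership */
        pthread_mutex_unlock(&thread_mutex[prev]);
        pthread_mutex_lock(&thread_mutex[py_thread_id]);
    }
}

/* Fast path: a non-shared object is already owned by us, so LOCK is a single
   compare against the thread-local id. */
static void LOCK(lockable_object *ob)
{
    if (ob->owner_thread != py_thread_id)
        lock_object_slow(ob);
}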
Deadlocks
Our ownership of an object is long-lived - from getting it in LOCK until the next
quiescent state of the thread. Thus when two threads each want to step on the
other's object, they will deadlock. A simple solution is to extend the set of
quiescent states: a thread is also quiescent when it releases its thread_mutex in
the main loop (and immediately reacquires it). Additionally, it can release the
mutex just before it is going to wait on another thread's mutex, as in LOCK
(already in the code above). If you use LOCK correctly, then while you are LOCKing
an object you cannot be in a vulnerable part of any OTHER object, so it is safe
to let other threads take ownership of your own objects during that time.
One might also want to release one's own mutex when going to lock a mutex in the
threading package and in the other places where the GIL is released today (a rough
sketch follows at the end of this section).
However, I admit that I have not done a formal proof regarding deadlock; I plan
to do one if nobody finds another flaw in the proposal.
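To make the quiescent points concrete, here is a rough sketch (my names again,
reusing the per-thread mutex array from the sketch above) of what the main loop
and blocking calls would do:

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread, as above */
static __thread int py_thread_id;                   /* small integer id of this thread */

/* Quiescent point: briefly drop our own mutex so that any thread blocked in
   LOCK on one of our objects can take ownership, then pick it up again.
   The interpreter main loop would call this once per check interval. */
static void quiescent_point(void)
{
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);
    pthread_mutex_lock(&thread_mutex[py_thread_id]);
}

/* Blocking calls - the places where the GIL is released today, e.g. waiting
   on a lock from the threading module - would be bracketed the same way. */
static void call_blocking(void (*blocking_call)(void *), void *arg)
{
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);   /* enter quiescent state */
    blocking_call(arg);
    pthread_mutex_lock(&thread_mutex[py_thread_id]);     /* leave quiescent state */
}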
Big reader lock
While the above scheme might work well, it would impose a performance penalty
on shared dicts which are almost read-only (module.__dict__).
For these, similar locking can be used, except that the writer has to wait until
ALL other threads enter a quiescent state (by taking all of their locks), then
perform the change and unlock them all. Readers can read without any locking.
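The writer side of such a "big reader lock" would then look roughly like this
(again only a sketch with my names; n_threads and thread_mutex[] are assumptions
carried over from the sketches above):

#include <pthread.h>

#define MAX_THREADS 128
static pthread_mutex_t thread_mutex[MAX_THREADS];   /* one mutex per thread, as above */
static __thread int py_thread_id;                   /* small integer id of this thread */
static int n_threads;                               /* number of live interpreter threads */

/* Writer side: wait until every other thread is in a quiescent state by taking
   all of the per-thread mutexes, apply the change, then release them again.
   Readers never take any lock. */
static void write_shared_dict(void (*apply_change)(void *), void *arg)
{
    int i;

    /* Be quiescent ourselves while we wait, as in LOCK, and take the mutexes
       in a fixed (increasing id) order. */
    pthread_mutex_unlock(&thread_mutex[py_thread_id]);
    for (i = 0; i < n_threads; i++)
        pthread_mutex_lock(&thread_mutex[i]);

    apply_change(arg);                               /* e.g. insert into module.__dict__ */

    /* Release everything except our own mutex, which we keep as usual. */
    for (i = 0; i < n_threads; i++)
        if (i != py_thread_id)
            pthread_mutex_unlock(&thread_mutex[i]);
}

Releasing our own mutex first and taking them all in a fixed order is what keeps two
concurrent writers from deadlocking against each other, while readers pay nothing.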
Compatibility with 3rd party modules
I've read this argument on the python-dev list. Maybe I'm missing something,
but would it be so complex for Py_InitModule4 to use an extra flag in apiver, for
example? Whenever at least one non-free-threaded module is loaded, locking would
be done in the good old way...
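A minimal sketch of what I mean - PYTHON_API_FREETHREAD_FLAG is a hypothetical
constant, not something that exists today:

/* Hypothetical flag: modules built for free threading would OR it into the
   apiver they pass to Py_InitModule4. */
#define PYTHON_API_FREETHREAD_FLAG  0x10000

static int legacy_modules_loaded = 0;       /* > 0  =>  keep the old GIL-style locking */

/* The extra check Py_InitModule4 could do for every extension module. */
static void check_module_apiver(int apiver)
{
    if (!(apiver & PYTHON_API_FREETHREAD_FLAG))
        legacy_modules_loaded++;
}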