
Thread synchronization with threading.Lock can be expensive. But consider this: why should the active thread need to acquire a mutex, when it already holds one? That would be the GIL.

Instead of acquiring a lock (and possibly inducing thread switching etc.), it could just deny the other threads access to the GIL for a while. The cost of that synchronization method would be completely amortized, as check intervals happen anyway.

Here is what a very naïve implementation would look like in ctypes (real code could use C instead, or perhaps should not attempt this at all...):

    from contextlib import contextmanager
    import ctypes

    _Py_Ticker = ctypes.c_int.in_dll(ctypes.pythonapi, "_Py_Ticker")

    @contextmanager
    def threadsafe():
        tmp = _Py_Ticker.value
        _Py_Ticker.value = 0x7fffffff
        yield
        _Py_Ticker.value = tmp

Now we can do this:

    with threadsafe():
        # the GIL is mine,
        # for as long as I want
        pass

The use case for this "gillock" is about the same as for a spinlock in C: we want synchronization for a brief period of time, but don't want the overhead of acquiring a mutex. In Python this gillock has one big advantage over a spinlock: we don't have to wait, so we don't risk a thread switch on __enter__/acquire. But there can be only one instance of this lock, as there is only one GIL. That is the drawback compared to a spinlock. Therefore I think both a spinlock and a gillock should be added to the threading module. These are synchronization methods that should be available.

P.S. A gillock working like the ctypes code above is of course very dangerous. If a C extension releases the GIL while _Py_Ticker is astronomical, we have a very bad situation... But real code could try to safeguard against this. E.g. Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS could be defined to do nothing if _Py_Ticker is above some ridiculous threshold.

P.P.S. Yes, I know about the newgil. But I have not thought about how to achieve a similar effect with that.

Sturla
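[Editor's note: the _Py_Ticker sketch above is CPython-2.x-specific and not runnable today. For contrast, here is the portable pattern whose overhead the gillock proposal tries to avoid: a tiny critical section guarded by a plain threading.Lock. The names (`counter`, `bump`) are illustrative only.]

```python
import threading

counter = 0
counter_lock = threading.Lock()  # ordinary mutex guarding the shared counter


def bump(n):
    """Increment the shared counter n times, one tiny critical section per step."""
    global counter
    for _ in range(n):
        with counter_lock:  # the acquire/release cycle a gillock would skip
            counter += 1


threads = [threading.Thread(target=bump, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: no increments lost
```

Each iteration pays the full lock acquire/release cost even though the protected work is a single addition; that per-iteration overhead is exactly what the proposal aims to amortize.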

On Tue, Jul 20, 2010 at 8:51 PM, Sturla Molden <sturla@molden.no> wrote:
[...snip...]
Regardless of the rest of the proposal (which I'm not keen on; I don't want to rely on the GIL "being there") you need to factor the newgil code into this equation as a change like this would only land in the 3.x branch. With 2.7 out the door - 3.x is, for all effective purposes, trunk. jesse

Den 21.07.2010 02:57, skrev Jesse Noller:
Yes, but the principle of denying other threads access to the GIL for fast synchronization still applies to 3.x. Java has "synchronized" blocks too. This is not very different. But monopolizing the GIL is much faster than a lock if the purpose is just to guard a tiny piece of code. Sturla

Den 21.07.2010 02:57, skrev Jesse Noller:
I have looked briefly at it now. It seems to be just as easy with newgil, and possibly much safer. The functions drop_gil and take_gil in ceval_gil.h could e.g. be modified to just return and do nothing if a global deny flag is set. But I have to look more carefully at this later on. Sturla

On Tue, Jul 20, 2010 at 9:40 PM, Sturla Molden <sturla@molden.no> wrote:
That's all I was trying to say :) Your original email said "Yes I know about the newgil. But I have not thought about how to achieve similar effect with that." I see what you're trying to do (impersonating the synchronized keyword java has) but I'm a little skeeved out by adding anything like this which is directly reliant on the GIL's existence.

Den 21.07.2010 03:58, skrev Jesse Noller:
It is not reliant on the GIL. Sorry if you got that impression. In a GIL-free world, a global spinlock would serve the same purpose (cf. Java). But as there is a GIL in Python, we cannot use a spinlock to avoid the overhead of a mutex. But we can temporarily hold on to the GIL instead, and achieve the same effect. This is very close to Java's synchronized keyword, yes.

The main reason is that most synchronizations in multi-threaded apps are used to protect very small pieces of code. A mutex is overkill for that. Instead of:

1. release gil.
2. acquire lock.
3. re-acquire gil.
4. release lock.

we could just:

1. keep the gil for a little while.

Also note that in the single-threaded case, the overhead from this "synchronized" block would be close to zero. It would do nothing, except write to the address of a volatile int.

Sturla
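[Editor's note: the Java-style synchronized semantics discussed here can already be approximated in pure Python with one shared RLock. A hedged sketch follows; the `synchronized` decorator and module-level `_sync_lock` are illustrative names, not an existing API.]

```python
import threading
from functools import wraps

_sync_lock = threading.RLock()  # one process-wide lock, analogous to the single GIL


def synchronized(func):
    """Run func under the shared lock, mimicking Java's synchronized keyword."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with _sync_lock:
            return func(*args, **kwargs)
    return wrapper


@synchronized
def critical_section(data, item):
    data.append(item)
    return len(data)


shared = []
print(critical_section(shared, "a"))  # 1
print(critical_section(shared, "b"))  # 2
```

Like the proposed gillock, this gives only one global lock for the whole program; unlike the gillock, it still pays the mutex acquire/release cost on every entry.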

Den 21.07.2010 02:51, skrev Sturla Molden:
Actually, a spinlock would probably not even be feasible in Python: the GIL is a mutex, and we would have to give it up before we could spin on the lock, and reacquire it afterwards. So the cost of a kernel mutex object is still there. The gillock is possibly the only way of getting a fast lock object in Python. Sturla

On 21Jul2010 02:51, Sturla Molden <sturla@molden.no> wrote:
| Thread synchronization with threading.Lock can be expensive. But
| consider this: why should the active thread need to acquire a mutex,
| when it already holds one? That would be the GIL.
|
| Instead of acquiring a lock (and possibly inducing thread switching
| etc.), it could just deny the other threads access to the GIL for a
| while. The cost of that synchronization method would be completely
| amortized, as check intervals happen anyway.
[...]
| The use case for this "gillock" is about the same as for a spinlock in
| C. We want synchronization for a brief period of time, but don't
| want the overhead of acquiring a mutex.

But a spinlock _does_ acquire a mutex, unless I misremember. It is just that instead of using a more expensive mutex mechanism that deschedules the caller until available, it sits there banging its head against a very lightweight (non-descheduling) mutex until it gets it; the efficiency is entirely reliant on _all_ users of the spinlock doing very little inside the locked period.

And therein lies one of my two misgivings about this idea: users of this must be very careful at all times to do Very Little inside the lock. Just like a spinlock. This seems very easy to misuse.

The other misgiving is one you've already mentioned in passing:

| P.S. A gillock working like the ctypes code above is of course very
| dangerous. If a C extension releases the GIL while _Py_Ticker is
| astronomical, we have a very bad situation... But real code could try
| to safeguard against this. E.g. Py_BEGIN_ALLOW_THREADS and
| Py_END_ALLOW_THREADS could be defined to do nothing if _Py_Ticker is
| above some ridiculous threshold.

Suppose someone goes:

    with threadsafe():
        x = fp.read(1)

and the read blocks? It looks like trivial code but I think the I/O stuff will release the GIL over a read. Now you're not safe any more.
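[Editor's note: the "do Very Little inside the lock" hazard is easy to demonstrate with an ordinary threading.Lock. In this sketch (all names illustrative) one thread holds the lock across a blocking call, simulated with time.sleep, and a second thread's supposedly tiny critical section stalls behind it.]

```python
import threading
import time

lock = threading.Lock()
waited = []


def blocking_holder():
    with lock:
        time.sleep(0.2)  # stands in for a blocking read inside the critical section


def short_user():
    start = time.monotonic()
    with lock:           # a "tiny" critical section...
        pass
    waited.append(time.monotonic() - start)  # ...stalled by the blocking holder


t1 = threading.Thread(target=blocking_holder)
t1.start()
time.sleep(0.05)         # let t1 take the lock first
t2 = threading.Thread(target=short_user)
t2.start()
t1.join()
t2.join()

print(waited[0] > 0.1)   # True: the short critical section waited ~0.15 s
```

With a gillock the failure mode is worse still: it is not one lock's users that stall, but every Python thread in the process.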
Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The ZZR-1100 is not the bike for me, but the day they invent "nerf" roads and ban radars I'll be the first in line......AMCN

On Wed, 21 Jul 2010 16:22:27 +1000 Cameron Simpson <cs@zip.com.au> wrote:
Not only that, but optimized OSes and libcs might do the same as an optimization for regular mutexes. For example, here's what the pthreads(7) man page says: “thread synchronization primitives (mutexes, thread joining, etc.) are implemented using the Linux futex(2) system call.” I don't know exactly how futex(2) works, but it looks like a kind of low-level API that allows building spinlocks and other fast userspace primitives. Regards Antoine.

On Wed, 21 Jul 2010 02:51:53 +0200 Sturla Molden <sturla@molden.no> wrote:
Ok, here is the cost of acquiring and releasing an uncontended lock under Linux, with Python 3.2:

    $ ./python -m timeit \
      -s "from threading import Lock; l=Lock(); a=l.acquire; r=l.release" \
      "a(); r()"
    10000000 loops, best of 3: 0.127 usec per loop

And here is the cost of calling a dummy Python function:

    $ ./python -m timeit -s "def a(): pass" "a(); a()"
    1000000 loops, best of 3: 0.221 usec per loop

And here is the cost of calling a trivial C function (which returns the False singleton):

    $ ./python -m timeit -s "a=bool" "a(); a()"
    10000000 loops, best of 3: 0.164 usec per loop

Also, note that using the lock as a context manager is actually slower, not faster as you might imagine:

    $ ./python -m timeit -s "from threading import Lock; l=Lock()" \
      "with l: pass"
    1000000 loops, best of 3: 0.242 usec per loop

At least under Linux, there doesn't seem to be a lot of room for improvement in lock performance, to say the least.

PS: RLock is now as fast as Lock:

    $ ./python -m timeit \
      -s "from threading import RLock; l=RLock(); a=l.acquire; r=l.release" \
      "a(); r()"
    10000000 loops, best of 3: 0.114 usec per loop
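[Editor's note: the shell measurements above can be reproduced from inside Python with the timeit module. Absolute numbers depend on machine and interpreter version, so only the relative comparison is meaningful; the loop count here is reduced for speed.]

```python
import timeit

N = 100_000

# Uncontended Lock acquire/release, with the methods pre-bound as in the shell runs.
lock_cost = timeit.timeit(
    "a(); r()",
    setup="from threading import Lock; l = Lock(); a = l.acquire; r = l.release",
    number=N,
)

# Baseline: two calls to a dummy pure-Python function.
call_cost = timeit.timeit("a(); a()", setup="def a(): pass", number=N)

# Lock used as a context manager (extra __enter__/__exit__ dispatch per loop).
ctx_cost = timeit.timeit(
    "with l: pass",
    setup="from threading import Lock; l = Lock()",
    number=N,
)

for name, cost in [("acquire/release", lock_cost),
                   ("dummy calls", call_cost),
                   ("context manager", ctx_cost)]:
    print(f"{name}: {cost / N * 1e6:.3f} usec per loop")
```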

participants (4)
- Antoine Pitrou
- Cameron Simpson
- Jesse Noller
- Sturla Molden