
Probably I am reinventing the wheel here. I want developers to tell me why we cannot remove the GIL in the following way:

1. Remove the GIL completely, together with all the logic that currently depends on it.
2. Add its own RW-locking to all mutable objects (like list or dict).
3. Add RW-locks to every context instance.
4. Use RW-locks when accessing members of object instances.

The only reason I see not to do this is the performance of single-threaded applications. Why not reduce the locking functions for these 4 cases to stubs when only one thread is present? For atomicity, locks could be implemented like this; for example, for this source:
--------------------------------
import threading

def x():
    i = 1000
    while i:
        i -= 1

a = threading.Thread(target=x)
b = threading.Thread(target=x)
a.start()
b.start()
a.join()
b.join()
--------------------------------
In my variant this would run fully in parallel, as no common object is locked much (only the global context, when a.xxxx = yyyy is executed). I think the performance of such code would be higher than with the GIL. The other significant argument against my variant, I think, is the large number of atomic processor instructions in each thread, which hurts performance. I also know that my variant is incompatible with existing code. In summary: please say clearly why my variant has not actually been implemented. Thanks. -- Segmentation fault

Den 09.08.2011 11:33, skrev Марк Коренберг:
This has been discussed to death before, and is probably OT to this list. There is another reason than speed of single-threaded applications, but it is rather technical: as CPython uses reference counting for garbage collection, we would get "false sharing" of reference counts -- which would work as an "invisible GIL" (synchronization bottleneck) anyway. That is, if one processor writes to memory in a cache line shared by another processor, they must stop whatever they are doing to synchronize the dirty cache lines with RAM. Thus, updating reference counts would flood the memory bus with traffic and be much worse than the GIL. Instead of doing useful work, the processors would be stuck synchronizing dirty cache lines. You can think of it as a severe traffic jam.

To get rid of the GIL, CPython would either need (a) another GC method (e.g. similar to .NET or Java) or (b) another threading model (e.g. one interpreter per thread, as in Tcl, Erlang, or .NET app domains). As CPython has neither, we are better off with the GIL. Nobody likes the GIL; fork a project to write a GIL-free CPython if you can. But note that:

1. With Cython, you have full manual control over the GIL. IronPython and Jython do not have a GIL at all.

2. Much of the FUD against the GIL is plain ignorance: the GIL slows down parallel computational code, but any serious number crunching should use numerical performance libraries (i.e. C extensions) anyway. Libraries are free to release the GIL or spawn threads internally. Also, the GIL does not matter for (a) I/O-bound code such as network servers or clients and (b) background threads in GUI programs -- which are the two common use cases for threads in Python programs. If the GIL bites you, it's most likely a warning that your program is badly written, independent of the GIL issue.

There seems to be a common misunderstanding that Python threads work like fibers due to the GIL. They do not!
Python threads are native OS threads and can do anything a thread can do, including executing library code in parallel. If one thread is blocking on I/O, the other threads can continue with their business. The only thing Python threads cannot do is access the Python interpreter concurrently. And the reason CPython needs that restriction is reference counting. Sturla
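[A minimal sketch (mine, not from the thread) of that last point: blocking calls such as time.sleep() release the GIL, so four "blocking" threads overlap instead of running one after another.]

```python
import threading
import time

def wait(seconds):
    # time.sleep() releases the GIL while blocking, so these calls overlap
    time.sleep(seconds)

start = time.monotonic()
threads = [threading.Thread(target=wait, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start  # roughly 0.2 s, not the 0.8 s of serial sleeps
```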

My two Danish kroner on GIL issues... I think I understand the background and need for the GIL. Without it, Python programs would have been cluttered with lock/synchronized statements and C extensions would be harder to write. Thanks to Sturla Molden for his explanation earlier in this thread.

However, the GIL is also from a time when single-threaded programs running on single-core CPUs were the common case. On a new MacBook Pro I have 8 cores and would expect my multithreaded Python program to run significantly faster than on a one-core CPU. Instead the program slows down to much worse performance than on a one-core CPU. (Have a look at David Beazley's excellent talk at PyCon 2010 and his paper: http://www.dabeaz.com/GIL/ and http://blip.tv/carlfk/mindblowing-python-gil-2243379)

From my viewpoint the multicore performance problem is the primary problem with the GIL, even though the other issues pointed out are valid. I still believe that the solution for Python would be to have an "every object is a thread/coroutine" solution à la
- ABCL (http://en.wikipedia.org/wiki/Actor-Based_Concurrent_Language) and
- COOC (Concurrent Object Oriented C, ftp://tsbgw.isl.rdc.toshiba.co.jp/pub/toshiba/cooc-beta.1.1.tar.Z)
at least looked into as an alternative to an STM solution. But my head is not big enough to fully understand this :-) kind regards /rene

Den 12.08.2011 18:57, skrev Rene Nejsum:
It doesn't seem I managed to explain it :( Yes, C extensions would be cluttered with synchronization statements, and that is annoying. But that was not my point at all! Even with fine-grained locking in place, a system using reference counting will not scale on a multi-processor computer. Cache lines containing reference counts will become incoherent between the processors, causing a traffic jam on the memory bus. The technical term in the parallel computing literature is "false sharing".
A multi-threaded program can be slower on a multi-processor computer as well, if it suffers from extensive "false sharing" (which Python programs nearly always will do). That is, instead of doing useful work, the processors are stepping on each other's toes. So they spend the bulk of the time synchronizing cache lines with RAM instead of computing.

On a computer with a single processor, there cannot be any false sharing. So even without a GIL, a multi-threaded program can often run faster on a single-processor computer. That might seem counter-intuitive at first. I have seen this "inverse scaling" blamed on the GIL many times, but it's dead wrong.

Multi-threading is hard to get right, because the programmer must ensure that processors don't access the same cache lines. This is one of the reasons why numerical programs based on MPI (multiple processes and IPC) are likely to perform better than numerical programs based on OpenMP (multiple threads and shared memory).

As for Python, it means that it is easier to make a program based on multiprocessing scale well on a multi-processor computer than a program based on threading and releasing the GIL. And that has nothing to do with the GIL! Albeit, I'd estimate 99% of Python programmers would blame it on the GIL. It has to do with what shared memory does when cache lines are shared. Intuition about what affects the performance of a multi-threaded program is very often wrong. If one needs parallel computing, multiple processes are much more likely to scale correctly. Threads are better reserved for things like non-blocking I/O.

The problem with the GIL is merely what people think it does -- not what it actually does. It is so easy to blame a performance issue on the GIL, when it is actually the use of threads and shared memory per se that is the problem. Sturla

On Fri, Aug 12, 2011 at 12:57 PM, Rene Nejsum <rene@stranden.com> wrote:
No, sorry, the first half of this is incorrect: with or without the GIL *Python* code would need the same amount of fine-grained locking. (The part about C extensions is correct.) I am butting in because this is a common misunderstanding that really needs to be squashed whenever it is aired -- the GIL does *not* help Python code to synchronize. A thread switch can occur between any two bytecode opcodes. Without the GIL, atomic operations (e.g. dict lookups that don't require evaluation of __eq__ or __hash__ implemented in Python) are still supposed to be atomic. -- --Guido van Rossum (python.org/~guido)
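[A sketch (mine, not from the thread) of why the GIL does not synchronize Python code: "counter += 1" compiles to several bytecodes (LOAD, ADD, STORE), and a thread switch can land between any two of them, so an explicit lock is still required.]

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    # Without the lock, concurrent "counter += 1" updates can be lost
    # even under the GIL, because the read-modify-write is not atomic
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, counter is exactly 400000; drop it and the total may come up short
```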

Guido van Rossum, 12.08.2011 23:38:
And in this context, it's worth mentioning that even C code can be bitten by the GIL being temporarily released when calling back into the interpreter. Only plain C code sequences safely keep the GIL, including many (but not all) calls to the C-API. Stefan

On Sat, Aug 13, 2011 at 2:12 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
And, though mostly off-topic, the worst problem with C code, calling back into Python, and the GIL that I have seen (several times): suppose you are calling some complex C library that creates threads itself, where those threads may also call back into Python. Here you have to put a block around each Python callback that acquires the GIL before and releases it after, since the new threads (created by C code) start without the GIL acquired.

I remember a truly nasty incident where the latter was done, but the main thread did not release the GIL since it was returning directly to Python (which would of course release the GIL every so many opcodes so the callbacks would run). But under certain conditions the block with the acquire-release-GIL code around a Python callback was invoked in the main thread (when a validation problem was detected early), and since the main thread didn't release the GIL around the call into the C code, it hung in a nasty spot.

Add many layers of software, and a hard-to-reproduce error condition that triggers this, and you have a problem that's very hard to debug... -- --Guido van Rossum (python.org/~guido)

On Sat, 13 Aug 2011 09:08:16 -0400 Guido van Rossum <guido@python.org> wrote:
These days we have PyGILState_Ensure(): http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure and even dedicated documentation: http://docs.python.org/dev/c-api/init.html#non-python-created-threads ;) Regards Antoine.

On Sun, Aug 14, 2011 at 9:26 AM, Guido van Rossum <guido@python.org> wrote:
Although, if it's possible to arrange it, it's still better to do that once and then use BEGIN/END_ALLOW_THREADS to avoid the overhead of creating and destroying the temporary thread states: http://blog.ccpgames.com/kristjan/2011/06/23/temporary-thread-state-overhead... Still, it's far, far easier than it used to be to handle the GIL correctly from non-Python created threads. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Den 13.08.2011 17:43, skrev Antoine Pitrou:
These days we have PyGILState_Ensure(): http://docs.python.org/dev/c-api/init.html#PyGILState_Ensure
With the most recent Cython (0.15) we can just do:

    with gil:
        <suite>

to ensure holding the GIL. And similarly, from a thread holding the GIL:

    with nogil:
        <suite>

to temporarily release it. There is also some OpenMP support in Cython 0.15. OpenMP is much easier than messing around with threads manually (it moves all the hard parts of multithreading to the compiler). Now Cython almost makes it look Pythonic: http://docs.cython.org/src/userguide/parallelism.html Sturla

On 2011-08-11, at 21:11 , Sturla Molden wrote:
(b) another threading model (e.g. one interpreter per thread, as in Tcl, Erlang, or .NET app domains).
Nitpick: this is not correct re. Erlang. While it is correct that it uses "another threading model" (one could even say "no threading model"), it's not a "one interpreter per thread" model at all:

* Erlang uses "Erlang processes", which are very cheap preempted *processes* (no shared memory). There have always been tens of thousands to millions of Erlang processes per interpreter.

* A long time ago (before 2006 and the SMP VM, that was R11B) the Erlang VM was single-threaded, so all those Erlang processes ran in a single OS thread. To use multiple OS threads one had to create an Erlang cluster (start multiple VMs and distribute spawned processes over those). However, this was already an m:n model: there were multiple Erlang processes for each VM.

* Since the introduction of the SMP VM, the Erlang interpreter can create multiple *schedulers* (one per physical core by default), with each scheduler running in its own OS thread. In this model, there's a single interpreter and an m:n mapping of Erlang processes to OS threads within that single interpreter. (Interestingly, because -smp generates resource contention within the interpreter, going back to pre-SMP behavior by setting the number of schedulers per node to 1 can yield increased overall performance.)

Den 12.08.2011 18:51, skrev Xavier Morel:
Technically, one can make threads behave like processes if they don't share memory pages (though they will still share address space). Erlang's use of 'process' instead of 'thread' does not mean an Erlang process has to be implemented as an OS process. With one interpreter per thread, and a malloc that does not let threads share memory pages (one heap per thread), Python could do the same. On Windows, there is an API function called HeapAlloc, which lets us allocate memory from a dedicated heap. The common use case is to prevent threads from sharing memory, thus behaving like light-weight processes (except that address space is shared). On Unix, it is more common to use fork() to create new processes instead, as processes are more light-weight there than on Windows. Sturla

Even in the Erlang model, the aforementioned issues of bus contention put a cap on the number of threads you can run in any given application, assuming there's any amount of cross-thread synchronization. I wrote a blog post on this subject with respect to my experience in tuning RabbitMQ on NUMA architectures: http://blog.agoragames.com/blog/2011/06/24/of-penguins-rabbits-and-buses/

It should be noted that Erlang processes are not the same as OS processes. They are more akin to green threads, scheduled onto N OS threads which are in turn run on C cores. The end effect is the same though, as the data is effectively shared across NUMA nodes, which runs into basic physical constraints.

I used to think the GIL was a major bottleneck, and though I'm not fond of it, my recent experience has highlighted that *any* application which uses shared memory will have significant bus contention when scaling across all cores. The best course of action is shared-nothing MPI style, but in 64-bit land that can mean significant wasted address space. -Aaron

On Fri, Aug 12, 2011 at 2:59 PM, Sturla Molden <sturla@molden.no> wrote:

On 2011-08-12, at 20:59 , Sturla Molden wrote:
With one interpreter per thread, and a malloc that does not let threads share memory pages (one heap per thread), Python could do the same. Again, my point is that Erlang does not work "with one interpreter per thread". Which was your claim.

participants (11): Aaron Westendorf, Antoine Pitrou, Greg Ewing, Guido van Rossum, Nick Coghlan, Rene Nejsum, Stefan Behnel, Sturla Molden, VanL, Xavier Morel, Марк Коренберг