The GIL is particularly evil on multiprocessor systems. Not just because it prevents parallel execution of Python code, but because thread switches become defunct: a thread that periodically releases and re-acquires the GIL (at check intervals) more or less always wins. On a multiprocessor system, the running thread will continue on its own processor and grab the GIL back before a sleeping thread has woken up. The cooperative multithreading intended by the check intervals only works properly on single-processor systems; Python threads only work properly on single-processor computers. On computers with multiple CPUs, the affinity of the Python process must be set to one CPU only to get proper cooperative thread scheduling. This behaviour of the GIL is a PITA, but often overlooked - all the debate seems to be about parallel scalability, which matters far less. Today, most people have multicore desktop computers (mine has four cores, for example). The Linux kernel got rid of the BKL a long time ago. It's time Python did the same.

The GIL also has its virtues. A call into a C extension is serialized by default unless the GIL is explicitly released.

An attempt was made (many years ago) to replace the GIL with fine-grained locking. It slowed down serial code. That should come as no surprise, as OS mutexes and semaphores are expensive.

There are lock-free data structures of every conceivable sort. There are multithreaded garbage collectors in Java and .NET that don't need a global lock. I've heard claims that removing the GIL would require Python to use a garbage collector instead of reference counts. That is not true. Lock-free data structures and garbage collectors have one thing in common: they use a compare-and-exchange (CAS) instruction present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is used to implement OS mutexes (sleeplocks) as well as the less expensive spinlocks. Without a CAS instruction for the platform, there would be no GIL. Every platform that has a working GIL has a working CAS.

Python could use the CAS instruction for lock-free management of reference counts. All built-in types (int, float, str, list, tuple, dict, set, etc.) could be made lock-free using CAS. That would allow the interpreter to execute bytecode without holding the GIL. Only calls to C extensions would be serialized with the GIL.

Schematically, this is how refcounts could be managed without needing a lock:

/*
 * Returns 1 if an exchange was made, i.e. (*addr == old) held before
 * the exchange; otherwise returns 0.
 *
 * Uses opcode CMPXCHG on x86.
 */
extern int Py_refcnt_CAS(Py_ssize_t *addr, Py_ssize_t old, Py_ssize_t new);

inline void Py_DECREF(PyObject *obj)
{
    register Py_ssize_t refcnt;
    do {
        refcnt = obj->ob_refcnt;
    } while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt - 1));
    if (refcnt - 1 == 0) {
        /* claim the object for deallocation exactly once */
        if (Py_refcnt_CAS(&obj->ob_refcnt, 0, -1)) {
            /* deallocate the object */
        }
    }
}

inline void Py_INCREF(PyObject *obj)
{
    register Py_ssize_t refcnt;
    do {
        refcnt = obj->ob_refcnt;
        if (refcnt < 0)
            return;  /* object is being deallocated; don't resurrect it */
    } while (!Py_refcnt_CAS(&obj->ob_refcnt, refcnt, refcnt + 1));
}

Regards, Sturla Molden
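For what it's worth, Py_refcnt_CAS maps directly onto GCC's atomic builtins (the same family of builtins mentioned later in this thread). The following is only a sketch, not tested code, and the Py_ssize_t typedef is merely a stand-in for building the snippet outside CPython:

/* Sketch: Py_refcnt_CAS on top of GCC's __sync builtins (GCC >= 4.1).
 * On x86 this compiles down to a LOCK CMPXCHG instruction. */
#include <stddef.h>

typedef ptrdiff_t Py_ssize_t;   /* stand-in; a real build gets this from Python.h */

int
Py_refcnt_CAS(Py_ssize_t *addr, Py_ssize_t old, Py_ssize_t new)
{
    /* returns 1 if *addr was equal to old and has been replaced by new */
    return __sync_bool_compare_and_swap(addr, old, new) ? 1 : 0;
}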
Benjamin Peterson wrote:
I eagerly await your patch!
I'd gladly contribute to the effort, but I did not say it is an easy task :-) I just wanted to bring up two points:

- The GIL has consequences on multicore CPUs that are overlooked: thread switches are usually missed at check intervals. This could be fixed without removing the GIL. For example, there could be a wait queue for the GIL; a thread that requests the GIL puts itself at the back (a sketch follows below).

- Also, fine-grained locking is not the only alternative to a global lock. No locks at all is even better.

Sturla Molden
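The wait-queue idea in the first point could work roughly like a ticket lock: a thread takes a number when it asks for the GIL and only proceeds when its number comes up, so the thread that just released the GIL goes to the back of the line. A sketch with made-up names, assuming GCC's __sync builtins; CPython's real GIL is not implemented this way, and a real version would block on a condition variable or event rather than spin:

/* Sketch of a "fair" GIL as a ticket lock. Illustration only. */
#include <sched.h>   /* sched_yield */

typedef struct {
    volatile unsigned long next_ticket;   /* next number to hand out */
    volatile unsigned long now_serving;   /* ticket currently allowed to run */
} fair_gil_t;

static void
fair_gil_acquire(fair_gil_t *gil)
{
    /* atomically take a ticket; __sync_fetch_and_add returns the old value */
    unsigned long my_ticket = __sync_fetch_and_add(&gil->next_ticket, 1UL);
    /* wait (politely) until it is our turn */
    while (gil->now_serving != my_ticket)
        sched_yield();
}

static void
fair_gil_release(fair_gil_t *gil)
{
    __sync_synchronize();                      /* make prior writes visible first */
    gil->now_serving = gil->now_serving + 1;   /* admit the next waiter */
}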
On Oct 20, 2009, at 4:13 PM, Sturla Molden wrote:
Benjamin Peterson wrote:
I eagerly await your patch!

I'd gladly contribute to the effort, but I did not say it is an easy task :-)
I just wanted to bring this up: [..]
- Also, fine-grained locking is not the only alternative to a global lock. No locks at all is even better.
I doubt anyone here would disagree with that, but to be taken seriously, you'll need to be *much* more specific. Enlighten us. -Casey
On Oct 20, 2009, at 5:31 PM, Sturla Molden wrote:
Casey Duncan wrote:
I doubt anyone here would disagree with that, but to be taken seriously, you'll need to be *much* more specific. Enlighten us.
I just showed you thread-safe incref and decref that don't require a lock.
They looked like spinlocks to me, I was assuming you were talking about something else even more magical. I was still considering them locks. -Casey
Casey Duncan wrote:
They looked like spinlocks to me, I was assuming you were talking about something else even more magical. I was still considering them locks.
The code is almost like a spinlock, except the refcount acts as its own "spinlock", so the overhead is roughly halved. (A separate spinlock would have to be acquired and released as well.) The failed attempt to remove the GIL used OS mutexes, which are much more expensive sleeplocks.
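To spell out the "acts as its own spinlock" point: with a separate per-object spinlock you pay for two atomic operations per refcount update instead of one. A sketch of that alternative, with a made-up struct (not a real PyObject) and GCC's __sync builtins assumed:

/* Per-object spinlock guarding a plain refcount - the more expensive
 * alternative to letting the refcount be its own "spinlock". */
typedef struct {
    volatile int lock;      /* 0 = free, 1 = held */
    long         refcnt;    /* plain, non-atomic counter */
} spinlocked_object_t;

static void
spinlocked_incref(spinlocked_object_t *obj)
{
    while (__sync_lock_test_and_set(&obj->lock, 1))   /* atomic op #1: acquire */
        ;                                             /* spin while held */
    obj->refcnt++;                                    /* now safe to update */
    __sync_lock_release(&obj->lock);                  /* atomic op #2: release */
}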
-----Original Message-----
From: Sturla Molden
Sent: 20 October 2009 22:13
To: python-ideas@python.org
Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
- The GIL has consequences on multicore CPUs that are overlooked: thread switches are usually missed at check intervals. This could be fixed without removing the GIL. For example, there could be a wait queue for the GIL; a thread that requests the GIL puts itself at the back.
This depends entirely on the platform and primitives used to implement the GIL. I'm interested in Windows. There, I found this article: http://fonp.blogspot.com/2007/10/fairness-in-win32-lock-objects.html So, you may be on to something. Perhaps a simple C test is in order, then?

I did that. I found, on my dual-core Vista machine running a "release" build, that both Mutexes and CriticalSections behaved as you describe, with no "fairness". Using a "semaphore" seems to retain fairness, however. "Fairness" was retained in debug builds too, strangely enough.

Now, Python uses none of these. On Windows, it uses an "Event" object coupled with an atomically updated counter. This also behaves fairly. The test application is attached.

I think that you ought to substantiate your claims better, maybe with a specific platform and using some test like the above. On the other hand, it shows that we must be careful what we use. There has been some talk of using CriticalSections for the GIL on Windows. This test ought to show the danger of that. The GIL is different from a regular lock. It is a reverse-lock, really, and therefore may need to be implemented in its own special way if we want very fast mutexes for the rest of the system.

(cc to python-dev)

Cheers, Kristján
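The attached test program is not reproduced here, but a check along the lines described might look like the sketch below: two threads hammer a CriticalSection, and grossly unbalanced per-thread counters indicate the kind of starvation discussed above (a sketch only, not the attachment):

/* Fairness sketch for Windows locks. With an unfair lock, the running
 * thread keeps re-acquiring the CriticalSection and the two counters
 * end up wildly unbalanced; with a fair lock they stay comparable. */
#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION cs;
static volatile LONG counters[2];
static volatile LONG stop = 0;

static DWORD WINAPI
worker(LPVOID arg)
{
    int id = (int)(INT_PTR)arg;
    while (!stop) {
        EnterCriticalSection(&cs);     /* "acquire the GIL" */
        counters[id]++;                /* pretend to run some bytecode */
        LeaveCriticalSection(&cs);     /* "release at the check interval" */
    }
    return 0;
}

int
main(void)
{
    HANDLE threads[2];
    int i;
    InitializeCriticalSection(&cs);
    for (i = 0; i < 2; i++)
        threads[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
    Sleep(2000);                       /* let the threads fight for two seconds */
    stop = 1;
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    printf("thread 0: %ld  thread 1: %ld\n", counters[0], counters[1]);
    DeleteCriticalSection(&cs);
    return 0;
}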
Sturla Molden wrote:
Benjamin Peterson wrote:
I eagerly await your patch!

I'd gladly contribute to the effort, but I did not say it is an easy task :-)
I just wanted to bring this up:
- The GIL has consequences on multicore CPUs that are overlooked: thread switches are usually missed at check intervals.
In the rare case where none of those threads are releasing the GIL for any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart, but not a particularly difficult one to deal with. That tangent aside, the major problem with removing the GIL has never been the concept of doing so, but the practicality of implementing it in a cross-platform way that doesn't hurt single threaded performance. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
-----Original Message-----
From: Nick Coghlan
Sent: 21 October 2009 12:31
To: Sturla Molden
Cc: python-ideas@python.org
Subject: Re: [Python-ideas] Remove GIL with CAS instructions?
- The GIL has consequences on multicore CPUs that are overlooked: thread switches are usually missed at check intervals.
In the rare case where none of those threads are releasing the GIL for any other reason, time.sleep(0.001) works wonders. Yeah, it's a wart, but not a particularly difficult one to deal with.
I think you may be missing the point. If the GIL is implemented in a naive way on a platform, then releasing the GIL and reclaiming it (such as is done at a check interval) can cause the same thread to get it again. Thus, the idea that the check interval should allow another thread waiting for the GIL to run would not work. Apparently, "fairness" of some locking primitives is being given up for efficiency reasons in some operating systems, like Vista. I tested this by implementing a GIL-like mechanism on a multicore machine using Windows CriticalSections and Mutexes, both of which would starve the other threads. Semaphores work, though, and the Python locks on Windows (using an atomic counter and an Event object) also work.

So, although Sturla's claim isn't valid for Windows, there might be systems where the sys.setcheckinterval mechanism for thread yielding doesn't work, due to the locks in question not being "fair" on multicore platforms.

K
Kristján Valur Jónsson wrote:
I think you may be missing the point. If the GIL is implemented in a naive way on a platform, then releasing the GIL and reclaiming it (such as is done at a check interval) can cause the same thread to get it again. Thus, the idea that the check interval should allow another thread waiting for the GIL to run would not work. Apparently, "fairness" of some locking primitives is being given up for efficiency reasons in some operating systems, like Vista. I tested this by implementing a GIL-like mechanism on a multicore machine using Windows CriticalSections and Mutexes, both of which would starve the other threads. Semaphores work, though, and the Python locks on Windows (using an atomic counter and an Event object) also work.
So, although Sturla's claim isn't valid for Windows, there might be systems where the sys.setcheckinterval mechanism for thread yielding doesn't work, due to the locks in question not being "fair" on multicore platforms.
No, I understand the fairness problem - releasing the GIL for a whole millisecond is a workaround for exactly that issue. (Python must have suffered from this back on Windows NT 4, as that is where I first discovered this trick to fix a program that wasn't switching threads properly - before I added the time.sleep call to explicitly release the GIL and give other threads a chance to run before attempting to reacquire it, the program was running each thread to completion before starting the next one). It's a sledgehammer approach to the problem though and incurs a speed penalty even on platforms where the GIL thread scheduling is fairer. So it's better if we can find a way to ensure proper scheduling fairness across all platforms. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
On Tue, Oct 20, 2009 at 15:24, Sturla Molden <sturla@molden.no> wrote:
There are lock-free data structures of every conceivable sort. There are multithreaded garbage collectors in Java and .NET that don't need a global lock. I've heard claims that removing the GIL would require Python to use a garbage collector instead of reference counts. That is not true. Lock-free data structures and garbage collectors have one thing in common: they use a compare-and-exchange (CAS) instruction present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is used to implement OS mutexes (sleeplocks) as well as the less expensive spinlocks. Without a CAS instruction for the platform, there would be no GIL. Every platform that has a working GIL has a working CAS.
CPython uses *extensive* refcounting operations on objects shared between threads (builtins, globals, literals, etc). The cache contention over those refcounts means even 2 threads are significantly slower than 1 thread, nevermind the increased contention of 3+ threads. That is why it is said a true tracing GC is needed to replace the refcounting GC. -- Adam Olsen, aka Rhamphoryncus
Hello,
The cooperative multithreading intended by the check intervals only works properly on single-processor systems; Python threads only work properly on single-processor computers.
I'm trying to work on that. I have an experimental rewrite of the GIL for POSIX systems which improves latencies, while decreasing the overhead of the GIL itself (because if checked every 100 opcodes, it can be dropped and re-taken tens of thousands of times per second...). If people have interesting workloads for py3k, the Mercurial branch can be found here: http://hg.pitrou.net/public/py3k/newgil/shortlog (my measurements, by the way, are based on a small concurrency benchmark for Python I recently wrote: http://svn.python.org/view/sandbox/trunk/ccbench/ )
There are lock-free data structures of every conceivable sort. There are multithreaded garbage collectors in Java and .NET that don't need a global lock. I've heard claims that removing the GIL would require Python to use a garbage collector instead of reference counts. That is not true. Lock-free data structures and garbage collectors have one thing in common: they use a compare-and-exchange (CAS) instruction present in most modern processors, e.g. CMPXCHG on x86. In fact, CAS is used to implement OS mutexes (sleeplocks) as well as the less expensive spinlocks. Without a CAS instruction for the platform, there would be no GIL. Every platform that has a working GIL has a working CAS.
That does not mean that CAS (or atomic increment/decrement, for that matter) is as fast as a regular (non-atomic) increment/decrement. Besides, reference counting is not the only thing to worry about. You have to:

- protect (somehow) the GC when collecting, so that things don't get modified under its feet
- protect INCREFs and DECREFs
- protect all mutable types (dicts being probably the most performance-critical, since they are used for namespaces and class/object attributes and methods)
- protect all static/global data in C modules which have them
- and probably other things I'm not thinking about

An interesting experiment would be to add all those protections without even attempting to remove the GIL, and measure the overhead compared to a regular Python. IMO, if the overhead is less than 10%, then it would probably be acceptable to go ahead and try removing the GIL. But someone needs to tackle this concretely for it to happen, because I guess these ideas have already been suggested a couple of times...

Regards

Antoine.
Antoine Pitrou wrote:
- protect (somehow) the GC when collecting, so that things don't get modified under its feet
The GC would have to temporarily suspend all active threads registered with the interpreter, e.g. using the SuspendThread function in the Windows API.
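A sketch of what that stop-the-world step might look like on Windows. The thread_handles/num_threads bookkeeping below is purely hypothetical - the interpreter would have to keep a list of handles for all registered threads other than the one running the collector:

/* Hypothetical stop-the-world for the collector, Windows only. */
#include <windows.h>

extern HANDLE thread_handles[];   /* handles of the other registered threads */
extern int num_threads;

static void
gc_stop_the_world(void)
{
    int i;
    for (i = 0; i < num_threads; i++)
        SuspendThread(thread_handles[i]);
}

static void
gc_resume_the_world(void)
{
    int i;
    for (i = 0; i < num_threads; i++)
        ResumeThread(thread_handles[i]);
}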
- protect INCREFs and DECREFs
Update using CAS (I just showed you C code for this).
- protect all mutable types (dicts being probably the most performance-critical, since they are used for namespaces and class/object attributes and methods)
CAS. Lock-free lists and hash tables exist. We don't need locks to protect mutable types and class objects. Just use a lock-free hash-table for __dict__ or whatever.
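For the record, the simplest textbook example of such a CAS-based structure is the Treiber stack sketched below. It is toy code: it ignores the ABA problem and never reclaims memory, which is exactly the hard part a lock-free replacement for __dict__ would have to solve; a real lock-free hash table is considerably more involved.

/* Toy lock-free (Treiber) stack built on CAS. Illustration only:
 * no ABA protection and no safe memory reclamation. */
#include <stdlib.h>

typedef struct node {
    struct node *next;
    void *value;
} node_t;

typedef struct {
    node_t *volatile head;
} lf_stack_t;

static void
lf_push(lf_stack_t *stack, void *value)
{
    node_t *n = malloc(sizeof(node_t));
    if (n == NULL)
        return;                         /* out of memory; toy code gives up */
    n->value = value;
    do {
        n->next = stack->head;          /* snapshot the current top */
    } while (!__sync_bool_compare_and_swap(&stack->head, n->next, n));
}

static void *
lf_pop(lf_stack_t *stack)
{
    node_t *h;
    do {
        h = stack->head;
        if (h == NULL)
            return NULL;                /* empty stack */
    } while (!__sync_bool_compare_and_swap(&stack->head, h, h->next));
    return h->value;                    /* h is leaked: safe reclamation is the hard part */
}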
- protect all static/global data in C modules which have them
Reacquire the GIL before calling functions in C extensions. The GIL is nice there, but not anywhere else.
An interesting experiment would be to add all those protections without even attempting to remove the GIL, and measure the overhead compared to a regular Python.

Yes, one could add a nonsense CAS to existing incref/decref, and measure the overhead.
Sturla Molden <sturla@...> writes:
Yes, one could add a nonsense CAS to existing incref/decref, and measure the overhead.
That would be the first thing to do IMO. If it doesn't seem to decrease performance a lot, you could continue by adding locking to dict objects, and measure again. Then add lists and sets (although I doubt sets and even lists are performance-critical for general interpreter operation). If after that you have less than a 10% slowdown (personal opinion only, other people's mileage may vary), it's probably worth going on and trying to do the whole GIL removal (which doesn't mean it would necessarily be accepted, by the way, but at least would avoid it being automatically rejected ;-)). Regards Antoine.
On Tue, Oct 20, 2009 at 17:50, Antoine Pitrou <solipsis@pitrou.net> wrote:
Sturla Molden <sturla@...> writes:
Yes, one could add a nonsense CAS to existing incref/decref, and measure the overhead.
That would be the first thing to do IMO. If it doesn't seem to decrease performance a lot, you could continue by adding locking to dict objects, and measure again. Then add lists and sets (although I doubt sets and even lists are performance-critical for general interpreter operation). If after that you have less than a 10% slowdown (personal opinion only, other people's mileage may vary), it's probably worth going on and trying to do the whole GIL removal (which doesn't mean it would necessarily be accepted, by the way, but at least would avoid it being automatically rejected ;-)).
This'd show you the minimum amount of overhead, but only for a trivial case: single threaded. You're not invoking cache contention, which in my experience is much larger than the atomic op cost, and prevents the *real* gains you hope to make. Python-safethread makes the datastructures safe (lockless fast-paths) so that it can be compared properly. I tried atomic refcounting and it sucked. My buffered bodge sucked less, but still sucked. The only option left is a real tracing GC. -- Adam Olsen, aka Rhamphoryncus
On Tue, Oct 20, 2009 at 18:22, Sturla Molden <sturla@molden.no> wrote:
Adam Olsen wrote:
This'd show you the minimum amount of overhead, but only for a trivial case: single threaded.
As far as I know, the single-threaded case is the reason for keeping the GIL.
No, single-threaded is the reason for keeping the GIL *as an option*. The complete lack of any other viable option (in CPython) is the reason it's the only option. -- Adam Olsen, aka Rhamphoryncus
On Wed, 21 Oct 2009 03:09:01 pm Adam Olsen wrote:
On Tue, Oct 20, 2009 at 18:22, Sturla Molden <sturla@molden.no> wrote:
Adam Olsen wrote:
This'd show you the minimum amount of overhead, but only for a trivial case: single threaded.
As far as I know, the single-threaded case is the reason for keeping the GIL.
No, single-threaded is the reason for keeping the GIL *as an option*. The complete lack of any other viable option (in CPython) is the reason it's the only option.
I'm not entirely sure what you're trying to say there, but I'd like to link to what Guido said over two years ago: http://www.artima.com/weblogs/viewpost.jsp?thread=214235 [quote] ... I'd welcome a set of patches into Py3k only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease. I would also be happy if someone volunteered to maintain a GIL-free fork of Python... However, I want to warn that there are many downsides to removing the GIL. ... I want to point out one more time that the language doesn't require the GIL -- it's only the CPython virtual machine that has historically been unable to shed it. [end quote] -- Steven D'Aprano
Ok, out of curiosity I gave it a try: replacing INCREF/DECREF with atomic instructions (*) slows down the interpreter by 20 to 40% depending on the workload. And keep in mind this is the tip of the iceberg: to remove the GIL you'd also have to add fine-grained locking in other places (dicts etc.). Which makes me agree with the commonly expressed opinion that CPython would probably need to ditch refcounting (at least in the critical paths) if we want to remove the GIL. Regards Antoine. (*) using gcc's atomic primitives which, I have checked, are inlined as carefully optimized assembler: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Buil...
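Something along these lines, presumably - a sketch of the kind of change being measured, not the actual patch (ob_refcnt and _Py_Dealloc are the existing CPython names; the GIL is still assumed to be held everywhere else):

/* Py_INCREF/Py_DECREF rewritten on top of GCC's atomic builtins. */
#define Py_INCREF(op) \
    (void)__sync_add_and_fetch(&((PyObject *)(op))->ob_refcnt, 1)

#define Py_DECREF(op)                                                \
    do {                                                             \
        PyObject *_py_tmp = (PyObject *)(op);                        \
        if (__sync_sub_and_fetch(&_py_tmp->ob_refcnt, 1) == 0)       \
            _Py_Dealloc(_py_tmp);                                    \
    } while (0)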
Note that Greg Stein reached this same conclusion (and similar numbers) over 10 years ago... On Tue, Nov 24, 2009 at 10:39 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Ok, out of curiosity I gave it a try: replacing INCREF/DECREF with atomic instructions (*) slows down the interpreter by 20 to 40% depending on the workload. And keep in mind this is the tip of the iceberg: to remove the GIL you'd also have to add fine-grained locking in other places (dicts etc.).
Which makes me agree with the commonly expressed opinion that CPython would probably need to ditch refcounting (at least in the critical paths) if we want to remove the GIL.
Regards
Antoine.
(*) using gcc's atomic primitives which, I have checked, are inlined as carefully optimized assembler: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Buil...
-- --Guido van Rossum (python.org/~guido)
On Tue, Nov 24, 2009 at 11:10 AM, Guido van Rossum <guido@python.org> wrote:
Note that Greg Stein reached this same conclusion (and similar numbers) over 10 years ago...
It's worth repeating this kind of experiment; the hardware landscape has changed a lot in 10 years. It's interesting that the results are the same a decade later. Collin
On Tue, Nov 24, 2009 at 10:39 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Ok, out of curiosity I gave it a try: replacing INCREF/DECREF with atomic instructions (*) slows down the interpreter by 20 to 40% depending on the workload. And keep in mind this is the tip of the iceberg: to remove the GIL you'd also have to add fine-grained locking in other places (dicts etc.).
Which makes me agree with the commonly expressed opinion that CPython would probably need to ditch refcounting (at least in the critical paths) if we want to remove the GIL.
Regards
Antoine.
(*) using gcc's atomic primitives which, I have checked, are inlined as carefully optimized assembler: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Buil...
-- --Guido van Rossum (python.org/~guido)
participants (10)
- Adam Olsen
- Antoine Pitrou
- Benjamin Peterson
- Casey Duncan
- Collin Winter
- Guido van Rossum
- Kristján Valur Jónsson
- Nick Coghlan
- Steven D'Aprano
- Sturla Molden