In order to get a better idea of where things stand, I'd like to get answers to a few questions. This isn't a traditional broad-based survey, but an attempt to get answers from a few people who might know or have good ideas. This is probably where I should have started, but better late than never.

1) How much of the Python standard library is known to be thread safe?

2) How many packages in PyPI are known to be thread safe?

3) Can you suggest another approach to getting safe high-performance shared data in concurrent operation? I've already considered:

a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.

b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed. 2) I don't know that there's a working implementation.

4) Can you suggest a minor change that would move things toward safer concurrent code with high-performance shared data? I can see two possibilities:

a) Audit any parts of the standard library that aren't already known to be thread safe, and flag those that aren't. Fixing them may need to wait on a better mechanism than POSIX locks.

b) Add a high-level, high-performance shared object facility to the multiprocessing package.

Thanks,
<mike
turtle isn't thread safe. http://bugs.python.org/issue1702036

Also, here's just a random exception:

Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import turtle
>>> turtle.forward(10)
>>> import turtle
>>> from threading import Thread
>>> class walker(Thread):
...     def run(self):
...         for i in range(100):
...             turtle.forward(10)
...             turtle.left(10)
...
>>> [walker().start() for i in range(5)]
[None, None, None, None, None]
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\python32\lib\threading.py", line 736, in _bootstrap_inner
    self.run()
  File "<stdin>", line 4, in run
  File "<string>", line 1, in forward
  File "c:\python32\lib\turtle.py", line 1637, in forward
    self._go(distance)
  File "c:\python32\lib\turtle.py", line 1605, in _go
    self._goto(ende)
  File "c:\python32\lib\turtle.py", line 3159, in _goto
    screen._pointlist(self.currentLineItem),
  File "c:\python32\lib\turtle.py", line 755, in _pointlist
    cl = self.cv.coords(item)
  File "<string>", line 1, in coords
  File "c:\python32\lib\tkinter\__init__.py", line 2162, in coords
    self.tk.call((self._w, 'coords') + args))]
  File "c:\python32\lib\tkinter\__init__.py", line 2160, in <listcomp>
    return [getdouble(x) for x in
ValueError: could not convert string to float: 'itemconfigure'
On Wed, 2 Nov 2011 12:36:26 -0700 Mike Meyer <mwm@mired.org> wrote:
In order to get a better idea of where things stand, I'd like to get answers to a few questions. This isn't a traditional broad-based survey, but an attempt to get answers from a few people who might know or have good ideas. This is probably where I should have started, but better late than never.
1) How much of the Python standard library is known to be thread safe?
It depends on what the thread-safety assumptions are. I'd say not much of it is, but not much of it *needs* to be either. For example, if you mutate the same XML tree concurrently, my opinion is that the problem is with your code, not the stdlib :-) (On the other hand, if mutating *different* XML trees concurrently produces errors, then it's a stdlib bug.)

Buffered binary file objects are known to be thread-safe. Text file objects are not, except perhaps for writing (I think we did the latter because of print() and logging; I'm not sure it's well tested, though). Raw file objects are not (they are "raw" after all: they simply expose the OS's behaviour).

As a separate issue, binary file objects forbid reentrant accesses from signal handlers. Therefore I would advocate against using print() in a signal handler (see http://bugs.python.org/issue10478).
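For illustration, here is a minimal sketch of the distinction (not from the original message; the file name is arbitrary): several threads writing to one buffered binary file object, where each individual write() call is serialised by the io layer's internal lock. Doing the same through a text-mode or raw file object, or spreading one logical record over several calls, would still need an application-level lock.

import threading

def writer(f, label):
    for i in range(1000):
        # One record per write() call; the buffered layer serialises each call.
        f.write(("%s %d\n" % (label, i)).encode("ascii"))

with open("log.bin", "wb") as f:          # buffered binary file object
    threads = [threading.Thread(target=writer, args=(f, "t%d" % n))
               for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()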
b) Add a high-level, high-performance shared object facility to the multiprocessing package.
It will be difficult (IMHO: very difficult) to devise such a thing. multiprocessing already has shared memory facilities, though they are very low-level.

Regards

Antoine.
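Roughly, those existing low-level facilities look like this (a sketch, not from the original message; the names and sizes are arbitrary): a counter and an array of doubles allocated in shared memory, with the read-modify-write on the counter done under the lock that Value() provides.

from multiprocessing import Process, Value, Array

def work(slot, counter, samples):
    samples[slot] = slot * 0.5        # each worker writes only its own slot
    with counter.get_lock():          # read-modify-write on the shared int needs the lock
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)           # shared C int, wrapped with a lock
    samples = Array("d", 4)           # shared C double array, zero-initialised
    procs = [Process(target=work, args=(n, counter, samples)) for n in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, list(samples))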
On Wed, Nov 2, 2011 at 3:36 PM, Mike Meyer <mwm@mired.org> wrote:
1) How much of the Python standard library is known to be thread safe?
2) How many packages in PyPI are known to be thread safe?
"Thread safe" isn't nearly as well-defined as many people act, and certainly doesn't mean it's safe to use something with threads. When people try to use the very, very, very few things that are thread safe without their own synchronization, they almost always end up with buggy code. It's also worth noting that many of the most important concurrency-supporting packages in PyPI don't use multithreading at all.
3) Can you suggest another approach to getting safe high-performance shared data in concurrent operation? I've already considered:
a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.
I don't see how this is feasible, or how it would make Python a better language. It would add complication that doesn't benefit most people, would slow down the normal cases, and wouldn't solve the data-sharing problem for the important cases that aren't just sharing memory between threads.
b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed. 2) I don't know that there's a working implementation.
I don't mean to be rude, but I don't understand how this is an idea at all. We already have a lot of tools for sharing data predictably among threads, concurrent tasks, processes, and machines: Queue.Queue, thread locks, callbacks, MPI, message queues, and databases, to name a few. Each of these has disadvantages and most of these have advantages.
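To make the first of those concrete, here is the usual queue-based pattern, written in Python 3 terms (queue.Queue rather than Queue.Queue); this sketch is not from the original message. The dictionary being filled in is owned by a single consumer thread, and producers only ever hand it work items through the queue.

import queue
import threading

tasks = queue.Queue()
results = {}                          # touched only by the consumer thread

def consumer():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut down
            break
        key, value = item
        results[key] = value

worker = threading.Thread(target=consumer)
worker.start()
for i in range(10):
    tasks.put((i, i * i))             # producers never touch `results` directly
tasks.put(None)
worker.join()
print(results)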
4) Can you suggest a minor change that would move things toward safer concurrent code with high-performance shared data? I can see two possibilities:
a) Audit any parts of the standard library that aren't already known to be thread safe, and flag those that aren't. Fixing them may need to wait on a better mechanism than POSIX locks.
I am not convinced that adding this at the language level would be a net good at all. Flagging things as "thread unsafe" is silly, as practically everything is thread unsafe. Flagging things as "thread safe" is seldom useful, because you should still be handling synchronization in your own code. Creating locks on everything in the stdlib would make Python bigger, more complex, and slower, and still not solve concurrency problems for users – indeed, it could make them less apparent. And none of this addresses concurrency that isn't based on multithreading, which is important and in many, many applications preferable.
b) Add a high-level, high-performance shared object facility to the multiprocessing package.
The multiprocessing module already provides means to pass data, which are fairly implicit. Trying to encapsulate the shared state as a Python object would be even more troublesome.

Mike
On Wed, Nov 2, 2011 at 3:36 PM, Mike Meyer <mwm@mired.org> wrote:
1) How much of the Python standard library is known to be thread safe?
None. Though our confidence in the threading library is fairly high (except when the underlying C library is broken).

Not so long ago, there was a series of changes to the regression tests that boiled down to getting rid of spurious failures caused by tests running serially, but in an unusual order. If that level of separation was still new, then finer-grained parallelism can't really be expected to work either.

That said, test cases relied far more on global state than a typical module itself does, so problems are far more likely to occur in user code than in the library.
a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.
If you really want to do this, you should probably make the changes at the level of "object" (or "type") and inherit them everywhere. And it may simplify things to also change the memory allocation. There are a few projects for remote objects that already use a different memory model to enforce locking; you could start there.
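As a rough illustration of the kind of object-level hook being talked about here (purely a sketch; nothing like this exists in CPython, and the class names are made up), mutating methods can check that the calling thread currently holds the object's lock:

import threading

class LockedObject:
    """Base class: mutation is only allowed while the object's lock is held."""

    def __init__(self):
        self._lock = threading.RLock()
        self._owner = None

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        return self

    def __exit__(self, *exc):
        self._owner = None
        self._lock.release()

    def _require_lock(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("mutating method called without holding the object's lock")

class LockedCounter(LockedObject):
    def __init__(self):
        super().__init__()
        self.value = 0

    def increment(self):
        self._require_lock()          # the "mutation requires a locked object" rule
        self.value += 1

c = LockedCounter()
with c:
    c.increment()                     # allowed: this thread holds the lock
try:
    c.increment()                     # rejected: the lock is not held
except RuntimeError as exc:
    print(exc)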
b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed.
Actually, that isn't so bad. Common Lisp doesn't normally type variables at the source code level, but (a) you can explicitly add typing information if you want to, and (b) the compiler can often infer types.

If you want this to mesh with Python, the constraints are similar; not only does the locking and safety marking have to be unobtrusive, it probably has to be optional. And there is existing (if largely superseded by PyPy) work on type inference for variables.

-jJ
Threads are unsafe, period. Personally, I think the threading packages should be removed from Python entirely. The GIL makes them pseudo-pointless in CPython anyway, and the headaches arising from threading are very frustrating. Personally, I would rather see an Actors library...
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen <adam.jorgensen.za@gmail.com> wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false. Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU-bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO-bound tasks (most everyone else with a network connection).

People need to remember that *concurrency is a hard problem*. That's why we layer abstractions on top of it. The threading and multiprocessing modules are both fairly low level, so they offer lots of ways to shoot yourself in the foot, but also a lot of power and flexibility. The concurrent.futures model is a higher-level abstraction that's much easier to get right.
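A minimal sketch of that concurrent.futures style (not from the original message; the URLs are placeholders): a handful of independent, blocking downloads run on a small thread pool, and the GIL is released while each worker thread waits on the network.

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

URLS = ["http://www.python.org/", "http://bugs.python.org/"]

def fetch(url):
    # Blocking network I/O; the GIL is released while waiting on the socket.
    return url, len(urlopen(url).read())

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, url) for url in URLS]
    for future in as_completed(futures):
        url, size = future.result()
        print(url, size, "bytes")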
Personally I would rather see an Actors library...
And what is an actors library going to use as its concurrency mechanism if the threading and multiprocessing modules aren't there under the hood?

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 4 November 2011 09:41, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen <adam.jorgensen.za@gmail.com> wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false. Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO bound tasks (most everyone else with a network connection).
I would love to see some actual stats on this. How many multi-threaded apps are hitting the GIL barrier, etc... Anyway, I consider myself refuted...
People need to remember that *concurrency is a hard problem*. That's why we layer abstractions on top of it. The threading and multiprocessing modules are both fairly low level, so they offer lots of ways to shoot yourself in the foot, but also a lot of power and flexibility.
The concurrent.futures model is a higher level abstraction that's much easier to get right.
Personally I would rather see an Actors library...
And what is an actors library going to use as its concurrency mechanism if the threading and multiprocessing modules aren't there under the hood?
I said nothing about removing the multiprocessing module, although it would be nice if the child processes it spawned didn't randomly zombify for no good reason. Regardless, I still think the GIL should be fixed or the threading module removed. It's disingenuous to have a threading module when it doesn't work as advertised due to an interpreter "feature" of dubious merit anyway.
Adam Jorgensen, 04.11.2011 08:53:
On 4 November 2011 09:41, Nick Coghlan wrote:
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false.
The first part, yes. The second - depends. Threading, especially when applied to the wrong task, is a very good way to give you headaches.
Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO bound tasks (most everyone else with a network connection).
I would love to see some actual stats on this. How many multi-threaded apps are hitting the GIL barrier, etc...
In the numerics corner, multi-threaded CPU-bound code is surprisingly common. In multi-server setups, people commonly employ MPI & friends, but especially on multi-core machines (usually less than 64 cores), threading is quite widely used. Proof? Cython just gained support for OpenMP-based parallel loops, due to popular request.

However, computational code usually doesn't hit the "GIL barrier", as you call it, because the really heavy computations don't run in the interpreter but straight on the iron. So, no, neither I/O-bound tasks nor CPU-bound numerics tasks get in conflict with the GIL, unless you do something wrong.

That's the main theme, BTW. If you're a frequent reader of c.l.py, you'll quickly notice that those who complain most loudly about the GIL usually just do so because they do something wrong. Threading isn't the silver bullet that you shoot at the task at hand. It's *one* way to solve *some* kinds of concurrency problems, and certainly not a trivial one.

Stefan
participants (8)
- Adam Jorgensen
- Antoine Pitrou
- Jim Jewett
- Mike Graham
- Mike Meyer
- Nick Coghlan
- Stefan Behnel
- Yuval Greenfield