In order to get a better idea of where things stand, I'd like to get answers to a few questions. This isn't a traditional broad-based survey, but an attempt to get answers from a few people who might know or have good ideas. This is probably where I should have started, but better late than never.

1) How much of the Python standard library is known to be thread safe?

2) How many packages in PyPI are known to be thread safe?

3) Can you suggest another approach to getting safe high-performance shared data in concurrent operation? I've already considered:

a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.

b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed. 2) I don't know that there's a working implementation.

4) Can you suggest a minor change that would move things toward safer concurrent code with high-performance shared data? I can see two possibilities:

a) Audit any parts of the standard library that aren't already known to be thread safe, and flag those that aren't. Fixing them may need to wait on a better mechanism than POSIX locks.

b) Add a high-level, high-performance shared object facility to the multiprocessing package.

Thanks,
<mike
turtle isn't thread safe. http://bugs.python.org/issue1702036

Also, here's just a random exception:

Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import turtle
>>> turtle.forward(10)
>>> import turtle
>>> from threading import Thread
>>> class walker(Thread):
...     def run(self):
...         for i in range(100):
...             turtle.forward(10)
...             turtle.left(10)
...
>>> [walker().start() for i in range(5)]
[None, None, None, None, None]
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\python32\lib\threading.py", line 736, in _bootstrap_inner
    self.run()
  File "<stdin>", line 4, in run
  File "<string>", line 1, in forward
  File "c:\python32\lib\turtle.py", line 1637, in forward
    self._go(distance)
  File "c:\python32\lib\turtle.py", line 1605, in _go
    self._goto(ende)
  File "c:\python32\lib\turtle.py", line 3159, in _goto
    screen._pointlist(self.currentLineItem),
  File "c:\python32\lib\turtle.py", line 755, in _pointlist
    cl = self.cv.coords(item)
  File "<string>", line 1, in coords
  File "c:\python32\lib\tkinter\__init__.py", line 2162, in coords
    self.tk.call((self._w, 'coords') + args))]
  File "c:\python32\lib\tkinter\__init__.py", line 2160, in <listcomp>
    return [getdouble(x) for x in
ValueError: could not convert string to float: 'itemconfigure'
On Wed, 2 Nov 2011 12:36:26 -0700 Mike Meyer <mwm@mired.org> wrote:
In order to get a better idea of where things stand, I'd like to get answers to a few questions. This isn't a traditional broad-based survey, but an attempt to get answers from a few people who might know or have good ideas. This is probably where I should have started, but better late than never.
1) How much of the Python standard library is known to be thread safe?
It depends on what the thread-safety assumptions are. I'd say not much of it is, but not much of it *needs* to be either. For example, if you mutate the same XML tree concurrently, my opinion is that the problem is with your code, not the stdlib :-) (On the other hand, if mutating *different* XML trees concurrently produces errors, then it's a stdlib bug.)

Buffered binary file objects are known to be thread-safe. Text file objects are not, except perhaps for writing (I think we did the latter because of print() and logging; I'm not sure it's well tested, though). Raw file objects are not (they are "raw" after all: they simply expose the OS's behaviour).

As a separate issue, binary file objects forbid reentrant accesses from signal handlers. Therefore I would advocate against using print() in a signal handler (see http://bugs.python.org/issue10478).
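For illustration, here is a minimal sketch of the distinction (not from the original message; the file name is arbitrary): several threads writing to one buffered binary file object, where each individual write() call is serialised by the io layer's internal lock. Doing the same through a text-mode or raw file object, or spreading one logical record over several calls, would still need an application-level lock.

import threading

def writer(f, label):
    for i in range(1000):
        # One record per write() call; the buffered layer serialises each call.
        f.write(("%s %d\n" % (label, i)).encode("ascii"))

with open("log.bin", "wb") as f:          # buffered binary file object
    threads = [threading.Thread(target=writer, args=(f, "t%d" % n))
               for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()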
b) Add a high-level, high-performance shared object facility to the multiprocessing package.
It will be difficult (IMHO: very difficult) to devise such a thing. multiprocessing already has shared memory facilities, though they are very low-level.

Regards

Antoine.
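Roughly, those existing low-level facilities look like this (a sketch, not from the original message; the names and sizes are arbitrary): a counter and an array of doubles allocated in shared memory, with the read-modify-write on the counter done under the lock that Value() provides.

from multiprocessing import Process, Value, Array

def work(slot, counter, samples):
    samples[slot] = slot * 0.5        # each worker writes only its own slot
    with counter.get_lock():          # read-modify-write on the shared int needs the lock
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)           # shared C int, wrapped with a lock
    samples = Array("d", 4)           # shared C double array, zero-initialised
    procs = [Process(target=work, args=(n, counter, samples)) for n in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, list(samples))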
On Wed, Nov 2, 2011 at 3:36 PM, Mike Meyer <mwm@mired.org> wrote:
1) How much of the Python standard library is known to be thread safe?
2) How many packages in PyPI are known to be thread safe?
"Thread safe" isn't nearly as well-defined as many people act, and certainly doesn't mean it's safe to use something with threads. When people try to use the very, very, very few things that are thread safe without their own synchronization, they almost always end up with buggy code. It's also worth noting that many of the most important concurrency-supporting packages in PyPI don't use multithreading at all.
3) Can you suggest another approach to getting safe high-performance shared data in concurrent operation? I've already considered:
a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.
I don't see how this is feasible, or how it would make Python a better language. It would add complication that doesn't benefit most people, would slow down the normal cases, and wouldn't solve the data-sharing problem for the important cases that aren't just sharing memory between threads.
b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed. 2) I don't know that there's a working implementation.
I don't mean to be rude, but I don't understand how this is an idea at all. We already have a lot of tools for sharing data predictably among threads, concurrent tasks, processes, and machines: Queue.Queue, thread locks, callbacks, MPI, message queues, and databases, to name a few. Each of these has disadvantages and most of these have advantages.
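To make the first of those concrete, here is the usual queue-based pattern, written in Python 3 terms (queue.Queue rather than Queue.Queue); this sketch is not from the original message. The dictionary being filled in is owned by a single consumer thread, and producers only ever hand it work items through the queue.

import queue
import threading

tasks = queue.Queue()
results = {}                          # touched only by the consumer thread

def consumer():
    while True:
        item = tasks.get()
        if item is None:              # sentinel: shut down
            break
        key, value = item
        results[key] = value

worker = threading.Thread(target=consumer)
worker.start()
for i in range(10):
    tasks.put((i, i * i))             # producers never touch `results` directly
tasks.put(None)
worker.join()
print(results)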
4) Can you suggest a minor change that would move things toward safer concurrent code with high-performance shared data? I can see two possibilities:
a) Audit any parts of the standard library that aren't already known to be thread safe, and flag those that aren't. Fixing them may need to wait on a better mechanism than POSIX locks.
I am not convinced that adding this at the language level would be a net good at all. Flagging things as "thread unsafe" is silly, as practically everything is thread unsafe. Flagging things as "thread safe" is seldom useful, because you should still be handling synchronization in your own code. Creating locks on everything in the stdlib would make Python bigger, more complex, and slower, and still not solve concurrency problems for users – indeed, it could make them less apparent. And none of this addresses concurrency that isn't based on multithreading, which is important and in many, many applications preferable.
b) Add a high-level, high-performance shared object facility to the multiprocessing package.
The multiprocessing module already provides means to pass data, which are fairly implicit. Trying to encapsulate the shared state as a Python object would be even more troublesome.

Mike
On Wed, Nov 2, 2011 at 3:36 PM, Mike Meyer <mwm@mired.org> wrote:
1) How much of the Python standard library is known to be thread safe?
None. Though our confidence in the threading library is fairly high (except when the underlying C library is broken).

Not so long ago, there was a series of changes to the regression tests that boiled down to getting rid of spurious failures caused by tests running serially, but in an unusual order. If that level of separation was still new, then finer-grained parallelism can't really be expected to work either.

That said, test cases relied far more on global state than a typical module itself does, so problems are far more likely to occur in user code than in the library.
a) I proposed making actions that mutate data require locked objects, because I've seen that work in other languages. I recognize that doesn't mean it will work in Python, but it's more than I can say about the alternatives I knew about then.
If you really want to do this, you should probably make the changes at the level of "object" (or "type") and inherit them everywhere. And it may simplify things to also change the memory allocation. There are a few projects for remote objects that already use a different memory model to enforce locking; you could start there.
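As a rough illustration of the kind of object-level hook being talked about here (purely a sketch; nothing like this exists in CPython, and the class names are made up), mutating methods can check that the calling thread currently holds the object's lock:

import threading

class LockedObject:
    """Base class: mutation is only allowed while the object's lock is held."""

    def __init__(self):
        self._lock = threading.RLock()
        self._owner = None

    def __enter__(self):
        self._lock.acquire()
        self._owner = threading.get_ident()
        return self

    def __exit__(self, *exc):
        self._owner = None
        self._lock.release()

    def _require_lock(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("mutating method called without holding the object's lock")

class LockedCounter(LockedObject):
    def __init__(self):
        super().__init__()
        self.value = 0

    def increment(self):
        self._require_lock()          # the "mutation requires a locked object" rule
        self.value += 1

c = LockedCounter()
with c:
    c.increment()                     # allowed: this thread holds the lock
try:
    c.increment()                     # rejected: the lock is not held
except RuntimeError as exc:
    print(exc)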
b) Bertrand Meyer's SCOOP system, designed for Eiffel. It has two major strikes against it: 1) it is based on type attributes on *variables*, and I couldn't figure out how to translate that to a language where variables aren't typed.
Actually, that isn't so bad. Common Lisp doesn't normally type variables at the source code level, but (a) you can explicitly add typing information if you want to, and (b) the compiler can often infer types.

If you want this to mesh with Python, the constraints are similar; not only does the locking and safety marking have to be unobtrusive, it probably has to be optional. And there is existing (if largely superseded by PyPy) work on type inference for variables.

-jJ
Threads are unsafe, period. Personally, I think the threading packages should be removed from Python entirely. The GIL makes them pseudo-pointless in CPython anyway, and the headaches arising from threading are very frustrating. Personally, I would rather see an Actors library...
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen <adam.jorgensen.za@gmail.com> wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false. Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU-bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO-bound tasks (most everyone else with a network connection).

People need to remember that *concurrency is a hard problem*. That's why we layer abstractions on top of it. The threading and multiprocessing modules are both fairly low level, so they offer lots of ways to shoot yourself in the foot, but also a lot of power and flexibility. The concurrent.futures model is a higher-level abstraction that's much easier to get right.
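A minimal sketch of that concurrent.futures style (not from the original message; the URLs are placeholders): a handful of independent, blocking downloads run on a small thread pool, and the GIL is released while each worker thread waits on the network.

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

URLS = ["http://www.python.org/", "http://bugs.python.org/"]

def fetch(url):
    # Blocking network I/O; the GIL is released while waiting on the socket.
    return url, len(urlopen(url).read())

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, url) for url in URLS]
    for future in as_completed(futures):
        url, size = future.result()
        print(url, size, "bytes")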
Personally I would rather see an Actors library...
And what is an actors library going to use as its concurrency mechanism if the threading and multiprocessing modules aren't there under the hood?

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 4 November 2011 09:41, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen <adam.jorgensen.za@gmail.com> wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false. Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO bound tasks (most everyone else with a network connection).
I would love to see some actual stats on this. How many multi-threaded apps are hitting the GIL barrier, etc... Anyway, I consider myself refuted...
People need to remember that *concurrency is a hard problem*. That's why we layer abstractions on top of it. The threading and multiprocessing modules are both fairly low level, so they offer lots of ways to shoot yourself in the foot, but also a lot of power and flexibility.
The concurrent.futures model is a higher level abstraction that's much easier to get right.
Personally I would rather see an Actors library...
And what is an actors library going to use as its concurrency mechanism if the threading and multiprocessing modules aren't there under the hood?
I said nothing about removing the multiprocessing module, although it would be nice if the child processes it spawned didn't randomly zombify for no good reason. Regardless, I still think the GIL should be fixed or the threading module removed. It's disingenuous to have a threading module when it doesn't work as advertised due to an interpreter "feature" of dubious merit anyway.
Adam Jorgensen, 04.11.2011 08:53:
On 4 November 2011 09:41, Nick Coghlan wrote:
On Fri, Nov 4, 2011 at 5:03 PM, Adam Jorgensen wrote:
The GIL makes them pseudo-pointless in CPython anyway and the headaches arising from threading are very frustrating.
This is just plain false.
The first part, yes. The second - depends. Threading, especially when applied to the wrong task, is a very good way to give you headaches.
Threads are still an excellent way to take a synchronous operation and make it asynchronous. Take a look at concurrent.futures in 3.2, which makes it trivial to take independent blocking tasks and run them in parallel. The *only* time the GIL causes problems is when you have CPU bound threads written in pure Python. That's only a fraction of all of the Python apps out there, many of which are either calling out to calculations in C or FORTRAN (scientific community, financial community) or else performing IO bound tasks (most everyone else with a network connection).
I would love to see some actual stats on this. How many multi-threaded apps are hitting the GIL barrier, etc...
In the numerics corner, multi-threaded CPU-bound code is surprisingly common. In multi-server setups, people commonly employ MPI & friends, but especially on multi-core machines (usually less than 64 cores), threading is quite widely used. Proof? Cython just gained support for OpenMP-based parallel loops, due to popular request.

However, computational code usually doesn't hit the "GIL barrier", as you call it, because the really heavy computations don't run in the interpreter but straight on the iron. So, no, neither I/O-bound tasks nor CPU-bound numerics tasks get in conflict with the GIL, unless you do something wrong.

That's the main theme, BTW. If you're a frequent reader of c.l.py, you'll quickly notice that those who complain most loudly about the GIL usually just do so because they do something wrong. Threading isn't the silver bullet that you shoot at the task at hand. It's *one* way to solve *some* kinds of concurrency problems, and certainly not a trivial one.

Stefan
participants (8)
- Adam Jorgensen
- Antoine Pitrou
- Jim Jewett
- Mike Graham
- Mike Meyer
- Nick Coghlan
- Stefan Behnel
- Yuval Greenfield