
I would like to direct your attention to a presentation given by Mr. Herb Sutter on the subject of concurrency support in programming languages. It's a long video, but I would urge you to watch at least the first five minutes of it: http://video.google.com/videoplay?docid=7625918717318948700&q=herb+sutter Mr. Sutter is the secretary of the ISO/C++ standards committee. I mention this only so that you'll understand he's not just some random yo-yo with an opinion. (There are a number of other papers by Mr. Sutter on the same topic which you can google for; however, none of them are as compelling or comprehensive as the video presentation, IMHO.)

The gist of the introduction is this (my words, not his): starting from now, and for the rest of your career as a programmer, you had better start programming for massively concurrent systems or you are doomed to irrelevance. As he points out in the talk, the good news is that Moore's law is going to continue for several decades to come. The bad news is that most people don't understand Moore's law or what they get from it. Moore's law is really about the number of transistors on a chip; it says nothing about processor speed or performance. He calls it "the end of the free lunch", meaning that in the past you could write a program, wait a little while, and it would run faster - because the hardware got faster while you were waiting. This will no longer be true for single-threaded programs. Scalar processors, with all of their pipelining, prefetching and other tricks, have gotten about as fast as they are ever going to get; we've reached the point of diminishing returns. So the only thing left to do is throw more and more cores on the chip. There's no physical reason why the number of cores won't continue to double every 18 months; the only reason manufacturers might choose not to make 128-core CPUs by the year 2012 is that there won't be enough software to take advantage of that degree of parallelism. (Even with today's transistor count, you could put 100 Pentium-class processors on a single die.)

Moreover, programming languages which have built-in support for scaling to many processors are going to win in the long run, and programming languages that don't have strong, built-in concurrency support will gradually fade away. The reason for this is simple: people don't like using applications which don't get faster when they buy a better computer.

Now, on the server side, all of this is a solved problem. We have a very robust model for handling hundreds of requests in parallel, and a very strong model for dealing with shared state. This model is called "transactions", or more formally ACID. On the client side, however, things are much worse. As he puts it, locks are the best we have, and locks suck. Locks are not "logically composable", in the sense that you can't simply take two pieces of software that were written independently and glue them together unless they share a common locking architecture. This ruins a large part of the power of modern software, which is the ability to compose software out of smaller, independently-written components. On the client side, you have heterogeneous components accessing shared state at high bandwidth, with myriads of pointer-based references to that shared state. Add concurrency, and what you end up with is a system in which it is impossible to make logical inferences about the behavior of the system. (And before you ask - yes, functional languages - and their drawbacks - are addressed in the talk. The same is true for 'lock-free programming' and other techniques du jour.)

Now, in the talk he proposes some strategies for building concurrency into the language so that client-side programming can be done in a way that isn't rocket science. We won't be able to get rid of locks, he claims, but with the right language support at least we'll be able to reason about them, and maybe push them off to the side so that they mostly stay in their corner. In any case, what I would like to open up is a discussion of how these ideas might influence the design of the Python language.

One thing that is important to understand is that I'm not talking about "automatic parallelization", where the compiler automatically figures out what parts can be parallelized. That would require so much change to the Python language as to make it no longer Python. Nor am I talking about manually creating "thread" objects and having to carefully place protections around any shared state. That kind of parallelism is too low-level, too coarse-grained, and too hard to do, even for people who think they know how to write concurrent code. I am not even necessarily talking about changing the Python language (although certainly the internals of the interpreter will need to be changed). The same language can be used to describe the same kinds of problems and operations, but the implications of those language elements will change. This is analogous to the fact that the massively multicore CPUs of 10 years from now will most likely be using the same instruction sets as today - but that does not mean that programming as a discipline will be anything like what it is now.

As an example of what I mean, suppose the Python 'with' statement could be used to indicate an atomic transaction. Any operations done within that block would either be committed or rolled back. This is not a new idea - here's an interesting paper on it: http://www-sal.cs.uiuc.edu/~zilles/tm.html I'm sure that there's a lot more like that out there.

However, there is also a lot of stuff out there that is *superficially* similar to what I am talking about, and I want to make the distinction clear. For example, any discussion of concurrency in Python will naturally raise the topic of both IPython and Stackless. However, IPython (from what I understand) is really about distributed computing and not so much about fine-grained concurrency; and Stackless (from what I understand) is really about coroutines or continuations, which is a different kind of concurrency. Unless I am mistaken (and I very well could be), neither of these is immediately applicable to the problem of authoring Python programs for multi-core CPUs, but I think that both of them contain valuable ideas that are worth discussing.

Now, I will be the first to admit that I am not an expert in these matters. Don't let my poor, naive ideas be a limit to what is discussed. What I've primarily tried to do in this posting is to get your attention and convince you that this is important and worth talking about. And I hope to be able to learn something along the way.

My overall goal here is to be able to continue writing programs in Python 10 years from now, not just as a hobby, but as part of my professional work. If Python is able to leverage the power of the CPUs that are being created at that time, I will be able to make a strong case for doing so. On the other hand, if I have a 128-core CPU on my desk, and Python is only able to utilize 1/128th of that power without resorting to tedious calculations of race conditions and deadlocks, then it's likely that my Python programming will be relegated to the role of a hobby. -- Talin

Talin wrote:
A very nice introduction, Talin. I will certainly look at the video tomorrow morning. Thanks. My first thought is that it's not quite as bad as it seems, because any third-party extensions will be able to use the remaining 127/128ths of the power. ;-) You would need to subtract any CPUs used by the OS and other concurrent processes. (These would probably continue to use more resources as well.) I wonder if some of the CPUs would be definable for special purposes or not? Maybe 64 of them set aside for simultaneous parallel calculations? There may be a way to express to the OS that a particular process on a data structure be carried out as 'wide' as possible ('narrow' as possible being on a single CPU in a single thread). It might be very nice for these 'wide' structures to have their own API as well. All speculation of course. ;-) Cheers, Ron

The problem is that right now Python (on my Linux box) will only let me have 381 threads. And if we want concurrent programming to be made easy, the user is not supposed to start their programs with "how many threads can I create? 500? OK, so I should partition my app like this". This leads to: a) applications that won't get faster when the user can create more than 500 threads, or b) a lot of complicated logic to partition the software at runtime. The approach should go the other way around: make all the threads you can think of; if there are enough cores, they will run in parallel. jm2c. Regards, Lucio.

"Lucio Torre" <lucio.torre@gmail.com> wrote:
But it's not about threads; it is about concurrent execution of code (which threads in Python do not allow). The only way to allow this is to basically attach a re-entrant lock to every single Python object (depending on the platform, perhaps 12 bytes minimum for count, process, thread). The sheer number of acquire/release cycles during execution is staggering (think about the number of incref/decref operations), and the increase in size of every object by around 12 bytes is not terribly appealing. On the upside, this is possible (change the PyObject_HEAD macro, Py_INCREF, Py_DECREF, remove the GIL), but the amount of work necessary to actually make it happen is huge, and it would likely result in negative performance until sheer concurrency wins out over the acquire/release overhead. - Josiah
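To get a feel for the cost, here is a toy Python-level stand-in for the idea (the real change would be in C, the class name is made up, and this doesn't even model the incref/decref traffic):

    import threading

    class LockedObject(object):
        # Toy illustration of per-object locking: every attribute read
        # or write acquires a re-entrant lock, roughly the traffic every
        # PyObject would see if the GIL were replaced by per-object locks.
        def __init__(self):
            object.__setattr__(self, '_lock', threading.RLock())

        def __getattribute__(self, name):
            lock = object.__getattribute__(self, '_lock')
            lock.acquire()
            try:
                return object.__getattribute__(self, name)
            finally:
                lock.release()

        def __setattr__(self, name, value):
            lock = object.__getattribute__(self, '_lock')
            lock.acquire()
            try:
                object.__setattr__(self, name, value)
            finally:
                lock.release()

Time a tight loop of attribute accesses on one of these against a plain object and the "negative performance until concurrency wins out" point becomes very concrete.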

Josiah Carlson wrote:
It seems to me some types of operations are more suited to concurrent operation than others, so maybe new objects designed to be naturally usable in this way could help. Or maybe there's a way to lock groups of objects at the same time by having them share a lock if they are related? I imagine there will be some low-level C support that could be used transparently, such as copying large areas of memory with multiple CPUs. These may even be the existing C copy functions reimplemented to take advantage of multiple-CPU environments, so new versions of Python may make limited use of this even if no support is explicitly added.

Thinking out loud about ways a Python program may use concurrent processing:

* Applying a single function concurrently over a list. (A more limited function object might make this easier.)
* Feeding a single set of arguments concurrently over a list of callables.
* Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load)
* Listcomps that perform the same calculation on each item may be a natural multi-processing structure.

Ron

Ron Adam <rrr@ronadam.com> wrote:
That is a fine-grained vs. coarse-grained locking argument. There is literature. On the other hand, there is also the pi-calculus: http://scienceblogs.com/goodmath/2007/03/an_experiment_with_calculus_an_1.ph... Of course, the pi-calculus is not reasonable unless one starts with a lambda calculus and decides to modify it (parallel lisp?), so it isn't all that applicable here.
These examples are all what is generally referred to as "embarrassingly parallel" in the literature. One serious issue with programming parallel algorithms generally is that not all algorithms are necessarily parallelizable. Some are, certainly, but not all. The task is to discover those alternate algorithms that *are* parallelizable in such a way as to offer gains that are "worth it". Personally, I think that if there were a *cheap* IPC to make cross-process calls not too expensive, many of the examples above that you talk about would be handled easily. - Josiah

Josiah Carlson wrote:
Interesting, but I think you are correct: pi-calculus is probably better suited to a special-purpose library. Here are some more complete examples of what I was thinking, from a Python programmer's point of view. The underlying implementation would need to be worked out, of course, either with locks, or messages, or by limiting access to mutable objects in some other way.
(1) xyz = vector(r)
    forall coords as c:    # parallel modify coords 'inplace' with body
        c = rotate(c, xyz)

    * 'with' syntax form, because 'c' does not outlive the body.
* Feeding a single set of arguments concurrently over a list of callables.
(2) x = 42
    result = [None] * len(funcs)
    forall (funcs, result) as (f, r):
        r = f(x)
* Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load)
(3) This corresponds nicely to the suggested use of 'future' in the video.

    def foo(x=42):
        # Starts when the process is created instead of waiting
        # for the f.next() call to start.
        while 1:
            future yield x
            x += 42

    f = foo()          # start a concurrent process
    result = f.wait()  # waits here for the value if it's not ready
* Listcomps that perform the same calculation on each item may be a natural multi-processing structure.
(4) result = [x**2 forall x in values]
Yes, they are obvious, but why not start with the obvious? It's what most users will understand first, and it may be that the less obvious stuff can be put in terms of the more obvious "embarrassingly parallel" things.
In the video he also talked about avoiding locks. I was thinking a more limited function object (for concurrent uses only) might be used that has:

* no global keyword
* only access to external immutable objects
* no closures

In other words, these need to pass 'values' explicitly at call time and return time (or 'future yield' time). Ron
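A rough sketch of how such a restricted function object might be prototyped today, as a decorator that rejects closures and rebuilds the function with an empty globals dict ('isolated' is a made-up name, and the immutable-objects-only rule would still need interpreter support):

    import types

    def isolated(func):
        # Reject closures: the function may only receive values explicitly.
        if func.func_closure:
            raise TypeError('%s closes over outer names' % func.func_name)
        # Rebuild with minimal globals: no module globals are reachable,
        # so the 'global' keyword and external mutable state are cut off.
        safe_globals = {'__builtins__': __builtins__}
        return types.FunctionType(func.func_code, safe_globals,
                                  func.func_name, func.func_defaults)

    @isolated
    def work(data):
        return sum(x * x for x in data)   # only arguments and builtins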

On 3/22/07, Ron Adam <rrr@ronadam.com> wrote:
On the upside, this is possible (change the PyObject_HEAD macro, Py_INCREF, Py_DECREF, remove the GIL), but the amount of work
The zillions of details are the same details you have to consider when writing a concurrent database. Having python store its objects externally and retrieve them when needed (vs allocating memory and pointers) would be a huge one-time loss, but it might eventually be worthwhile. PyPy calls that "external database" an "object_space", so the language support is already there, and still will be when the hardware makes it worthwhile.
Why not just keep using

    xyz = vector(r)
    rotate(xyz)

and let the compiler take care of it? At most, we would want a way of marking objects as "read-only" or callables as "instant", so that the compiler would know it doesn't have to worry about the definition of "len" changing mid-stream. (Ron suggests something similar at the end of the message, and Talin's Transactional metaclass is related.)
* Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load)
On its own, this just makes response time worse. That said, generators are also callables, and the compiler might do a better job if it knew that it wouldn't have to worry about external redefinitions. There may also be some value in some sort of "idletasks" abstraction that says "hey, go ahead and precompute this, but only if you've got nothing better to do." Garbage collection could certainly benefit from this, as could translating (or duplicating) string representations into multiple encodings.
In the video he also talked about avoiding locks. I was thinking a more limited function object (for concurrent uses only) might be used that has:
* no global keyword
* only access to external immutable objects
Or at least, it doesn't mutate them itself, and it doesn't promise to use newer versions that get created after the call begins.
* no closures.
This level of information should be useful. But in practice, most of my functions already meet this definition. (In theory, they access builtins that could be replaced, etc.) I wouldn't want to mark them all by hand. I'm still inclined to trust the compiler, and just accept that it may eventually be the PyPy translator rather than the CPython interpreter. -jJ

Jim Jewett wrote:
Just how much should the 'python' compiler do? And what about dynamic modeling where things change according to input? (Not just thinking of graphics.)
Would it be possible to have a container whose contents are read-only (as a side effect of being in the container)? Then individual items would not need their own read-only attributes. A few other thoughts... Possibly, when a module is first imported or run, nothing needs to be marked read-only. Then when execution falls off the end, *everything* existing at that point is marked read-only. If the module was run as a script, then a main function is executed. These could be optional settings that could be turned on:

    from __far_future__ import multi_processing
    __multi_processing__ = True
    __main__ = "my_main"

So then there would be an initiation first pass followed by an execution phase. In the initiation phase it's pretty much just the way things are now, except you can't use threads. In the execution phase you can use threads, but you can't change anything created in the initiation phase. (Other controls would still be needed, of course.)
Yes, you really wouldn't use these for very simple counters. For more complex things, the calling overhead becomes a much smaller percentage of the total, and it would have a bigger effect on smoothing out response time rather than hurting it.
Having a way to set process priority may be good. (or bad)
The difficulty is that even a method call on a global (or parent scope) object can result in it being mutated. So you are either back to using locks and/or transactions on everything.
Mine too. Ron

Josiah Carlson wrote:
A couple of ideas I wanted to explore:

-- A base class or perhaps metaclass that makes Python objects transactional. This wraps all attribute access so that what you see is your transaction's view of the current state of the object. Something like:

    with atomic():
        obj1.attribute = 1
        # Other threads can't 'see' the new value until the
        # transaction commits.
        value = obj1.attribute

'atomic()' starts a new transaction in your thread, which is stored in thread-local data; any objects that you mutate become part of the transaction automatically (we will have to be careful about built-in mutable objects such as lists). At the end, the transaction either commits, or, if there was a conflict, it rolls back and you get to do it over again. This would be useful for large networks of objects, where you want to make large numbers of local changes, and each local change affects an object and perhaps its surrounding objects. What I am describing is very similar to many complex 3D game worlds. ZODB already has a transactional object mechanism, although it's oriented towards database-style transactions and object persistence.

-- A way to partition the space of Python objects, such that objects in each partition cannot have references outside of the partition without going through some sort of synchronization mechanism, perhaps via some proxy. The idea is to be able to guarantee that shared state is only accessible in certain ways that you can reason about.

-- Talin
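A minimal sketch of how the 'atomic()' part might be prototyped in pure Python, with thread-locally buffered writes but no conflict detection or automatic retry (all names hypothetical):

    from __future__ import with_statement   # Python 2.5
    import threading

    _active = threading.local()     # per-thread transaction state

    class atomic(object):
        # Minimal sketch only: writes made inside the block are buffered
        # per-thread and published on a clean exit; an exception simply
        # discards them.  A real STM would detect and retry conflicts.
        def __enter__(self):
            _active.writes = {}     # {(obj, attrname): value}
        def __exit__(self, etype, evalue, tb):
            writes = _active.writes
            _active.writes = None
            if etype is None:
                for (obj, name), value in writes.items():
                    object.__setattr__(obj, name, value)   # commit
            return False            # rollback == forgetting the buffer

    class Transactional(object):
        def __setattr__(self, name, value):
            writes = getattr(_active, 'writes', None)
            if writes is None:
                object.__setattr__(self, name, value)
            else:
                writes[(self, name)] = value    # invisible to other threads
        def __getattribute__(self, name):
            writes = getattr(_active, 'writes', None)
            if writes and (self, name) in writes:
                return writes[(self, name)]     # this thread's pending write
            return object.__getattribute__(self, name)

    class Thing(Transactional):
        pass

    obj1 = Thing()
    with atomic():
        obj1.attribute = 1          # published only when the block exits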

Talin <talin@acm.org> wrote:
I'll watch it later today, but you seem to have done a great job of explaining what it is saying. [snip]
Given an inexpensive fork() with certain features (the optional ability to not copy threads, certain file handles, being able to quickly pass data between the parent and child processes, etc.), task-level concurrency is not quite as hard as it is right now. In the same way that one can use a thread pool to handle queued tasks, one could use whole processes to do the same thing, which gets us concurrency in Python. Of course, we run into the issue where processor scheduling needs to get better to handle all of the processes, but that's going to be a requirement for these multi-core processors anyway. Windows doesn't have a .fork() (Cygwin emulates it by using a shared mmap to copy all program state). Sometimes transferring objects between processes is difficult (file handles end up being relatively easy on *nix AND Windows), but there. Etc. - Josiah
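A minimal sketch of the fork-as-task idea on *nix (no error handling, and none of the extra fork features mentioned above):

    import os, cPickle

    def run_in_child(func, *args):
        # Fork a worker process, pipe a pickled result back to the parent.
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                        # child: compute, send, exit
            os.close(r)
            payload = cPickle.dumps(func(*args), 2)
            while payload:
                n = os.write(w, payload)
                payload = payload[n:]
            os._exit(0)
        os.close(w)                         # parent: read until EOF
        chunks = []
        while True:
            chunk = os.read(r, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        os.close(r)
        os.waitpid(pid, 0)
        return cPickle.loads(''.join(chunks))

    print run_in_child(sum, range(10**6))

A pool of such workers fed from a queue gives the process-level analogue of a thread pool.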

On 3/22/07, Talin <talin@acm.org> wrote:
Thanks for the excellent summary. I won't try to summarize Brendan Eich's February blog entry on this topic, because it's short enough and worth reading: http://weblogs.mozillazine.org/roadmap/archives/2007/02/threads_suck.html -j

On 3/22/07, Jason Orendorff <jason.orendorff@gmail.com> wrote:
Right. I saw Sutter give this talk (or a very similar one) in Oxford last April, and I'm thoroughly unconvinced that Python is doomed unless it adds more thread support. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I'm unconvinced, too. Python has always relied mostly on C-level code to get grunty things done efficiently. If the C-level code knows how to take advantage of multiple cores, Python applications will benefit, too. As long as Numeric can make use of the other 127 cores, and OpenGL can run my 512-core GPU at full tilt, I'll be happy. :-) -- Greg

Thinking more about this, it seems to me that discussions of syntax for doing parallel operations and nifty classes for synchronization are a bit premature. The real question, it seems to me, is how to get Python to operate concurrently at all. Python's GIL cannot be removed without going through and fixing a thousand different places where different threads might access shared variables within the interpreter. Clearly, that's not the kind of work that is going to be done in a month or less. It might be useful to decompose that task further into some digestible chunks.

One of these chunks is garbage collection. It seems to me that reference counting as it exists today is a serious hindrance to concurrency, because it requires writing to an object each time you create a new reference. Instead, it should be possible to pass a reference to an object between threads without actually modifying the object unless one of the threads actually changes an attribute. There are a number of papers on concurrent garbage collection out there on the web that might serve as a useful starting point. Of course, the .Net CLR and Java VM already have collectors of this type, so maybe those versions of Python already get this for free.

I also wonder what other things that the GIL is protecting can be broken out as large, coherent chunks. -- Talin

On Sun, Mar 25, 2007, Talin wrote:
Maybe that's what it seems to you; to others of us who have been looking at this problem for a while, the real question is how to get a better multi-process control and IPC library in Python, preferably one that is cross-platform. You can investigate that right now, and you don't even need to discuss it with other people. (Despite my oft-stated fondness for threading, I do recognize the problems with threading, and if there were a way to make processes as simple as threads from a programming standpoint, I'd be much more willing to push processes.) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith

Aahz wrote:
If you mean some sort of inter-process messaging system, there are a number that already exist; I'd look at IPython and py-globus for starters. My feeling is that while such an approach is vastly easier from the standpoint of Python developers, and may be easier from the standpoint of a typical Python programmer, it doesn't actually solve the problem that I'm attempting to address, which is figuring out how to write client-side software that dynamically scales to the number of processors on the system. My view is that while the number of algorithms that we have that can be efficiently parallelized in a fine-grained threading environment is small (compared to the total number of strictly sequential algorithms), the number of algorithms that can be adapted to heavy-weight, coarse-grained processes is much smaller still. For example, it is easy to imagine a quicksort routine where different threads are responsible for sorting various sub-partitions of the array. If this were to be done via processes, the overhead of marshalling and unmarshalling the array elements would completely swamp the benefits of making it concurrent. -- Talin
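The shape of that quicksort is easy to write down; the point is that handing the partitions to another *process* means marshalling them both ways, while a thread shares the list directly (illustration only: under today's GIL the thread adds overhead without parallelism):

    import threading, random

    def psort(data):
        # Structure of a parallel quicksort: sort sub-partitions in
        # separate threads.  A process-based version would have to
        # pickle 'left' and 'right' out and the results back in.
        if len(data) <= 1000:
            return sorted(data)
        pivot = data[0]
        left = [x for x in data[1:] if x < pivot]
        right = [x for x in data[1:] if x >= pivot]
        out = {}
        def sub(key, part):
            out[key] = psort(part)
        t = threading.Thread(target=sub, args=('l', left))
        t.start()
        sub('r', right)             # sort one half in this thread
        t.join()
        return out['l'] + [pivot] + out['r']

    data = [random.random() for _ in xrange(5000)]
    assert psort(data) == sorted(data)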

On Sun, Mar 25, 2007, Talin wrote:
How not? Keep in mind that if this kind of library becomes part of the Python Standard Library, the standard library can be written to use this multi-process library.
Maybe. I'm not convinced, but see below.
The problem, though, is that Threading Doesn't Work for what you're talking about. SMP threading doesn't really scale when you're talking about hundreds of CPUs. This kind of problem really is better handled at the library level: if it's worth splitting, the sort algorithm can figure out how to do that. (Whether it's threading or processes really doesn't matter, the sort algorithm just calls an underlying library to manage it. For example, it could put a lock around the list and the C library releases the GIL to do its work. As long as the overall sort() call was synchronous, it should work.) Generally speaking, it won't be worth splitting for less than a million elements... -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith

Talin <talin@acm.org> wrote:
At some point either the user or the system (Python) needs to figure out that splitting up a sequential task into multiple parallel tasks is productive. On the system end of things, that isn't easy. How much money and time has been poured into C/C++ compiler development, and about all they can auto-parallelize (via vector operations) are things like:

    for (i = 0; i < n; i++)
        a[i] = b[i] OP c[i];

for a restricted set of OP and input types for a, b, and c. Could Python do better than that? Sure, given enough time and research money.
But that algorithm wouldn't be used for sorting data on multiple processors. A variant of mergesort would be used (distribute blocks equally to processors, sort them individually, merge the results - in parallel). But again, all of this relies on two things: 1. a method for executing multiple streams of instructions simultaneously, and 2. a method of communication between the streams of instructions. Without significant work, #1 isn't possible using threads in Python; it is trivial using processes. Without work, #2 isn't "fast" using processes in Python; it is trivial using threads. But here's the thing: with work, #2 can be made fast. Using unix domain sockets (on Linux, 3.4 GHz P4 Xeons, DDR2-PC4200 memory (you can get 50% faster memory nowadays)), I've been able to push 400 megs/second between processes. Maybe anonymous or named pipes, or perhaps a shared mmap with some sort of synchronization, would allow IPC to be cross-platform and just about as fast. The reason that I (and perhaps others) have been pushing for IPC is because it is easier to solve than the removal of Python's threading limitations, with many of the same benefits, and even a few extra (being able to distribute processes across different machines). - Josiah
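A rough sketch of how one might measure that kind of throughput between two processes over a unix domain socket (*nix only; the numbers will vary wildly with hardware):

    import os, socket, time

    def measure_ipc(total=100 * 2**20, chunk=65536):
        # Fork a child that blasts bytes at the parent over a
        # socketpair; report megabytes per second received.
        a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        payload = 'x' * chunk
        pid = os.fork()
        if pid == 0:                        # child: send until done
            a.close()
            sent = 0
            while sent < total:
                sent += b.send(payload)
            b.close()
            os._exit(0)
        b.close()                           # parent: receive and time it
        got, start = 0, time.time()
        while got < total:
            data = a.recv(chunk)
            if not data:
                break
            got += len(data)
        elapsed = time.time() - start
        os.waitpid(pid, 0)
        return got / elapsed / 2**20

    print '%.1f MB/s' % measure_ipc()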

Talin wrote:
Yes, but it may help to have a few possible "end user" examples in mind while you work on the problem. The point isn't the exact syntax, but the uses that they imply and what other requirements these examples need in order to work.
The real question, it seems to me, is how to get Python to operate concurrently at all.
Yes, and I agree that it would be better to split the problem into smaller chunks. Maybe start by finding some "easier" to do simple cases first.
Here is an older discussion on removing the GIL and multiprocessing vs threading. Maybe it will be of help? http://mail.python.org/pipermail/python-dev/2005-September/056423.html
I'm not familiar with the details of Python's garbage collecting yet. I was thinking the problem may be simplified by having a single mutable container-cell object. But that wouldn't be enough, because in Python it isn't just a problem of a mutable object changing, but one of a name's reference to an object being changed. Could an explicit name space for sharing in threads and processes help?
I have no idea, ... Ron

I was thinking about thread killing, and why we think it's okay to kill OS processes but not threads. Killing an OS process is unlikely to cause other processes with which it's communicating to hang, since closing one end of a pipe or socket causes anything on the other end to get EOF on reading or a signal or error on writing. But the way threads usually communicate, using locks and queues, means that it's easy for a thread to get hung up waiting for something that another thread is supposed to do, but won't, because it's been killed. So I'm wondering if we want something for inter-thread communication that works something like a cross between a queue and a pipe. It knows what threads are connected to it, and if all threads on one end exit or get killed, threads on the other end find out about it. We could call it a Quipe or Sockqueue or Quocket (or if we want to be boring, a Channel). This would probably have to be hooked into the threads implementation at a low level, since it would need to be able to detect the death of a thread by any means, without relying on any cooperation from the user's code. -- Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | Carpe post meridiem! | Christchurch, New Zealand | (I'm not a morning person.) | greg.ewing@canterbury.ac.nz +--------------------------------------+
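A sketch of what such a channel might look like built on today's Queue (all names made up; real support would hook thread death at a low level, as Greg says, instead of polling isAlive()):

    import threading, Queue

    class Dead(Exception):
        """All threads on the other end of the channel have exited."""

    class Channel(object):
        # A queue that knows which threads feed it, and raises Dead
        # instead of blocking forever once every registered producer
        # has exited.
        def __init__(self):
            self._q = Queue.Queue()
            self._producers = []

        def connect_producer(self, thread=None):
            self._producers.append(thread or threading.currentThread())

        def put(self, item):
            self._q.put(item)

        def get(self, poll=0.1):
            while True:
                try:
                    return self._q.get(True, poll)
                except Queue.Empty:
                    if not [t for t in self._producers if t.isAlive()]:
                        raise Dead()

    # producer side:  ch.connect_producer(); ch.put(item)
    # consumer side:  try: item = ch.get()
    #                 except Dead: ...all producers have gone away...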

On 3/28/07, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
[Suggestion for a Queue variant that knows when one end is dead] I think the bigger problem is when threads don't restrict themselves to queues, but just use the same memory directly. If a thread dies in the middle of an "atomic" action, other threads will see corrupted memory. If a process dies then, nobody else would be able to see the memory anyhow. What we really need is a Task object that treats shared memory (perhaps with a small list of specified exceptions) as immutable. If you're willing to rely on style guidelines, then you can already get this today. If you want safety, and efficiency ... that may be harder to do as an addon. -jJ

Jim Jewett wrote:
I think a set of all of these as tools would be good. They are all different parts of the same elephant, and I don't see why it needs to be a single unified thing.

* A 'better' task object for easily creating tasks.
    + We have a threading object now. (Needs improving.)
* A message mechanism (que-pipe) for getting the status of a task.
    + In and out message queues for communicating with a thread.
    + A way to wait on task events (messages) nicely.
    + A way for exceptions to propagate out of task objects.
* Shared memory:
    + Prevent names from being rebound.
    + Prevent objects from being altered.

(It isn't just about objects being immutable, but also about names not being rebound to other objects. Unless you want to pass object references for every object to every task? Or you trust that any imported code will play nice.)

Would the following be a good summary of the issues? (Where the terms used mean)

    frozen: object can't be altered while frozen
    locked: name can't be rebound to another object

Threading issues: (in order of difficulty)

1. Pass immutable objects back and forth.
    + Works now.
2. Share immutable objects by-ref.
    + Works now.
3. Pass mutable "deep" copies back and forth.
    ? Works now. (but not for all objects?)
4. Pass frozen mutable objects.
    - Needs freezable/unfreezable mutable objects. (Not the same as making an immutable copy.)
5. Share immutable objects in a shared name space.
    - Needs name locking.
6. Share "frozen" mutable objects in a shared name space.
    - Needs name locking.
    - Needs freezable mutable objects.
7. Pass mutable objects.
    - Has conflicts with shared access.
8. Share mutable objects by-ref.
    - Has conflicts. (same as #7)
9. Share mutable objects in a shared name space.
    - Needs name locking.
    - Has conflicts.

If we can make the first 6 of these work, that may be enough. 7, 8 and 9 have to do with race conditions and other simultaneous data-access issues.

Name locking might work like this: doing 'lock <name>' and 'unlock <name>' could just move the name to and from a locked name space in the same scope. Attempting to rebind a name while it's in a locked name space would raise an exception. The point is, rather than attach a lock to each individual name, it may be much easier, faster, and lighter on memory to have a new name space for this. You could also pass a locked-name-space reference to a task all at once, like we pass locals or globals now. It doesn't identify who locked what, so unlocking a name used by a thread would be in the "if it hurts, don't do that" category. If locking or unlocking a name in outside scopes is disallowed, then knowing who locked what won't be a problem.

Freezing would be some way of preventing an object from being changed. It isn't concerned with the name it is referenced with, and it should not require copying the object. Part of that may be made easier by locking names.(?) Frozen objects may be useful in other ways besides threading.(?) Names locked to immutable objects act like constants, so they may have other uses as well. Cheers, Ron
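As a sketch of how name locking could be prototyped today as an object rather than new syntax (names made up):

    class LockedNamespace(object):
        # Once a name is locked, rebinding it raises; unlocking makes
        # it assignable again.  The objects themselves are untouched -
        # this is locking, not freezing.
        def __init__(self, **bindings):
            self.__dict__['_locked'] = set()
            self.__dict__.update(bindings)

        def lock(self, name):
            self._locked.add(name)

        def unlock(self, name):
            self._locked.discard(name)

        def __setattr__(self, name, value):
            if name in self._locked:
                raise AttributeError("name %r is locked" % name)
            self.__dict__[name] = value

    ns = LockedNamespace(answer=42)
    ns.lock('answer')
    try:
        ns.answer = 43
    except AttributeError, e:
        print e                 # name 'answer' is locked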

On 3/29/07, Ron Adam <rrr@ronadam.com> wrote:
* A 'better' task object for easily creating tasks. + We have a threading object now. (Needs improving.)
But the task isn't in any way restricted. Brett's security sandbox might be a useful starting point, if it is fast enough. Otherwise, we'll probably need to stick with microthreading to get things small enough to contain.
I had thought of the names as being part of a shared dictionary. (Of course, immutable dictionaries aren't really available out-of-the-box now, and I'm not sure I would trust the supposed immutability of anything that wasn't proxied.)
frozen: object can't be altered while frozen locked: name can't be rebound to another object
3. Pass mutable "deep" copies back and forth. ? Works now. (but not for all objects?)
Well, anything that can be deep-copied -- unless you also want the mutations to be collected back into a single location.
And there is where it starts to fall apart. Though if you look at the pypy dict and interpreter optimizations, they have started to deal with it through versioning types. http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi... http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23 -jJ

Jim Jewett wrote:
Not all that different. The immutable dictionary would be an implementation detail of a locked name space, I think. I'm wondering if there might be a way to have an inheritance-by-container relationship, where certain characteristics can be acquired from parent containers. Not exactly the same as class inheritance.
It would need to be able to make a round trip.
I didn't find anything about "versioning" at these links. Did I miss it?
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi...
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23
_Ron

On 4/1/07, Ron Adam <rrr@ronadam.com> wrote:
Jim Jewett wrote:
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi...
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23
Sorry; I wasn't pointing to enough of the document. http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html discusses versions under method caching, just above the Interpreter Optimizations section. -jJ

Jim Jewett wrote:
Thanks, found it. (Been busy with other things.) It may also depend on what abstraction level is desired. A high level of abstraction would hide all of this under the covers and it would be done transparently. A lower level would provide the tools needed to do it with, but would also have the property of "if it hurts, don't do that." Ron

Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I don't know how useful the feature would be (I've not had this particular issue before), but one implementation strategy would be to use thread local storage and weak references to the incoming queue of other threads. Getting queues in the proper thread local storage between two threads is a little more tricky when you want it done automatically, but a couple lines of boilerplate and that's fixed. - Josiah
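A sketch of that strategy, with the only strong reference to each inbox living in the owning thread's local storage and senders holding weak references (assumes thread-local values are released when their thread exits):

    import threading, weakref, Queue

    _tls = threading.local()

    def my_inbox():
        # Each thread lazily creates its inbox; the strong reference
        # lives only in the owning thread's local storage.
        if not hasattr(_tls, 'inbox'):
            _tls.inbox = Queue.Queue()
        return _tls.inbox

    def handle_for(inbox):
        # Senders keep only a weak reference: once the owning thread is
        # gone and its TLS is cleared, send() fails instead of silently
        # queueing messages nobody will ever read.
        ref = weakref.ref(inbox)
        def send(item):
            q = ref()
            if q is None:
                raise RuntimeError('receiving thread has exited')
            q.put(item)
        return send

    # receiver thread:  item = my_inbox().get()
    # sender thread:    send = handle_for(receiver_inbox); send(item)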

Talin wrote:
...
I'm not convinced you wouldn't have to change Python. After dorking around online for years, I've *finally* found someone who put into math-talk my biggest problem with current programming paradigms and how they relate to concurrency: http://alarmingdevelopment.org/index.php?p=5 I don't agree with everything in the post, but this part I do:

"Most programming languages adopt a control flow model of execution, mirroring the hardware, in which there is an execution pointer flowing through the program. The primary reason for this is to permit side-effects to be ordered by the programmer. The problem is that interdependencies between side-effects are naturally a partial order, while control flow establishes a total (linear) order. This means that the actual design exists only in the programmer’s mind. It is up to the programmer to mentally compile (by a topological sort) these implicit dependencies into a total order expressed in a control flow. Whenever the program’s control flow is to be changed, the implicit interdependencies encoded into it must be remembered or guessed at in order to maintain them. Obviously the language should allow the partial order to be explicitly specified, and a compiler should take care of working out an execution schedule for it."

There's an elephant-in-the-living-room UI problem here: how would one go about extracting a partial order from a programmer? A text editor is fine for a total order, but I can't think of how I'd use one non-messily to define a partial order. How about a Gantt chart for a partial order, or some other kind of dependency diagram? How would you make it as easy to use as a text editor? The funny thing is, once you solve this problem, it may even be *easier* to program this way, because rather than maintaining the partial order in your head (or inferring it from a total order in the code), it'd be right in front of you.

There's no reason a program with partial flow control couldn't have very Python-like syntax. After reading this, though, which formalized what I've long felt is the biggest problem with concurrent programming, I'd have to say it'd definitely not be Python itself. For the record, I disagree strongly with the "let's keep concurrency in the libraries" idea. I want to program in *Python*, dangit. Or at least something that feels a lot like it. Neil
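For what it's worth, the execution-schedule half of that idea is easy to sketch in today's Python, with the partial order given explicitly as a dependency dict (a toy illustration of the scheduling, not an answer to the UI problem; assumes the graph is acyclic):

    import threading

    def run_partial_order(tasks, deps):
        # 'tasks' maps name -> callable; 'deps' maps name -> the names it
        # must wait for.  Anything whose prerequisites are met runs at
        # once, concurrently with its peers; the total order is chosen
        # by the scheduler, not encoded by the programmer.
        done = dict((name, threading.Event()) for name in tasks)
        def worker(name):
            for d in deps.get(name, ()):
                done[d].wait()          # block until prerequisites finish
            tasks[name]()
            done[name].set()
        threads = [threading.Thread(target=worker, args=(n,)) for n in tasks]
        for t in threads: t.start()
        for t in threads: t.join()

    steps = {'load_a': lambda: None,    # three stand-in tasks
             'load_b': lambda: None,
             'merge':  lambda: None}
    run_partial_order(steps, {'merge': ['load_a', 'load_b']})
    # load_a and load_b are unordered relative to each other;
    # merge runs only after both.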

Neil Toronto <ntoronto@cs.byu.edu> wrote: (I'm going to rearrange your post so that my reply flows a bit better)
It depends on what operations one wants to support. "Apply this function to all of this data" is easy. To say things like 'line x depends on line x-2, and line x+1 depends on line x-1, and line x+2 depends on line x and x+1, certainly that is not easy. But I question the purpose of being able to offer up that kind of information (in Python specifically). Presumably it is so that those tasks that don't depend on each other could be executed in parallel; but unless you have a method by which parallel execution is fast (or at least faster than just doing it in series), it's not terribly useful (especially if those operations are data structure manipulations that need to be propagated back to the 'main process').
Generally, the standard way of defining a partial order is via dependency graph. Unfortunately, breaking blocks of code into a dependency graph (or partial-order control-flow) tends to make the code hard to understand. I know there are various tools that use this particular kind of method, but those that I have seen leave much to be desired. Alternatively, there is a huge amount of R&D that has gone into C/C++ compilers to extract this information automatically from source code, and even more on the hardware end of things to automatically extract this information from machine code as it executes. Unfortunately, due to Python's dynamic nature, even something as simple as 'i += 0' can lead to all sorts of underlying system changes, and we may not be able to reliably extract this information (though PyPy with the LLVM backend may offer opportunities here).
And leaving concurrency in a library allows Python to stay Python. For certain tasks, one merely needs parallel variants of currently existing Python functions/operations. Take Google's MapReduce [1], which applies a function to a large number of data elements in parallel, then combines the results of those computations. While it is not universal, it can do certain operations quickly. Other tasks merely require the execution of *some function* while *some other function* is executing. Free threading and various ways of allowing concurrent thread execution have been offered, but the more I read about the Processing package, the more I like it. These options don't offer a solution to what you seem to be wanting: an easy definition of partial order on code to be executed in Python. However, without language-level support for something like...

    exec lines in *block in *parallel:
        i += 1
        j += fcn(foo)
        bar = fcn2(bar)

...I don't see how it is possible. Then again, I'm not sure I completely agree with Mr. Edwards or yourself that being able to state partial ordering will offer improvements over the status quo. Then again, I tend not to worry about the blocks of 3-4 lines that aren't dependent on one another as much as the entire function suite returning what I intended it to. - Josiah [1] http://labs.google.com/papers/mapreduce.html
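For illustration, the MapReduce shape itself is simple to sketch with today's tools (a toy version; threads stand in for whatever concurrent execution mechanism - processes, remote workers - is actually available):

    import threading

    def map_reduce(func, combine, data, chunks=4):
        # Apply 'func' to slices of 'data' in parallel tasks, then
        # 'combine' the partial results.
        size = max(1, len(data) // chunks)
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        results = [None] * len(parts)
        def work(i):
            results[i] = [func(x) for x in parts[i]]
        ts = [threading.Thread(target=work, args=(i,))
              for i in range(len(parts))]
        for t in ts: t.start()
        for t in ts: t.join()
        return reduce(combine, [r for part in results for r in part])

    # sum of squares 0..99, computed in four parallel chunks
    print map_reduce(lambda x: x * x, lambda a, b: a + b, range(100))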

Neil Toronto wrote:
There's an elephant-in-the-living-room UI problem, here: how would one go about extracting a partial order from a programmer?
Beat him or her with a stick? Just kidding, of course. ;-) I think what you mean is: how can we make it easier for a programmer to express their intention? One way is to provide a rich set of alternatives in extension modules and let them choose. The ones that work will bubble up to the top, and the hard-to-manage-and-maintain choices will either be improved or forgotten.
Well, you wouldn't want to interleave several tasks' instructions together in any fixed (or otherwise) way. That definitely would not be anything I would want to maintain. Being able to define serial blocks of code that execute in a parallel fashion can make some things easier to express, because it gives you another way to group related code together, instead of having to split up code, or combine unrelated code, because of ordering dependencies. But addressing your partial-order concerns: most likely you will have parallel structures communicating with one another with no apparent predefined order. The ordering could be completely dependent on the data they get and send to each other, and completely dependent on outside events. Think of tasks that can open multiple communication channels to other tasks as needed. What order would these be executed in? Who knows! <shrug> And you may not need to know. It may be that a partial-order execution order could be considered a subset of indeterminate execution order.
I think it would still be python. From what I see Python will continue to be improved on for quite some time. But these ideas must prove themselves to be pythonic before they get put in python. (My spell checker picked out polyphonic for pythonic... PolyPython?)
My guess (if/when this is ever added) is that it will most likely be a combination of some basic supporting enhancements to names and objects, so that they can work with task libraries better, along with one or more task/multi-processing libraries. If it turns out that the use of some of these ideas becomes both frequent and common, then syntax similar to the 'with' statement might find support. But all of this is still quite a ways off, unless some (yet to be identified) overwhelming need pushes it forward. Just my two cents, _Ron

Talin wrote:
A Very nice introduction Talin. I will certainly look at the video tomorrow morning. Thanks. My first thought is it's not quite as bad as it seems, because any third party extensions will be able to use the remaining 127/128th of the power. ;-) You would need to subtract any CPU's used by the OS and other concurrent processes. (These would probably continue to use more resources as well.) I wonder if some of cpu's would be definable for special purposes or not? Maybe 64 of them set aside for simultaneous parallel calculations? There may be a way to express to the OS that a particular process on a data structure be carried out as 'Wide' as possible. ('Narrow' as possible being on a single CPU in a single Thread.) It might be vary nice for these 'wide' structures to have their own API as well. All speculation of course, ;-) Cheers, Ron

The problem is that rigth now python (on my linux box) will only let me have 381 threads. And if we want concurrent programming to be made easy, the user is not supposed to start its programms with "how many threads can i create? 500? ok, so i should partition my app like this". This leads to: a) applications that wont get faster when the user can create more than 500 threads b) or, a lot of complicated logic to partition the software on runtime The method should go the other way around: make all the threads you can think about, if there are enough cores, they will run in parallel. jm2c. Lucio Regards, Lucio.

"Lucio Torre" <lucio.torre@gmail.com> wrote:
But it's not about threads, it is about concurrent execution of code (which threads in Python do not allow). The only way to allow this is to basically attach a re-entrant lock on every single Python object (depending on the platform, perhaps 12 bytes minimum for count, process, thread). The sheer volume of the number of acquire/release cycles during execution is staggering (think about the number of incref/decref operations), and the increase in size of every object by around 12 bytes is not terribly appealing. On the upside, this is possible (change the PyObject_HEAD macro, PyINCREF, PyDECREF, remove the GIL), but the amount of work necessary to actually make it happen is huge, and it would likely result in negative performance until sheer concurrency wins out over the acquire/release overhead. - Josiah

Josiah Carlson wrote:
It seems to me some types of operations are more suited for concurrent operations than others, so maybe new objects that are designed to be naturally usable in this way could help. Or maybe there's a way to lock groups of objects at the same time by having them share a lock if they are related? I imagine there will be some low level C support that could be used transparently, such as copying large areas of memory with multiple CPU's. These may even be the existing C copy functions reimplemented to take advantage of multiple CPU environments so new versions of python may have limited use of this even if no support is explicitly added. Thinking out loud of ways a python program may use concurrent processing: * Applying a single function concurrently over a list. (A more limited function object might make this easier.) * Feeding a single set of arguments concurrently over a list of callables. * Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load) * Listcomps that perform the same calculation on each item may be a natural multi-processing structure. Ron

Ron Adam <rrr@ronadam.com> wrote:
That is a fine-grained vs. coarse-grained locking argument. There is literature. On the other hand, there is also the pi-calculus: http://scienceblogs.com/goodmath/2007/03/an_experiment_with_calculus_an_1.ph... Of course, the pi-calculus is not reasonable unless one starts with a lambda calculus and decides to modify it (parallel lisp?), so it isn't all that applicable here.
These examples are all what are generally referred to as "embarassingly parallel" in literature. One serious issue with programming parallel algorithms generally is that not all algorithms are necessarily parallelizable. Some are, certainly, but not all. The task is to discover those alternate algorithms that *are* parallelizable in such a way to offer gains that are "worth it". Personally, I think that if there were a *cheap* IPC to make cross-process calls not too expensive, many of the examples above that you talk about would be handled easily. - Josiah

Josiah Carlson wrote:
Interesting, but I think you are correct, pi-calculus is probably better suited to a special purpose library. Here's some more complete examples of what I was thinking. From a python programmers point of view. The underlying implementation would need to be worked out of course, either with locks, or messages, or by limiting access to mutable objects in some other way.
(1) xyz = vector(r) forall coords as c: # parallel modify coords 'inplace' with body c = rotate(c, xyz) * 'with' syntax form because 'c' does not outlive the body.
* Feeding a single set of arguments concurrently over a list of callables.
(2) x = 42 result = [None] * len(funcs) forall (funcs, result) as (f, r): r = f(x)
* Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load)
(3) This corresponds nicely to the suggested use of 'future' in the video. def foo(x=42): # Starts when process is created instead of waiting # for f.next() call to start. while 1: future yield x x += 42 f = foo() # start a concurrent process result = f.wait() # waits here for value if it's not ready.
* Listcomps that perform the same calculation on each item may be a natural multi-processing structure.
(4) result = [x = x**2 forall x in values]
Yes, they are obvious, but why not start with the obvious? It's what most users will understand first and it may be the less obvious stuff can be put in terms of the more obvious "embarrassingly parallel" things.
In the video he also talked about avoiding locks. I was thinking a more limited function object (for concurrent uses only) might be used that has: * no global keyword * only access to external immutable objects * no closures. In other words these need to pass 'values' explicitly at call time and return time. (or 'future yield' time) Ron

On 3/22/07, Ron Adam <rrr@ronadam.com> wrote:
On the upside, this is possible (change the PyObject_HEAD macro, PyINCREF, PyDECREF, remove the GIL), but the amount of work
The zillions of details are the same details you have to consider when writing a concurrent database. Having python store its objects externally and retrieve them when needed (vs allocating memory and pointers) would be a huge one-time loss, but it might eventually be worthwhile. PyPy calls that "external database" an "object_space", so the language support is already there, and still will be when the hardware makes it worthwhile.
Why not just keep using xyz = vector(r) rotate(xyz) and let the compiler take care of it? At most, we would want a way of marking objects as "read-only" or callables as "instant", so that the compiler would know it doesn't have to worry about the definition of "len" changing mid-stream. (Ron suggests something similar at the end of the message, and Talin's Transactional metaclass is related.)
* Generators with the semantics of calculating first and waiting on 'yield' for 'next', so the value is immediately returned. (depends on CPU load)
On its own, this just makes response time worse. That said, generators are also callables, and the compiler might do a better job if it knew that it wouldn't have to worry about external redefinitions. There may also be some value in some sort of "idletasks" abstraction that says "hey, go ahead and precompute this, but only if you've got nothing better to do." Garbage collection could certainly benefit from this, and translating (or duplicating) string representations into multiple encodings.
In the video he also talked about avoiding locks. I was thinking a more limited function object (for concurrent uses only) might be used that has:
* no global keyword * only access to external immutable objects
Or at least, it doesn't mutate them itself, and it doesn't promise to use newer versions that get created after the call begins.
* no closures.
This level of information should be useful. But in practice, most of my functions already meet this definition. (In theory, they access builtins that could be replaced, etc.) I wouldn't want to mark them all by hand. I'm still inclined to trust the compiler, and just accept that it may eventually be the PyPy translator rather than the CPython interpreter. -jJ

Jim Jewett wrote:
Just how much should the 'python' compiler do? And what about dynamic modeling where things change according to input? (Not just thinking of graphics.)
Would it be possible to have a container for which it's contents are read only? (as a side effect of being in the container) Then individual items would not need their own read only attributes. A few other thoughts... Possibly ... when a module is first imported or ran nothing needs to be marked read only. Then when execution falls off the end, *everything* existing at that point is marked read only. If the module was run from as a script, then a main function is executed. These could be optional settings that could be turned on... from __far_future__ import multi-processing __multi-processing__ = True __main__ = "my_main" So then there would be an initiation first pass followed by an execution phase. In the initiation phase it's pretty much just the way things are now except you can't use threads. In the execution phase you can use threads, but you can't change anything created in the initiation phase. (Other controls would still be needed of course.)
Yes, you really wouldn't use these for very simple counters. For more complex things, the calling overhead becomes a much smaller percentage of the total, and it would have a bigger effect on smoothing out response time rather than hurting it.
Having a way to set process priority may be good. (or bad)
The difficulty is that even a method call on a global (or parent scope) object can result in it being mutated. So you are either back to using locks and/or transactions on everything.
Mine too. Ron

Josiah Carlson wrote:
A couple of ideas I wanted to explore: -- A base class or perhaps metaclass that makes python objects transactional. This wraps all attribute access so what you see is your transaction's view of the current state of the object. Something like: with atomic(): obj1.attribute = 1 # Other threads can't 'see' the new value until the # transaction commits. value = obj1.attribute 'atomic()' starts a new transaction in your thread, which is stored in thread-local data; Any objects that you mutate become part of the transaction automatically (will have to be careful about built-in mutable objects such as lists). At the end, the transaction either commits, or if there was a conflict, it rolls back and you get to do it over again. This would be useful for large networks of objects, where you want to make large numbers of local changes, where each local change affects an object and perhaps its surrounding objects. What I am describing is very similar to many complex 3D game worlds. ZODB already has a transactional object mechanism, although it's oriented towards database-style transactions and object persistence. -- A way to partition the space of python objects, such that objects in each partition cannot have references outside of the partition without going through some sort of synchronization mechanism, perhaps via some proxy. The idea is to be able to guarantee that shared state is only accessible in certain ways that you can reason about. -- Talin

Talin <talin@acm.org> wrote:
I'll watch it later today, but you seem to have done a great job of explaining what it is saying. [snip]
Given an inexpensive fork() with certain features (the optional ability to not copy threads, certain file handles, be able to quickly pass data between the parent and child processes, etc.), task-level concurrency is not quite as hard as it is right now. In the same way that one can use a threadpool to handle queued tasks, one could use whole processes to do the same thing, which gets us concurrency in Python. Of course we run into the issue where processor scheduling needs to get better to handle all of the processes, but that's going to be a requirement for these multi-core processors anyways. Windows doesn't have a .fork() (cygwin emulates it by using a shared mmap to copy all program state). Sometimes transferring objects between processes is difficult (file handles end up being relatively easy on *nix AND Windows), but there. Etc. - Josiah

On 3/22/07, Talin <talin@acm.org> wrote:
Thanks for the excellent summary. I won't try to summarize Brendan Eich's February blog entry on this topic, because it's short enough and worth reading: http://weblogs.mozillazine.org/roadmap/archives/2007/02/threads_suck.html -j

On 3/22/07, Jason Orendorff <jason.orendorff@gmail.com> wrote:
Right. I saw Sutter give this talk (or a very similar one) in Oxford last April, and I'm thoroughly unconvinced that Python is doomed unless it adds more thread support. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
I'm unconvinced, too. Python has always relied mostly on C-level code to get grunty things done efficiently. If the C-level code knows how to take advantage of multiple cores, Python applications will benefit, too. As long as Numeric can make use of the other 127 cores, and OpenGL can run my 512-core GPU at full tilt, I'll be happy. :-) -- Greg

Thinking more about this, it seems to me that discussions of syntax for doing parallel operations and nifty classes for synchronization are a bit premature. The real question, it seems to me, is how to get Python to operate concurrently at all. Python's GIL cannot be removed without going through and fixing a thousand different places where different threads might access shared variables within the interpreter. Clearly, that's not the kind of work that is going to be done in a month or less. It might be useful to decompose that task further into some digestible chunks.

One of these chunks is garbage collection. It seems to me that reference counting as it exists today is a serious hindrance to concurrency, because it requires writing to an object each time you create a new reference. Instead, it should be possible to pass a reference to an object between threads without actually modifying the object unless one of the threads actually changes an attribute. There are a number of papers on concurrent garbage collection out there on the web that might serve as a useful starting point. Of course, the .Net CLR and Java VM already have collectors of this type, so maybe those versions of Python already get this for free.

I also wonder what other things that the GIL is protecting can be broken out as large, coherent chunks. -- Talin
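The reference-counting point is easy to demonstrate from Python itself: merely handing an object around writes to its refcount field, even though the object is never visibly changed. Every one of those increments is a memory write that two cores would contend over:

    import sys

    x = object()
    base = sys.getrefcount(x)    # getrefcount's own argument adds one

    def look(obj):
        # No mutation here, yet binding the parameter wrote to the
        # object's refcount, as does getrefcount's own argument.
        return sys.getrefcount(obj)

    print(look(x) > base)              # True: the call itself added references
    refs = [x] * 1000                  # 1000 more writes to x's header
    print(sys.getrefcount(x) - base)   # roughly 1000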

On Sun, Mar 25, 2007, Talin wrote:
Maybe that's how it seems to you; to others of us who have been looking at this problem for a while, the real question is how to get a better multi-process control and IPC library in Python, preferably one that is cross-platform. You can investigate that right now, and you don't even need to discuss it with other people. (Despite my oft-stated fondness for threading, I do recognize the problems with threading, and if there were a way to make processes as simple as threads from a programming standpoint, I'd be much more willing to push processes.) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith

Aahz wrote:
If you mean some sort of inter-process messaging system, there are a number that already exist; I'd look at IPython and py-globus for starters. My feeling is that while such an approach is vastly easier from the standpoint of Python developers, and may be easier from the standpoint of a typical Python programmer, it doesn't actually solve the problem that I'm attempting to address, which is figuring out how to write client-side software that dynamically scales to the number of processors on the system. My view is that while the number of algorithms that we have that can be efficiently parallelized in a fine-grained threading environment is small (compared to the total number of strictly sequential algorithms), the number of algorithms that can be adapted to heavy-weight, coarse-grained processes is much smaller still. For example, it is easy to imagine a quicksort routine where different threads are responsible for sorting various sub-partitions of the array. If this were to be done via processes, the overhead of marshalling and unmarshalling the array elements would completely swamp the benefits of making it concurrent. -- Talin
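To make the quicksort example concrete, here is a minimal threaded sketch. Under CPython's GIL it buys nothing; the point is only that the threads share the list in place, which is exactly what a process-based version would have to replicate by marshalling sub-partitions back and forth:

    import threading

    def qsort(a, lo, hi, depth=2):
        if hi - lo <= 1:
            return
        pivot = a[hi - 1]                  # Lomuto partition, in place
        i = lo
        for j in range(lo, hi - 1):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi - 1] = a[hi - 1], a[i]
        if depth > 0:                      # hand one half to another thread
            t = threading.Thread(target=qsort, args=(a, lo, i, depth - 1))
            t.start()
            qsort(a, i + 1, hi, depth - 1)
            t.join()
        else:
            qsort(a, lo, i, 0)
            qsort(a, i + 1, hi, 0)

    data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
    qsort(data, 0, len(data))
    print(data)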

On Sun, Mar 25, 2007, Talin wrote:
How not? Keep in mind that if this kind of library becomes part of the Python Standard Library, the standard library can be written to use this multi-process library.
Maybe. I'm not convinced, but see below.
The problem, though, is that Threading Doesn't Work for what you're talking about. SMP threading doesn't really scale when you're talking about hundreds of CPUs. This kind of problem really is better handled at the library level: if it's worth splitting, the sort algorithm can figure out how to do that. (Whether it's threading or processes really doesn't matter; the sort algorithm just calls an underlying library to manage it. For example, it could put a lock around the list while the C library releases the GIL to do its work. As long as the overall sort() call is synchronous, it should work.) Generally speaking, it won't be worth splitting for less than a million elements... -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith

Talin <talin@acm.org> wrote:
At some point either the user or the system (Python) needs to figure out that splitting up a sequential task into multiple parallel tasks is productive. On the system end of things, that isn't easy. Consider how much money and time has been poured into C/C++ compiler development, and about all they can auto-parallelize (via vector operations) are things like:

    for (i = 0; i < n; i++) a[i] = b[i] OP c[i];

for a restricted set of OP and input types for a, b, and c. Could Python do better than that? Sure, given enough time and research money.
But that algorithm wouldn't be used for sorting data on multiple processors. A variant of mergesort would be used (distribute blocks equally to processors, sort them individually, merge the results - in parallel). But again, all of this relies on two things:

1. a method for executing multiple streams of instructions simultaneously
2. a method of communication between the streams of instructions

Without significant work, #1 isn't possible using threads in Python. It is trivial using processes. Without work, #2 isn't "fast" using processes in Python. It is trivial using threads. But here's the thing: with work, #2 can be made fast. Using unix domain sockets (on Linux, 3.4 GHz P4 Xeons, DDR2-PC4200 memory (you can get 50% faster memory nowadays)), I've been able to push 400 megs/second between processes. Maybe anonymous or named pipes, or perhaps a shared mmap with some sort of synchronization, would allow IPC to be cross-platform and just about as fast. The reason that I (and perhaps others) have been pushing for IPC is that it is easier to solve than the removal of Python's threading limitations, with many of the same benefits, and even a few extras (being able to distribute processes across different machines). - Josiah
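That mergesort variant is easy to sketch with the modern multiprocessing module (a descendant of the Processing package mentioned elsewhere in this thread). The pickling of each block on its way to and from the workers is exactly the marshalling overhead at issue, and the final merge is done serially here for brevity:

    import heapq
    import random
    from multiprocessing import Pool

    def parallel_sort(data, nprocs=4):
        size = max(1, len(data) // nprocs)
        blocks = [data[i:i + size] for i in range(0, len(data), size)]
        with Pool(nprocs) as pool:
            sorted_blocks = pool.map(sorted, blocks)  # sort blocks in parallel
        return list(heapq.merge(*sorted_blocks))      # merge the results

    if __name__ == "__main__":
        data = [random.random() for _ in range(100000)]
        assert parallel_sort(data) == sorted(data)
        print("ok")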

Talin wrote:
Yes, but it may help to have a few possible "end user" examples in mind while you work on the problem. The point isn't the exact syntax, but the uses that they imply and what other requirements these examples need in order to work.
The real question, it seems to me, is how to get Python to operate concurrently at all.
Yes, and I agree that it would be better to split the problem into smaller chunks. Maybe start by finding some "easier" to do simple cases first.
Here is an older discussion on removing the GIL and multiprocessing vs threading. Maybe it will be of help? http://mail.python.org/pipermail/python-dev/2005-September/056423.html
I'm not familiar with the details of Python's garbage collecting yet. I was thinking the problem may be simplified by having a single mutable container-cell object. But that wouldn't be enough, because in Python it isn't just a problem of a mutable object changing, but one of a name's reference to an object being changed. Could an explicit namespace for sharing in threads and processes help?
I have no idea, ... Ron

I was thinking about thread killing, and why we think it's okay to kill OS processes but not threads. Killing an OS process is unlikely to cause other processes with which it's communicating to hang, since closing one end of a pipe or socket causes anything on the other end to get EOF on reading or a signal or error on writing. But the way threads usually communicate, using locks and queues, means that it's easy for a thread to get hung up waiting for something that another thread is supposed to do, but won't, because it's been killed. So I'm wondering if we want something for inter-thread communication that works something like a cross between a queue and a pipe. It knows what threads are connected to it, and if all threads on one end exit or get killed, threads on the other end find out about it. We could call it a Quipe or Sockqueue or Quocket (or if we want to be boring, a Channel). This would probably have to be hooked into the threads implementation at a low level, since it would need to be able to detect the death of a thread by any means, without relying on any cooperation from the user's code.
--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiem!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
greg.ewing@canterbury.ac.nz        +--------------------------------------+
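Without low-level hooks, a crude approximation is already possible by registering the writer threads with the queue and having blocked readers poll their liveness. Channel and BrokenChannel below are invented names for the sketch; a real implementation would detect thread death at a lower level instead of polling:

    import queue
    import threading

    class BrokenChannel(Exception):
        """All writers have exited; nothing more will ever arrive."""

    class Channel:
        def __init__(self):
            self._q = queue.Queue()
            self._writers = []   # register (and start) writers before reading

        def add_writer(self, thread):
            self._writers.append(thread)

        def put(self, item):
            self._q.put(item)

        def get(self, poll=0.1):
            while True:
                try:
                    return self._q.get(timeout=poll)
                except queue.Empty:
                    if not any(t.is_alive() for t in self._writers):
                        raise BrokenChannel()   # the EOF/SIGPIPE analogue

    ch = Channel()
    w = threading.Thread(target=lambda: [ch.put(i) for i in range(3)])
    ch.add_writer(w)
    w.start()
    try:
        while True:
            print(ch.get())
    except BrokenChannel:
        print("writer side closed")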

On 3/28/07, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
[Suggestion for a Queue variant that knows when one end is dead] I think the bigger problem is when threads don't restrict themselves to queues, but just use the same memory directly. If a thread dies in the middle of an "atomic" action, other threads will see corrupted memory. If a process dies at such a point, nobody else would be able to see the memory anyhow. What we really need is a Task object that treats shared memory (perhaps with a small list of specified exceptions) as immutable. If you're willing to rely on style guidelines, then you can already get this today. If you want safety and efficiency ... that may be harder to do as an add-on. -jJ

Jim Jewett wrote:
I think a set of all of these as tools would be good. They are all different parts of the same elephant, and I don't see why it needs to be a single unified thing.

* A 'better' task object for easily creating tasks.
  + We have a threading object now. (Needs improving.)
* A message mechanism (Que-pipe) for getting the status of a task.
  + In and out message queues for communicating with a thread.
  + A way to wait on task events (messages) nicely.
  + A way for exceptions to propagate out of task objects.
* Shared memory.
  + Prevent names from being rebound.
  + Prevent objects from being altered.
    (It isn't just about objects being immutable, but also about names not being rebound to other objects. Unless you want to pass object references for every object to every task? Or you trust that any imported code will play nice.)

Would the following be a good summary of the issues? (Where the terms used mean)

  frozen: object can't be altered while frozen
  locked: name can't be rebound to another object

Threading issues: (in order of difficulty)

1. Pass immutable objects back and forth.
   + Works now.
2. Share immutable objects by-ref.
   + Works now.
3. Pass mutable "deep" copies back and forth.
   ? Works now. (but not for all objects?)
4. Pass frozen mutable objects.
   - Needs freezable/unfreezable mutable objects. (Not the same as making an immutable copy.)
5. Share immutable objects in a shared name space.
   - Needs name locking.
6. Share "frozen" mutable objects in a shared name space.
   - Needs name locking.
   - Needs freezable mutable objects.
7. Pass mutable objects.
   - Has conflicts with shared access.
8. Share mutable objects by-ref.
   - Has conflicts. (Same as #7.)
9. Share mutable objects in a shared name space.
   - Needs name locking.
   - Has conflicts.

If we can make the first 6 of these work, that may be enough. 7, 8 and 9 have to do with race conditions and other simultaneous data access issues.

Name locking might work like this: doing 'lock <name>' and 'unlock <name>' could just move the name to and from a locked name space in the same scope. Attempting to rebind a name while it's in a locked name space would raise an exception. The point is, rather than attach a lock to each individual name, it may be much easier and faster, and save memory, to have a new name space for this. You could also pass a locked-name-space reference to a task all at once, like we pass locals or globals now. It doesn't identify who locked what, so unlocking a name used by a thread would be in the "if it hurts, don't do that" category. If locking or unlocking a name in outside scopes is disallowed, then knowing who locked what won't be a problem.

Freezing would be some way of preventing an object from being changed. It isn't concerned with the name it is referenced by, and it should not require copying the object. Part of that may be made easier by locking names.(?) Frozen objects may be useful in other ways besides threading.(?) Names locked to immutable objects act like constants, so they may have other uses as well. Cheers, Ron
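The name-locking half of this can be prototyped as an object today. A minimal sketch, with lock()/unlock() methods standing in for the proposed 'lock <name>' / 'unlock <name>' statements; note that it blocks rebinding only, while freezing the object itself would be the separate 'frozen' mechanism:

    class LockedNameError(Exception):
        pass

    class Namespace:
        """Names can be moved into a 'locked' set; rebinding them raises."""
        def __init__(self):
            object.__setattr__(self, "_locked", set())

        def lock(self, name):
            self._locked.add(name)

        def unlock(self, name):
            # "if it hurts, don't do that"
            self._locked.discard(name)

        def __setattr__(self, name, value):
            if name in self._locked:
                raise LockedNameError("name %r is locked" % name)
            object.__setattr__(self, name, value)

    ns = Namespace()
    ns.config = {"cores": 128}
    ns.lock("config")
    try:
        ns.config = None               # rebinding is blocked...
    except LockedNameError as exc:
        print(exc)
    ns.config["cores"] = 256           # ...but the dict itself is NOT frozen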

On 3/29/07, Ron Adam <rrr@ronadam.com> wrote:
* A 'better' task object for easily creating tasks.
  + We have a threading object now. (Needs improving.)
But the task isn't in any way restricted. Brett's security sandbox might be a useful starting point, if it is fast enough. Otherwise, we'll probably need to stick with microthreading to get things small enough to contain.
I had thought of the names as being part of a shared dictionary. (Of course, immutable dictionaries aren't really available out-of-the-box now, and I'm not sure I would trust the supposed immutability of anything that wasn't proxied.)
frozen: object can't be altered while frozen
locked: name can't be rebound to another object
3. Pass mutable "deep" copies back and forth.
   ? Works now. (but not for all objects?)
Well, anything that can be deep-copied -- unless you also want the mutations to be collected back into a single location.
And this is where it starts to fall apart. Though if you look at the PyPy dict and interpreter optimizations, they have started to deal with it through versioning types. http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi... http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23 -jJ

Jim Jewett wrote:
Not all that different. The immutable dictionary would be an implementation detail of a locked name space, I think. I'm wondering if there might be a way to have an inheritance-by-container relationship, where certain characteristics can be acquired from parent containers. Not exactly the same as class inheritance.
It would need to be able to make a round trip.
I didn't find anything about "versioning" at these links. Did I miss it?
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi...
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23
_Ron

On 4/1/07, Ron Adam <rrr@ronadam.com> wrote:
Jim Jewett wrote:
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#multi...
http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html#id23
Sorry; I wasn't pointing to enough of the document. http://codespeak.net/pypy/dist/pypy/doc/interpreter-optimizations.html discusses versions under method caching, just above the Interpreter Optimizations section. -jJ

Jim Jewett wrote:
Thanks, found it. (Been busy with other things.) It may also depend on what abstraction level is desired. A high level of abstraction would hide all of this under the covers, and it would be done transparently. A lower level would provide the tools needed to do it with, but would also have the property of "if it hurts, don't do that." Ron

Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I don't know how useful the feature would be (I've not had this particular issue before), but one implementation strategy would be to use thread-local storage and weak references to the incoming queues of other threads. Getting queues into the proper thread-local storage between two threads is a little more tricky when you want it done automatically, but a couple of lines of boilerplate and that's fixed. - Josiah
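A sketch of that strategy, assuming CPython behavior (a thread's local storage is released when it exits, so once only the weak reference is left, the dead reader's queue disappears and senders can notice); all names here are invented:

    import queue
    import threading
    import time
    import weakref

    _local = threading.local()
    _inboxes = {}   # thread name -> weakref to that thread's incoming queue

    def my_inbox():
        if not hasattr(_local, "inbox"):
            _local.inbox = queue.Queue()
            _inboxes[threading.current_thread().name] = weakref.ref(_local.inbox)
        return _local.inbox

    def send(thread_name, item):
        ref = _inboxes.get(thread_name)
        q = ref() if ref is not None else None
        if q is None:
            raise RuntimeError("receiver %r is gone" % thread_name)
        q.put(item)

    def reader():
        print(my_inbox().get())

    t = threading.Thread(target=reader, name="reader")
    t.start()
    while "reader" not in _inboxes:   # wait for the inbox to exist
        time.sleep(0.01)
    send("reader", "hello")
    t.join()
    try:
        send("reader", "goodbye")
    except RuntimeError as exc:
        print(exc)   # in CPython the dead thread's inbox was collected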

Talin wrote:
...
I'm not convinced you wouldn't have to change Python. After dorking around online for years, I've *finally* found someone who put into math-talk my biggest problem with current programming paradigms and how they relate to concurrency: http://alarmingdevelopment.org/index.php?p=5

I don't agree with everything in the post, but this part I do:

"Most programming languages adopt a control flow model of execution, mirroring the hardware, in which there is an execution pointer flowing through the program. The primary reason for this is to permit side-effects to be ordered by the programmer. The problem is that interdependencies between side-effects are naturally a partial order, while control flow establishes a total (linear) order. This means that the actual design exists only in the programmer’s mind. It is up to the programmer to mentally compile (by a topological sort) these implicit dependencies into a total order expressed in a control flow. Whenever the program’s control flow is to be changed, the implicit interdependencies encoded into it must be remembered or guessed at in order to maintain them. Obviously the language should allow the partial order to be explicitly specified, and a compiler should take care of working out an execution schedule for it."

There's an elephant-in-the-living-room UI problem here: how would one go about extracting a partial order from a programmer? A text editor is fine for a total order, but I can't think of how I'd use one non-messily to define a partial order. How about a Gantt chart for a partial order, or some other kind of dependency diagram? How would you make it as easy to use as a text editor? The funny thing is, once you solve this problem, it may even be *easier* to program this way, because rather than maintaining the partial order in your head (or inferring it from a total order in the code), it'd be right in front of you.

There's no reason a program with partial flow control couldn't have very Python-like syntax. But after reading this post, which formalized what I've long felt is the biggest problem with concurrent programming, I'd have to say it definitely wouldn't be Python itself. For the record, I disagree strongly with the "let's keep concurrency in the libraries" idea. I want to program in *Python*, dangit. Or at least something that feels a lot like it. Neil

Neil Toronto <ntoronto@cs.byu.edu> wrote: (I'm going to rearrange your post so that my reply flows a bit better)
It depends on what operations one wants to support. "Apply this function to all of this data" is easy. To say things like 'line x depends on line x-2, and line x+1 depends on line x-1, and line x+2 depends on line x and x+1', certainly that is not easy. But I question the purpose of being able to offer up that kind of information (in Python specifically). Presumably it is so that those tasks that don't depend on each other could be executed in parallel; but unless you have a method by which parallel execution is fast (or at least faster than just doing it in series), it's not terribly useful (especially if those operations are data structure manipulations that need to be propagated back to the 'main process').
Generally, the standard way of defining a partial order is via dependency graph. Unfortunately, breaking blocks of code into a dependency graph (or partial-order control-flow) tends to make the code hard to understand. I know there are various tools that use this particular kind of method, but those that I have seen leave much to be desired. Alternatively, there is a huge amount of R&D that has gone into C/C++ compilers to extract this information automatically from source code, and even more on the hardware end of things to automatically extract this information from machine code as it executes. Unfortunately, due to Python's dynamic nature, even something as simple as 'i += 0' can lead to all sorts of underlying system changes, and we may not be able to reliably extract this information (though PyPy with the LLVM backend may offer opportunities here).
And leaving concurrency in a library allows Python to stay Python. For certain tasks, one merely needs parallel variants of currently existing Python functions/operations. Take Google's MapReduce [1], which applies a function to a large number of data elements in parallel, then combines the results of those computations. While it is not universal, it can do certain operations quickly. Other tasks merely require the execution of *some function* while *some other function* is executing. Free threading and various ways of allowing concurrent thread execution have been offered, but the more I read about the Processing package, the more I like it. These options don't offer a solution to what you seem to be wanting: an easy definition of partial order on code to be executed in Python. However, without language-level support for something like...

    exec lines in *block in *parallel:
        i += 1
        j += fcn(foo)
        bar = fcn2(bar)

...I don't see how it is possible. Then again, I'm not sure I completely agree with Mr. Edwards or yourself that being able to state partial ordering will offer improvements over the status quo. I tend not to worry about the blocks of 3-4 lines that aren't dependent on one another as much as about the entire function suite returning what I intended it to. - Josiah

[1] http://labs.google.com/papers/mapreduce.html
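For what it's worth, the dependency-graph idea can at least be played with at the library level today. A toy scheduler, assuming a valid DAG of tasks (all names invented): anything whose prerequisites are satisfied is submitted to a thread pool, so independent tasks may run concurrently without the programmer serializing them in source order:

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def run_partial_order(tasks, deps, workers=4):
        """tasks: name -> callable; deps: name -> set of prerequisite names."""
        done, running = set(), {}
        with ThreadPoolExecutor(workers) as pool:
            while len(done) < len(tasks):
                for name in tasks:           # submit everything now runnable
                    if (name not in done and name not in running
                            and deps[name] <= done):
                        running[name] = pool.submit(tasks[name])
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for name in [n for n, f in running.items() if f in finished]:
                    running.pop(name).result()   # propagate any exception
                    done.add(name)

    order = []
    tasks = {n: (lambda n=n: order.append(n)) for n in "abcd"}
    deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    run_partial_order(tasks, deps)
    print(order)   # 'a' first, 'd' last; 'b' and 'c' in either order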

Neil Toronto wrote:
There's an elephant-in-the-living-room UI problem, here: how would one go about extracting a partial order from a programmer?
Beat him or her with a stick? Just kidding of course. ;-) I think what you mean is: how can we make it easier for a programmer to express their intention? One way is to provide a rich set of alternatives in extension modules and let them choose. The ones that work will bubble up to the top, and the hard-to-manage-and-maintain choices will either be improved or forgotten.
Well, you wouldn't want to interleave several tasks' instructions together in any fixed (or otherwise) way. That definitely would not be anything I would want to maintain. Being able to define serial blocks of code to execute in a parallel fashion can make some things easier to express, because it gives you another way to group related code together, instead of having to split up code, or combine unrelated code, because of ordering dependencies. But addressing your partial-order concerns: most likely you will have parallel structures communicating with one another with no apparent predefined order. The ordering could be completely dependent on the data they get and send to each other, and completely dependent on outside events. Think of tasks that can open multiple communication channels to other tasks as needed. What order would these be executed in? Who knows! <shrug> And you may not need to know. It may be that a partial-order execution order could be considered a subset of indeterminate execution order.
I think it would still be Python. From what I see, Python will continue to be improved for quite some time. But these ideas must prove themselves to be pythonic before they get put in Python. (My spell checker picked out polyphonic for pythonic... PolyPython?)
My guess is that (if/when this is ever added) it will most likely be a combination of some basic supporting enhancements to names and objects, so that they can work with task libraries better, along with one or more task/multi-processing libraries. If it turns out that the use of some of these ideas becomes both frequent and common, then syntax similar to the 'with' statement might find support. But all of this is still quite a ways off unless some (yet to be identified) overwhelming need pushes it forward. Just my two cents, _Ron
participants (10): Aahz, Greg Ewing, Guido van Rossum, Jason Orendorff, Jim Jewett, Josiah Carlson, Lucio Torre, Neil Toronto, Ron Adam, Talin