Exploration PEP : Concurrency for moderately massive (4 to 32 cores) multi-core architectures

PEP: xxxxxxxx
Title: Concurrency for moderately massive (4 to 32 cores) multi-core architectures
Version: $Revision$
Last-Modified: $Date$
Author: Krishna Sankar <ksankar (at) doubleclix.net>
Status: Wandering! (as in "Not all those who wander are lost ..." - J.R.R. Tolkien)
Type: Process
Content-Type: text/x-rst
Created: 15-Sep-2007

Abstract
--------

This proposal aims at leveraging multi-core capability as an embedded mechanism in Python. The question is not whether Python is slow or fast, but one of performance and control of parallelism/concurrency in a moderately massive parallelism world. The aim is 4 to 32 cores. The proposal advocates two mechanisms - one for task parallelism and another for data-intensive parallelism. Scientific computing and web 2.0 frameworks are the forefront users for this proposal; other applications would benefit as well.

Rationale
---------

Multi-core architectures need no introduction and their ubiquity is evident. It is imperative that Python have one or more standard ways of leveraging multi-core architectures. OTOH, traditional thread-based concurrency and lock-based exclusion are becoming more and more difficult to program correctly.

First of all, the question is not whether Python is slow or fast, but the performance of a system written in Python. That means the ability to leverage multi-core architectures as well as to control them: the ability to pin one process/task to a core, to pin one or more homogeneous tasks to specific cores, to avoid waiting on a global lock, and similar primitives. (Before anybody jumps to a conclusion, this is not about the GIL by any means ;o))

Second, it is clear that we need a good solution (not THE solution) for moderately massive parallelism on multi-core architectures (i.e. 8-32 cores). Share-nothing might not be optimal; we need some form of memory sharing, not just copying all data via messages. Maybe functional programming based on the blackboard pattern would work; who knows. I have seen saturated systems showing only ~25% CPU utilization (on a 4-core machine!) because we did not leverage multiple cores and parallelism. So while py3k will not be slow, the lack of a cohesive multi-core strategy will show up in system performance and byte us later (pun intended!).

At least in my mind, this is not an exercise in exposing locks, mutexes or threads in Python. I do believe that the GIL will be refactored to finer granularity in the coming months (similar to the global locks in Linux) and most probably we will get microthreads et al.

As we all know, architecture is constraining as well as liberating. Language primitives greatly influence how we think about a problem. In the discussions, Guido is right in insisting on speed, and Bruce is right in asking for language constructs. Without pragmatic speed, folks won't use it; the same is the case without the required constructs. Both are barriers to adoption. We have an opportunity to offer a solution for multi-core architectures; let us seize it - we will rush in where angels fear to tread!

Programming Models
------------------

There are at least three possible paradigms:

A. Conventional threading model
B. Functional model, Erlang being the most appropriate
C. Some form of limited shared-memory model (message passing, but passing pointers; blackboard model)
D. Others, like transactional memory [2]

There is enough literature out there, so I do not plan to explain these here. (<KS> Do we need more explanation? </KS>)

Pragmatic proposal
------------------

May I suggest we embed two primitives in Python 3K:

A) A functional-style, share-nothing set of interfaces (and implementations thereof) - provides the task parallelism/concurrency capability; "small messages, big computations" as Joe Armstrong calls it [3] (a minimal sketch of this style follows this message)
B) A limited shared-memory-based model for data-intensive parallelism

Most probably this would be part of the stdlib. While Guido is almost right in saying that this is a (std)library problem, it is not fully so. We would need a few primitives from the underlying PVM substrate. Possibly one reason for Guido's position is the lack of clarity as to what needs to be changed and why. IMHO, just saying "take the GIL off" does not solve the problem either.

The Zen of Python parallelism
-----------------------------

I draw inspiration from the very timely article by James Reinders in DDJ [1]. It embodies what we should be doing, viz.:

1. Refactor the problem into parallel tasks. We cannot help it if the domain is sequential.
2. Program to abstractions; program chores, not cores. Writing correct programs using raw threads et al. is difficult. Let the underlying substrate decide how best to optimize.
3. Design for scale.
4. Have an option to turn concurrency off, for debugging.
5. Declarative parallelism-based mechanisms (?)

Related Efforts
---------------

The good news is there are at least two or three paradigms with implementations and rough benchmarks. Hopefully we can leverage the implementations and mature them into the stdlib (with the required primitives in the PVM):

* Parallel Python: http://www.artima.com/weblogs/viewpost.jsp?thread=214303, http://cheeseshop.python.org/pypi/parallel
* Processing: http://cheeseshop.python.org/pypi/processing
* papyros: http://code.google.com/p/papyros/

Discussions
-----------

There are at least three thread sets (pardon the pun!) I am aware of:

1. The GIL discussions in python-dev and Guido's blog on the GIL: http://www.artima.com/weblogs/viewpost.jsp?thread=214235
2. The py3k topics started by Bruce (http://www.artima.com/weblogs/viewpost.jsp?thread=214112), the response by Guido (http://www.artima.com/weblogs/viewpost.jsp?thread=214325) and Bruce's reply to the reply (http://www.artima.com/weblogs/viewpost.jsp?thread=214480)
3. Python and concurrency: http://mail.python.org/pipermail/python-ideas/2007-March/000338.html

References
----------

[1] James Reinders, DDJ: http://www.ddj.com/architect/201804248
[2] Transactional memory: http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=444
[3] Joe Armstrong, Programming Erlang
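To make primitive (A) a bit more concrete, here is a minimal sketch of the "small messages, big computations" style, written against the stdlib multiprocessing module (the descendant of the processing package listed under Related Efforts). This is only an illustration of the programming style under those assumptions, not the interface the PEP would actually standardize.

```python
# Hedged sketch: a share-nothing, message-passing worker pool.  Workers own
# their state and communicate only by copying small messages through queues.
from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # Each worker only ever sees copies of the messages it receives.
    while True:
        msg = inbox.get()
        if msg is None:                 # sentinel: shut this worker down
            break
        task_id, n = msg
        outbox.put((task_id, sum(i * i for i in range(n))))   # "big computation"

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    workers = [Process(target=worker, args=(inbox, outbox)) for _ in range(4)]
    for w in workers:
        w.start()
    for task_id in range(8):
        inbox.put((task_id, 1_000_000))   # "small message"
    for _ in workers:
        inbox.put(None)                   # one sentinel per worker
    results = dict(outbox.get() for _ in range(8))
    for w in workers:
        w.join()
    print(results[0])
```

Any message-passing substrate could provide the same shape; the queues and sentinels here are just the simplest way to show it with what exists today.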

Folks,

For some reason (fat fingers ;o() I missed the introduction to the proposal. Here is the full mail (pardon me for the spam):

As a follow-up to the py3k discussions started by Bruce and Guido, I pinged Brett and he suggested I submit an exploratory proposal. I would appreciate insights and wisdom - the good, the bad and the ugly:

A) Does it make sense?
B) Which application sets should we consider in designing the interfaces and implementations?
C) In this proposal, parallelism and concurrency are used interchangeably. Thoughts?
D) Please suggest pertinent links, discussions and insights.
E) I have kept the proposal to a minimum to start the discussion and to explore whether this is the right thing to do. Collaboratively, as we zero in on one or two approaches, the idea is to expand it into a crisp and clear PEP. It needs some more formatting as well.

------------------------------------------------------------------------------------------------------------

PEP: xxxxxxxx
Title: Concurrency for moderately massive (4 to 32 cores) multi-core architectures
Version: $Revision$
Last-Modified: $Date$
Author: Krishna Sankar <ksankar (at) doubleclix.net>
Status: Wandering! (as in "Not all those who wander are lost ..." - J.R.R. Tolkien)
Type: Process
Content-Type: text/x-rst
Created: 15-Sep-2007

Abstract
--------

This proposal aims at leveraging multi-core capability as an embedded mechanism in Python. The question is not whether Python is slow or fast, but one of performance and control of parallelism/concurrency in a moderately massive parallelism world. The aim is 4 to 32 cores. The proposal advocates two mechanisms - one for task parallelism and another for data-intensive parallelism. Scientific computing and web 2.0 frameworks are the forefront users for this proposal; other applications would benefit as well.

Rationale
---------

Multi-core architectures need no introduction and their ubiquity is evident. It is imperative that Python have one or more standard ways of leveraging multi-core architectures. OTOH, traditional thread-based concurrency and lock-based exclusion are becoming more and more difficult to program correctly.

First of all, the question is not whether Python is slow or fast, but the performance of a system written in Python. That means the ability to leverage multi-core architectures as well as to control them: the ability to pin one process/task to a core, to pin one or more homogeneous tasks to specific cores, to avoid waiting on a global lock, and similar primitives. (Before anybody jumps to a conclusion, this is not about the GIL by any means ;o))

Second, it is clear that we need a good solution (not THE solution) for moderately massive parallelism on multi-core architectures (i.e. 8-32 cores). Share-nothing might not be optimal; we need some form of memory sharing, not just copying all data via messages. Maybe functional programming based on the blackboard pattern would work; who knows. I have seen saturated systems showing only ~25% CPU utilization (on a 4-core machine!) because we did not leverage multiple cores and parallelism. So while py3k will not be slow, the lack of a cohesive multi-core strategy will show up in system performance and byte us later (pun intended!).

At least in my mind, this is not an exercise in exposing locks, mutexes or threads in Python. I do believe that the GIL will be refactored to finer granularity in the coming months (similar to the global locks in Linux) and most probably we will get microthreads et al.

As we all know, architecture is constraining as well as liberating. Language primitives greatly influence how we think about a problem. In the discussions, Guido is right in insisting on speed, and Bruce is right in asking for language constructs. Without pragmatic speed, folks won't use it; the same is the case without the required constructs. Both are barriers to adoption. We have an opportunity to offer a solution for multi-core architectures; let us seize it - we will rush in where angels fear to tread!

Programming Models
------------------

There are at least three possible paradigms:

A. Conventional threading model
B. Functional model, Erlang being the most appropriate
C. Some form of limited shared-memory model (message passing, but passing pointers; blackboard model)
D. Others, like transactional memory [2]

There is enough literature out there, so I do not plan to explain these here. (<KS> Do we need more explanation? </KS>)

Pragmatic proposal
------------------

May I suggest we embed two primitives in Python 3K:

A) A functional-style, share-nothing set of interfaces (and implementations thereof) - provides the task parallelism/concurrency capability; "small messages, big computations" as Joe Armstrong calls it [3]
B) A limited shared-memory-based model for data-intensive parallelism (a minimal sketch of this model follows this message)

Most probably this would be part of the stdlib. While Guido is almost right in saying that this is a (std)library problem, it is not fully so. We would need a few primitives from the underlying PVM substrate. Possibly one reason for Guido's position is the lack of clarity as to what needs to be changed and why. IMHO, just saying "take the GIL off" does not solve the problem either.

The Zen of Python parallelism
-----------------------------

I draw inspiration from the very timely article by James Reinders in DDJ [1]. It embodies what we should be doing, viz.:

1. Refactor the problem into parallel tasks. We cannot help it if the domain is sequential.
2. Program to abstractions; program chores, not cores. Writing correct programs using raw threads et al. is difficult. Let the underlying substrate decide how best to optimize.
3. Design for scale.
4. Have an option to turn concurrency off, for debugging.
5. Declarative parallelism-based mechanisms (?)

Related Efforts
---------------

The good news is there are at least two or three paradigms with implementations and rough benchmarks:

* Parallel Python: http://www.artima.com/weblogs/viewpost.jsp?thread=214303, http://cheeseshop.python.org/pypi/parallel
* Processing: http://cheeseshop.python.org/pypi/processing
* papyros: http://code.google.com/p/papyros/

Discussions
-----------

There are at least three thread sets (pardon the pun!) I am aware of:

1. The GIL discussions in python-dev and Guido's blog on the GIL: http://www.artima.com/weblogs/viewpost.jsp?thread=214235
2. The py3k topics started by Bruce (http://www.artima.com/weblogs/viewpost.jsp?thread=214112), the response by Guido (http://www.artima.com/weblogs/viewpost.jsp?thread=214325) and Bruce's reply to the reply (http://www.artima.com/weblogs/viewpost.jsp?thread=214480)
3. Python and concurrency: http://mail.python.org/pipermail/python-ideas/2007-March/000338.html

References
----------

[1] James Reinders, DDJ: http://www.ddj.com/architect/201804248
[2] Transactional memory: http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=444
[3] Joe Armstrong, Programming Erlang
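And a minimal sketch of the flavour of primitive (B): bulk data lives in one shared region and workers operate on disjoint slices of it, instead of shipping the data around in messages. multiprocessing.Array is used here only as a stand-in for whatever limited sharing primitive the PVM would actually expose; it is not the proposed API.

```python
# Hedged sketch: data-intensive parallelism over a limited shared-memory
# region.  The shared buffer is created once and never copied; each worker
# updates its own disjoint [start, stop) slice, so no per-element locking is
# needed for these updates.
from multiprocessing import Array, Process

def scale_slice(shared, start, stop, factor):
    for i in range(start, stop):
        shared[i] *= factor

if __name__ == "__main__":
    n, nworkers = 400_000, 4
    data = Array('d', range(n), lock=False)    # one shared buffer for all workers
    chunk = n // nworkers                      # n is divisible by nworkers here
    procs = [Process(target=scale_slice, args=(data, w * chunk, (w + 1) * chunk, 2.0))
             for w in range(nworkers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(data[0], data[n - 1])                # 0.0 and 2.0 * (n - 1)
```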

On 9/16/07, Krishna Sankar <ksankar@doubleclix.net> wrote:
I've been exploring this problem for a while so I've got some pretty strong opinions. I guess we'll find out if my ideas pass muster. :)
I'm not sure just what "data intensive" means. I know some of it is basically a variant on vectorization, but I think that can be done much more easily in a library than in the language proper. You're also missing distributed parallelism. There's a large domain in which you want failures to bring down only a single node, you're willing to sacrifice shared state and consensus, and you're willing to put in the extra effort to make it work. That's not ideal for the core of the language (where threading can do far better), but it's important to keep in mind how they play off each other, and which use cases are better met by either.
I'm not sure how relevant processor affinity (or prioritization, for that matter) is. I suspect only around 5% of users (if that) will really need it. It seems you'd need some deep cooperation with the OS to be really successful at it, and the support isn't there today. Of course support will likely improve as manycore becomes common.
I think share-nothing is clearly unacceptable: things like audio processing, video processing, gaming, or GUIs need shared state. If you don't support sharing directly you'll end up with a second-class form of objects, where the true content is shared indirectly and a handle gets copied around repeatedly. You lose a great deal of dynamism doing that.
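For what it's worth, the OS-level hook involved does exist on Linux, and modern CPython eventually grew a wrapper for it; a minimal sketch of pinning (Linux-only, and the call post-dates this discussion, so it is shown purely to illustrate the kind of OS cooperation meant):

```python
# Hedged sketch: restricting the calling process to specific cores.
# os.sched_setaffinity is Linux-only and is not being proposed as the
# interface here; it just shows what "deep cooperation with the OS" looks like.
import os

pid = 0                                    # 0 means "the calling process"
print("allowed cores before:", os.sched_getaffinity(pid))
os.sched_setaffinity(pid, {0, 1})          # pin this process to cores 0 and 1
print("allowed cores after: ", os.sched_getaffinity(pid))
```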
This hints at a major compromise we can make. So long as the semantics are unchanged, we can offer a compile-time option to enable scalable threading, even though it may have a fairly significant amount of overhead. For 1 to 4 cores you'll still want high single-thread performance, but by the time you've got 8 cores it may be an easy decision to switch to a scalable version instead. Packagers could make this as easy as installing a different core package.
I've already got a patch/fork with the GIL refactored/removed, so I agree that it'll change. ;) I highly doubt we'll get any real microthreads though: CPython is built on C, and getting away from the C stack (and thus C's threads) is impractical. I agree it's about primitives though.
I'm not sure where my model fits in. What I do is take all the existing Python objects and give them a shareable/non-shareable property. If an object has some explicit semantics (such as a thread-safe queue or an immutable int) then it's shareable; otherwise it's non-shareable. All the communication mechanisms (queues, arguments to spawned threads, etc.) check this property, so it becomes impossible to corrupt memory.
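A toy rendering of that rule, just to pin down the idea (the names below are invented for this sketch; they are not the API of the actual patch or of any existing library): immutable objects and self-synchronizing containers pass the check, everything else is refused at the communication boundary.

```python
# Toy illustration of the shareable/non-shareable check described above.
import queue

class SafeQueue(queue.Queue):
    """A channel that refuses to transport non-shareable objects."""
    def put(self, item, block=True, timeout=None):
        if not is_shareable(item):
            raise TypeError(f"{type(item).__name__} objects are not shareable")
        super().put(item, block, timeout)

# In this toy model an object is shareable if its semantics make concurrent
# use safe: immutables, or containers (like SafeQueue) that do their own locking.
SHAREABLE_TYPES = (int, float, str, bytes, frozenset, type(None), SafeQueue)

def is_shareable(obj):
    if isinstance(obj, tuple):
        return all(is_shareable(item) for item in obj)
    return isinstance(obj, SHAREABLE_TYPES)

q = SafeQueue()
q.put((1, "ok"))                    # fine: immutable contents
try:
    q.put([1, 2, 3])                # a plain list is mutable and unsynchronized
except TypeError as exc:
    print("rejected:", exc)
```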
I agree that it's *mostly* a stdlib problem. The breadth of the useful tools will simply be part of the library. There are special cases where language modifications are needed though.
Point 4 is made moot by better debuggers. I think that's more practical in the long run. Really, if you have producer and consumer threads, you *can't* flick a switch to make them serial.
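To make that concrete, here is a toy producer/consumer pair over a bounded queue: run concurrently it drains fine, while "switching concurrency off" by calling the two functions one after the other just blocks on the full queue, so the program's behaviour changes rather than merely its interleaving.

```python
# Toy illustration of why producer/consumer code cannot simply be run serially.
import queue
import threading

channel = queue.Queue(maxsize=2)          # bounded, as real pipelines usually are

def producer():
    for i in range(5):
        channel.put(i)                    # run alone, this blocks at i == 2

def consumer():
    for _ in range(5):
        print("got", channel.get())

# Concurrent version: producer and consumer overlap, everything drains.
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

# "Serial" version: calling producer() and then consumer() never finishes,
# because the third put() blocks with nobody there to get().
```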
I'd like to add a list of practical requirements a design must meet:

* It must be composable with traditional single-threaded programs or libraries. Small changes are acceptable; complete redesigns are not.
* It must be largely compatible with existing CPython code and extensions. The threading APIs will likely change, but replacing Py_INCREF/Py_DECREF with something else is too much.
* It must be useful for a broad set of local problems, without becoming burdened with speciality features. We don't need direct support for distributed computing.
* It needs to be easy, reliable, and robust. Uncaught exceptions should gracefully abort the entire program with a stack trace, deadlocks should be detected and broken, and corruption should be impossible.

Open for debate:

* How much compatibility should be retained with existing concurrency mechanisms? If we're trying to propose something better then we obviously want to replace them, not add "yet another library", but transition is important too. (I mean this question to broadly apply to event-driven and threaded libraries alike.)

--
Adam Olsen, aka Rhamphoryncus
