pre-emptive micro-threads utilizing shared memory message passing?
Might as well warn you: this is going to be a rather long post. I'm not sure if this is appropriate to post here or if it would fit right in with the mailing list; sorry if it is the wrong place to post about this. I've looked through the documentation (http://codespeak.net/pypy/dist/pypy/doc/stackless.html) and didn't really see what I was looking for. I've also investigated several options in the default CPython.

What I'm trying to accomplish: I am trying to write a particular threading scenario that follows these rules. It is partly an experiment and partly for actual production code.

1. Hundreds or thousands of micro-threads that are essentially small self-contained programs (not really, but you can think of them that way).
2. No shared state - data is passed around from one micro-thread to another; only one micro-thread has access to the data at a time. (Although the programmer gets the impression there is no shared state, in reality the underlying implementation uses shared memory / shared state for speed; the data does not move, you just pass around a reference/pointer to some shared memory.)
3. The micro-threads can run in parallel on different CPU cores, get moved to a different core, etc.
4. The micro-threads are truly pre-emptive (using hardware interrupt pre-emption).
5. It is my intention to write my own scheduler that will suspend the micro-threads, start them, control the sharing of data, assign them to different CPU cores, etc. In fact, for my purposes, I MUST write my own scheduler, as I have very specific requirements on when they should and should not run.

Now, I have spent some time trying to find a way to achieve this ... and I can implement a rather poor version using default Python. However, I don't see any way to implement my ideal version. Maybe someone here might have some pointers for me.

Shared Memory between parallel processes
----------------------------------------
Quick Question: Do queues from the multiprocessing module use shared memory? If the answer is YES, you can just skip this section, because that would solve this particular problem.

(For simplicity, let's assume a quad core CPU.) It is my intent to create 4 threads/processes (one per core) and use the scheduler to assign a micro-thread (of which there may be hundreds) to one of the 4 threads/processes. However, the micro-threads need to exchange data quickly; to do that I need shared memory -- and that is where I'm having some trouble. Normally, 4 threads would be the ideal solution -- as they can run in parallel and use shared memory. However, because of the Python GIL, I can't use threads in this way; thus, I have to use 4 processes, which are not set up to share memory.

Question: How can I share Python objects between processes USING SHARED MEMORY? I do not want to have to copy or "pass" data back and forth between processes or have to use a proxy "server" process. These are both too much of a performance hit for my needs; shared memory is what I need.

The multiprocessing module offers me 4 options: "queues", "pipes", "shared memory map", and a "server process". "Shared memory map" won't work as it only handles C values and arrays (not Python objects or variables). "Server process" sounds like a bad idea. Am I correct in that this option requires extra processing power and does not even use shared memory? If so, that would be a very bad choice for me. The big question then... do "queues" and "pipes" use shared memory, or do they pass data back and forth between processes? (If they use shared memory, then that would be perfect.)

Does PyPy have any other options for me?
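As a quick reference for the four options, here is a minimal sketch (the worker layout and values are arbitrary); as far as I understand the CPython implementation, Queue and Pipe pickle and copy data between processes, Array is real shared memory but for C types only, and Manager routes everything through a separate server process:

```python
import multiprocessing as mp
from multiprocessing import Queue, Pipe, Array, Manager

def worker(q, conn, arr, d):
    arr[0] = 99                    # Array: real shared memory, but C types only
    d["key"] = "value"             # Manager dict: proxied through a server process
    q.put({"result": 42})          # Queue: the dict is pickled and sent over a pipe
    conn.send([1, 2, 3])           # Pipe: also pickles and copies the data

if __name__ == "__main__":
    q = Queue()
    parent_conn, child_conn = Pipe()
    arr = Array("i", 10)           # ten C ints in shared memory
    manager = Manager()
    d = manager.dict()

    p = mp.Process(target=worker, args=(q, child_conn, arr, d))
    p.start()
    print(q.get())                 # {'result': 42}, copied between processes
    print(parent_conn.recv())      # [1, 2, 3], also copied
    p.join()
    print(arr[0], dict(d))         # 99 {'key': 'value'}
```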
True Pre-emptive scheduling?
----------------------------
Any way to get pre-emptive micro-threads? Stackless (the real Stackless, not the one in PyPy) has the ability to suspend them after a certain number of interpreter instructions; however, this is prone to problems because a micro-thread can run much longer than expected. Ideally, I would like to have true pre-emptive scheduling using hardware interrupts based on timing or CPU cycles (like the OS does for real threads). I am currently not aware of any way to achieve this in CPython, PyPy, Unladen Swallow, Stackless, etc.

Are there detailed docs on why the Python GIL exists?
-----------------------------------------------------
I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it". It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?

Thanks,
Kevin
On 07/26 22:09, Kevin Ar18 wrote:
What I'm trying to accomplish:
I am trying to write a particular threading scenario that follows these rules. It is partly an experiment and partly for actual production code.
This is actually interesting to me as well. I can't count the number of times I've had to implement something like this for projects. It would be nice to be able to use a public module instead of writing it all yet again.
Now, I have spent some time trying to find a way to achieve this ... and I can implement a rather poor version using default Python. However, I don't see any way to implement my ideal version. Maybe someone here might have some pointers for me.
Shared Memory between parallel processes
This is the way I usually implement it. I'm currently mulling over some sort of byte-addressable abstraction that can use a buffer or any sequence as a backing store, which would make it useful for mmap objects as well. And I'm thinking about using the class definitions and inheritance to handle nested structures in some way.
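A rough sketch of that kind of abstraction, assuming ctypes structures layered over an mmap backing store (the field names are invented); it already gives class-based, nestable layouts over a byte-addressable buffer:

```python
import ctypes
import mmap

class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

class Record(ctypes.Structure):                   # nested structures nest naturally
    _fields_ = [("id", ctypes.c_int), ("pos", Point)]

backing = mmap.mmap(-1, ctypes.sizeof(Record))    # anonymous mapping as backing store
rec = Record.from_buffer(backing)                 # a view over the buffer, no copying
rec.id = 7
rec.pos.x = 1.5
print(rec.id, rec.pos.x)
```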
Quick Question: Do queues from the multiprocessing module use shared memory? If the answer is YES, you can just skip this section, because that would solve this particular problem.
I can't imagine it wouldn't, but I haven't checked the source yet.
Question: How can I share Python Objects between processes USING SHARED MEMORY? I do not want to have to copy or "pass" data back and forth between processes or have to use a proxy "server" process. These are both too much of a performance hit for my needs; shared memory is what I need.
Anonymous memory-mapped regions would work, with a suitable data abstraction. Or even memory-mapped files, which aren't really all that different on systems anymore.
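A minimal sketch of sharing an anonymous memory-mapped region across fork(), assuming a POSIX system and an arbitrary 8-byte layout:

```python
import mmap, os, struct

buf = mmap.mmap(-1, 4096)                # anonymous, shared: inherited across fork
pid = os.fork()
if pid == 0:                             # child: write into the shared region
    buf[0:8] = struct.pack("q", 12345)
    os._exit(0)
os.waitpid(pid, 0)
print(struct.unpack("q", buf[0:8])[0])   # parent reads 12345 written by the child
```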
The multiprocessing module offers me 4 options: "queues", "pipes", "shared memory map", and a "server process". "Shared memory map" won't work as it only handles C values and arrays (not Python objects or variables).
cPickle could help. But then there's a serialization/deserialization step which wouldn't really be too fast. It's not slow, but the cost of copying the data is far outweighed by the cost of the dumps/loads, and if you need to share multiple copies you're really going to feel it.
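Roughly, the pickling route looks like this sketch (the buffer size and framing are arbitrary): the bytes land in a shared buffer, but every hand-off still pays for dumps() and loads():

```python
import mmap, pickle, struct

region = mmap.mmap(-1, 1 << 20)              # shared scratch buffer (1 MiB, arbitrary)

def send(obj):
    data = pickle.dumps(obj, -1)             # serialization cost paid here...
    struct.pack_into("q", region, 0, len(data))
    region.seek(8)
    region.write(data)

def receive():
    (size,) = struct.unpack_from("q", region, 0)
    region.seek(8)
    return pickle.loads(region.read(size))   # ...and deserialization cost here

send({"ints": list(range(1000))})
print(len(receive()["ints"]))
```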
"Server Process" sounds like a bad idea. Am I correct in that this option requires extra processing power and does not even use shared memory?
Not really. It depends on how you would implement it.
The big question then... do "queues" and "pipes" used shared memory or do they pass data back and forth between processes? (if they used shared memory, then that would be perfect)
Queues most likely do, pipes absolutely do not.
Does PyPy have any other options for me?
I wonder if it could be done with an object space, or similarly done "behind the scenes" in the PyPy interpreter, sort of the way ZODB works semi-transparently. Only in this case completely transparently.
True Pre-emptive scheduling?
This wouldn't really be difficult, although doing it efficiently might very well be without some serious black magic. But PyPy may also be the right tool for that since the black magic can be written in Python or RPython instead of C.
Any way to get pre-emptive micro-threads? Stackless (the real Stackless, not the one in PyPy) has the ability to suspend them after a certain number of interpreter instructions; however, this is prone to problems because it can run much longer than expected. Ideally, I would like to have true pre-emptive scheduling using hardware interrupts based on timing or CPU cycles (like the OS does for real threads).
By using a process for each thread, and some shared memory arena for the bulk of the application data structures, this is probably quite possible without reimplementing the OS in Python.
I am currently not aware of any way to achieve this in CPython, PyPy, Unladen Swallow, Stackless, etc....
I've done this a number of times, both with threads and with processes. Processes ironically give you finer control over scheduling since you aren't stuck behind the GIL, but as you are finding, you need some way to share data.
Are there detailed docs on why the Python GIL exists?
Here is the page from the Python Wiki: http://wiki.python.org/moin/GlobalInterpreterLock And here is an interesting article on the GIL problem: http://blog.ianbicking.org/gil-of-doom.html -- Evan Cofsky "The UNIX Man" <evan@tunixman.com>
On Tue, Jul 27, 2010 at 08:27, Evan Cofsky <evan@theunixman.com> wrote:
On 07/26 22:09, Kevin Ar18 wrote:
Are there detailed docs on why the Python GIL exists?
Here is the page from the Python Wiki:
To keep it short, CPython uses refcounting, and without the GIL the refcount incs and decs would need to be atomic, with a huge performance impact (that's discussed in the links below). However, you can look at this answer from Guido van Rossum: http://www.artima.com/weblogs/viewpost.jsp?thread=214235 And these two attempts to remove the GIL: http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Global_Interpreter... http://code.google.com/p/python-safethread/

PyPy does not have this problem, but you still need to make the dictionaries holding the members of each object thread-safe. You don't need to make lists thread-safe, I think, because the programmer is supposed to lock them, but you do want to allow a thread to add a member to an object while another thread performs a method call.

Anyway, all this just explains why the GIL is still there, which is a slightly different question from the original one. With state-of-the-art technology, it is bad on every front, except simplicity of implementation.
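As a toy illustration of the refcounting in question (CPython-specific): every time a reference is created or dropped the interpreter adjusts a per-object counter, and removing the GIL would mean making all of those updates atomic:

```python
import sys

obj = object()
print(sys.getrefcount(obj))   # baseline (the call itself holds one extra reference)
alias = obj                   # a plain assignment increments the count
print(sys.getrefcount(obj))
del alias                     # dropping the name decrements it again
print(sys.getrefcount(obj))
```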
And here is an interesting article on the GIL problem:
Given that processor frequencies aren't going to increase much in the future the way they used to, while the number of cores is going to increase much more, this article seems outdated nowadays - see also http://atlee.ca/blog/2006/06/27/python-warts-2/. This other link (http://poshmodule.sourceforge.net/) used to be interesting for the problem you are discussing, but also seems dead - there are other modules here: http://wiki.python.org/moin/ParallelProcessing. Best regards -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
I won't even bother giving individual replies. It's going to take me some time to go through all that information on the GIL, so I guess there's not much of a reply I can give anyways. :) Let me explain what this is all about in greater detail. BTW, if there are more links on the GIL, feel free to post.
Anonymous memory-mapped regions would work, with a suitable data abstraction. Or even memory-mapped files, which aren't really all that different on systems anymore.

I considered that... however, that would mean writing a significant library to convert Python data types to C/machine types, and I wasn't looking forward to that prospect... although after some experimenting, maybe I will find that it won't be that big a deal for my particular situation.
-----------------------
What this is all about:
-----------------------

I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf). There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html. Also, there are some similarities to Erlang - the share-nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.

Consider p74 and p75 of the FBP book (http://www.jpaulmorrison.com/fbp/book.pdf). Programs essentially consist of many "black boxes" connected together. A box receives data, processes it and passes it along to another box, to output, or drops/deletes it. Each box is like a mini-program written in a traditional programming language (like C++ or Python). The process of connecting the boxes together was actually designed to be programmed visually, as you can see from the examples in the book (I have no idea if it works well, as I am merely starting to experiment with it).

Each box, being a self-contained "program", only has access to 3 kinds of data:
(1) Its own internal variables.
(2) The "in ports": connections from other boxes allowing the box to receive data to be processed (very similar to the arguments in a function call).
(3) The "out ports": after processing the data, the box sends results to various "out ports" (which, in turn, go to another box's "in port" or to system output). There is no "return" like in functions... and a box can continually generate many pieces of data on the "out ports", unlike a function which only generates one return.

------------------------

At this point, my understanding of the FBP concept is extremely limited. Unfortunately, the author does not have very detailed documentation on the implementation details. So, I am going to try exploring the concept on my own and see if I can actually use it in some production code.

Implementation of FBP requires a custom scheduler for several reasons:
(1) A box can only run if it has actual data on its "in port(s)". Thus, the scheduler would only schedule boxes to run when they can actually process some data.
(2) In theory, it may be possible to end up with hundreds or thousands of these lightweight boxes. Using heavyweight OS threads or processes for every one is out of the question.

The Kamaelia website describes a simplistic single-threaded way to write a scheduler in Python that would work for the FBP concept (even though they had never heard of FBP when they designed Kamaelia). Based on that, it seems like writing a simple scheduler would be rather easy. In a perfect world, here's what I might do (assume a quad core CPU):
(1) Spawn 1 process.
(2) Spawn 4 threads and assign each thread to only 1 core -- in other words, don't let the OS handle moving threads around to different cores.
(3) Inside each thread, have a mini scheduler that switches back and forth between the many micro-threads (or "boxes") -- note that the OS should not handle any of the switching between micro-threads/boxes, as it does it all wrong (and too heavyweight) for this situation.
(4) Using a shared memory queue, each of the 4 schedulers can get the next box to run... or add more boxes to the schedule queue.
(5) Each box has access to its "in ports" and "out ports" only -- and nothing else. These can be implemented as shared memory for speed.
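For a feel of the single-threaded version, here is a tiny sketch of the "boxes with in/out ports" idea, loosely in the spirit of the Kamaelia MiniAxon tutorial; the box names and wiring are invented, and it deliberately ignores the multi-core questions above:

```python
from collections import deque

def doubler(inport, outport):
    while True:
        if inport:                      # only do work when data is waiting on the in port
            outport.append(inport.popleft() * 2)
        yield                           # hand control back to the scheduler

def printer(inport):
    while True:
        if inport:
            print(inport.popleft())
        yield

source = deque([1, 2, 3])               # data fed into the first box
wire = deque()                          # connection from doubler's out port to printer's in port

boxes = [doubler(source, wire), printer(wire)]
for _ in range(10):                     # naive round-robin scheduler; a real one would
    for box in boxes:                   # only wake boxes whose in ports hold data
        next(box)
```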
Some notes:

Garbage collection - I noticed that one of the issues mentioned about the GIL was garbage collection. Within the FBP concept, this MIGHT be easily solved:
(a) Only 1 running piece of code (1 box) can access a piece of data at a time, so there are no worries about whether there are dangling pointers to the var/object somewhere, etc.
(b) Data must be manually "dropped" inside a box to get rid of it; thus, there is no need to go checking for data that is not used anymore.

Threading protection - In theory, there are significantly fewer threading issues since:
(a) Only one box can control/access data at a time.
(b) The only place where there is contention is when you push/pop from the in/out ports... and that is trivial to protect against.

Anyways, I appreciate the replies. At this point, I guess I'll just go for a simplistic implementation to get a feel for how things work. Then, maybe I can check whether something better can be done in PyPy.
On 28 July 2010 04:20, Kevin Ar18 <kevinar18@hotmail.com> wrote:
I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf) There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.
Does anyone know if there is a central resource for incompatible python memory model proposals? I know of Jython, Python-Safethread, and Mont-E.

I do like the idea of MiniAxon, but let me mention a topic that has slowly been bubbling to the front of my mind for the last few months. Concurrency in the face of shared mutable state is hard. It makes it trivial to introduce bugs all over the place. Nondeterminacy-related bugs are so much harder to test, diagnose, and fix than anything else that I would almost mandate static verification (via optional typing, probably) of task noninterference if I was moving to a concurrent environment with shared mutable state.

There might be a reasonable middle ground where, if a task attempts to violate the required static semantics, it fails dynamically. At least then, latent bugs make plenty of noise. An example for MiniAxon (as I understand it, which is not very well) would be verification that a "task" (including functions that the task calls) never closes over and yields the same mutable objects, and never mutates globally reachable objects. I wonder if you could close such tasks off with a clever subclass of the proxy object space that detects and rejects such memory model violations? With only semantics that make the program deterministic?

The moral equivalent would be cooperating processes with a large global (or not) shared memory store for immutable objects, queues for communication, and the additional semantic that objects in a queue are either immutable or the queue holds their only reference. The trouble is that it is so hard to work out what immutable really means. Non-optional annotations would be not very pythonian.

-- William Leslie
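As a rough illustration of the "fail dynamically" idea, here is a minimal pure-Python wrapper (the class name is invented) that forwards reads but raises on any mutation; a real version would presumably live in a proxy object space and would also need to intercept mutating methods such as dict.update, which this sketch does not:

```python
class FrozenProxy(object):
    def __init__(self, target):
        object.__setattr__(self, "_target", target)

    # reads are forwarded to the wrapped object
    def __getattr__(self, name):
        return getattr(object.__getattribute__(self, "_target"), name)

    def __getitem__(self, key):
        return object.__getattribute__(self, "_target")[key]

    # any mutation fails loudly instead of silently breaking determinism
    def __setattr__(self, name, value):
        raise RuntimeError("attempt to mutate a shared object")

    def __setitem__(self, key, value):
        raise RuntimeError("attempt to mutate a shared object")

shared = FrozenProxy({"a": 1})
print(shared["a"])               # reading is fine
try:
    shared["b"] = 2              # mutating raises
except RuntimeError as e:
    print(e)
```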
On Wed, Jul 28, 2010 at 14:54, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 28 July 2010 04:20, Kevin Ar18 <kevinar18@hotmail.com> wrote:
I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf) There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.
Does anyone know if there is a central resource for incompatible python memory model proposals? I know of Jython, Python-Safethread, and Mont-E.
Add Unladen Swallow to your list - the "Jython memory model" is undocumented. I don't know of Mont-E, can't find its website through Google (!), and there seems to be no such central resource.
I do like the idea of MiniAxon, but let me mention a topic that has slowly been bubbling to the front of my mind for the last few months.
Concurrency in the face of shared mutable state is hard. It makes it trivial to introduce bugs all over the place. Nondeterminacy related bugs are far harder to test, diagnose, and fix than anything else that I would almost mandate static verification (via optional typing, probably) of task noninterference if I was moving to a concurrent environment with shared mutable state.
This is a general issue with concurrency, and usually I try to solve this using more pencil-and-paper design than usual.
There might be a reasonable middle ground where, if a task attempts to violate the required static semantics, it fails dynamically. At least then, latent bugs make plenty of noise.
In general, I've seen lots of research on this, and something implemented in Valgrind - see here for links: http://blaisorbladeprog.blogspot.com/2010/07/automatic-race-detection.html. Given the interest on this, the lack of complete tools might mean that it is just too hard currently.
An example for MiniAxon (as I understand it, which is not very well) would be verification that a "task" (including functions that the task calls) never closes over and yields the same mutable objects, and never mutates globally reachable objects.
I guess that 'close over' here means 'getting as input'.
I wonder if you could close such tasks off with a clever subclass of the proxy object space that detects and rejects such memory model violations? With only semantics that make the program deterministic?
The moral equivalent would be cooperating processes with a large global (or not) shared memory store for immutable objects, queues for communication, and the additional semantic that objects in a queue are either immutable or the queue holds their only reference.
In C++, auto_ptr does this, but that's hard in Python.
The trouble is that it is so hard to work out what immutable really means. Non-optional annotations would be not very pythonian.
If you want static guarantees, you need a statically typed language. The usual argument for dynamic languages is that instead of static typing, you need to write unit tests, and since you must do that anyway, dynamic languages are a win. We have two incomplete attempts to make programs correct:

- Types give strong guarantees against a subclass of errors (you _never_ get certain errors from a program which compiles)
- Testing gives weak guarantees (which go just as far as you test), but covers all classes of errors
- The middle ground would be to require annotations to prove properties. One would need (once and for all) to annotate even strings as immutable!

Cheers, -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
On 28 July 2010 23:37, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
On Wed, Jul 28, 2010 at 14:54, William Leslie <william.leslie.ttg@gmail.com> wrote:
Does anyone know if there is a central resource for incompatible python memory model proposals? I know of Jython, Python-Safethread, and Mont-E.
Add Unladen Swallow to your list - the "Jython memory model" is undocumented. I don't know of Mont-E, can't find its website through Google (!), and there seems to be no such central resource.
Mont-E was, for a long time, the hypothetical capability-secure subset of python based on E and discussed on cap-talk. A handful of people started work on it in earnest as a cpython fork fairly recently, but it does seem to be pretty quiet, and documentation free. I did find a repository and a presentation: http://bytebucket.org/habnabit/mont-e/overview https://docs.google.com/present/view?id=d9wrrrq_15ch78nq9n
This is a general issue with concurrency, and usually I try to solve this using more pencil-and-paper design than usual.
I found the following paper pretty interesting. The motivating study is some concurrency experts implementing software for proving the lack of deadlock in Java. Even with the sort of dedication that only a researcher with no life can provide, their deadlock inference software itself deadlocked after many years of use. www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
An example for MiniAxon (as I understand it, which is not very well) would be verification that a "task" (including functions that the task calls) never closes over and yields the same mutable objects, and never mutates globally reachable objects.
I guess that 'close over' here means 'getting as input'.
I mean that it keeps a reference to the objects between invocations. Hence, sharing mutable state.
The trouble is that it is so hard to work out what immutable really means. Non-optional annotations would be not very pythonian.
If you want static guarantees, you need a statically typed language. The usual argument for dynamic languages is that instead of static typing, you need to write unit tests, and since you must do that anyway, dynamic languages are a win.
One thing that many even very experienced hackers miss is that (static) types (and typesystems) actually cover a broad range of usages, and many of them are very different to the structural typesafety systems you are used to in C# and Java. A typesystem can prove anything that is statically computable, from the noninterference of effects to program termination, the ability to stack allocate data structures, and that privileged information can't be tainted. It's important to realise that these are orthogonal to, not supersets of, typesystems that validate structural safety. So it can be reasonable, if yet a little more difficult, to apply them to dynamic languages. -- William Leslie
As a followup to my earlier post: "pre-emptive micro-threads utilizing shared memory message passing?"

I am actually finding that the biggest hurdle to accomplishing what I want is the lack of ANY type of shared memory -- even if it is limited. I wonder if I might ask a question: would the following be a possible way to offer a limited type of shared memory?

Summary: create a system very, very similar to POSH, but with differences. In detail, here's what I mean:

* Unlike POSH, utilize OS threads and shared memory (not processes).
* Create a special shared memory location where you can place Python objects.
* Each Python object you place into this location can only be accessed (modified) by 1 thread.
* You must manually assign ownership of an object to a particular thread.
* The thread that "owns" the object is the only one that can modify it.
* You can transfer ownership to another thread (but, as always, only the owner can modify it).
* There is no GIL when a thread interacts with these special objects. You can have true thread parallelism if your code uses a lot of these special objects.
* The GIL remains in place for all other data access.
* If your code has a mixture of access to the special objects and regular data, then once you hit a point where a thread starts to interact with data not in the special storage, that thread must follow GIL rules.

Granted, there might be some difficulty with the GIL part... but I thought I might ask anyways. :)
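To make the ownership rule concrete, here is a hand-wavy pure-Python sketch (the class and method names are invented); it only models the proposed semantics with an explicit owner check and does nothing about the GIL itself:

```python
import threading

class Owned(object):
    """Wrapper enforcing the one-owner rule at the Python level, by convention."""

    def __init__(self, value):
        self.value = value
        self._owner = threading.current_thread()

    def _check(self):
        if threading.current_thread() is not self._owner:
            raise RuntimeError("only the owning thread may touch this object")

    def get(self):
        self._check()
        return self.value

    def set(self, value):
        self._check()
        self.value = value

    def transfer(self, thread):
        self._check()                    # only the current owner may hand it off
        self._owner = thread

box = Owned([1, 2, 3])

def worker():
    box.set(box.get() + [4])             # raises unless ownership was transferred first

t = threading.Thread(target=worker)
box.transfer(t)                          # hand the object to the worker before starting it
t.start()
t.join()
```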
On Wed, Jul 28, 2010 at 8:33 PM, Kevin Ar18 <kevinar18@hotmail.com> wrote:
As a followup to my earlier post: "pre-emptive micro-threads utilizing shared memory message passing?"
I am actually finding that the biggest hurdle to accomplishing what I want is the lack of ANY type of shared memory -- even if it is limited. I wonder if I might ask a question:
Would the following be a possible way to offer a limited type of shared memory:
Summary: create a system very, very similar to POSH, but with differences:
In detail, here's what I mean: * unlike POSH, utilize OS threads and shared memory (not processes) * Create a special shared memory location where you can place Python objects * Each Python object you place into this location can only be accessed (modified) by 1 thread. * You must manually assign ownership of an object to a particular thread. * The thread that "owns" the object is the only one that can modify it. * You can transfer ownership to another thread (but, as always only the owner can modify it).
* There is no GIL when a thread interacts with these special objects. You can have true thread parallelism if your code uses a lot of these special objects. * The GIL remains in place for all other data access. * If your code has a mixture of access to the special objects and regular data, then once you hit a point where a thread starts to interact with data not in the special storage, then that thread must follow GIL rules.
Granted, there might be some difficulty with the GIL part... but I thought I might ask anyways. :)
Honestly, that sounds really difficult, out and out removing the GIL would probably be easier. Alex -- "I disapprove of what you say, but I will defend to the death your right to say it." -- Voltaire "The people's good is the highest law." -- Cicero "Code can always be simpler than you think, but never as simple as you want" -- Me
Honestly, that sounds really difficult, out and out removing the GIL would probably be easier.

Based on the extremely limited info on the GIL, the big issue I noticed was two pieces of code trying to modify the same object at the same time, because of the way objects are stored internally in Python and because of garbage collection. I figured that if you have special objects that cannot be simultaneously accessed, maybe that would be possible.
On 29 July 2010 11:33, Kevin Ar18 <kevinar18@hotmail.com> wrote:
In detail, here's what I mean: * unlike POSH, utilize OS threads and shared memory (not processes) * Create a special shared memory location where you can place Python objects * Each Python object you place into this location can only be accessed (modified) by 1 thread. * You must manually assign ownership of an object to a particular thread. * The thread that "owns" the object is the only one that can modify it. * You can transfer ownership to another thread (but, as always only the owner can modify it).
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propagated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
On Thu, Jul 29, 2010 at 7:18 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 11:33, Kevin Ar18 <kevinar18@hotmail.com> wrote:
In detail, here's what I mean: * unlike POSH, utilize OS threads and shared memory (not processes) * Create a special shared memory location where you can place Python objects * Each Python object you place into this location can only be accessed (modified) by 1 thread. * You must manually assign ownership of an object to a particular thread. * The thread that "owns" the object is the only one that can modify it. * You can transfer ownership to another thread (but, as always only the owner can modify it).
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propogated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
-- William Leslie
Hi. Do you have any data points supporting your claim? Cheers, fijal
On 29 July 2010 17:27, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 7:18 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propogated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
-- William Leslie
Hi.
Do you have any data points supporting your claim?
About the performance of programs that involve a cache miss on every memory access, or internal consistency? -- William Leslie
On Thu, Jul 29, 2010 at 9:32 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 17:27, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 7:18 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propogated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
-- William Leslie
Hi.
Do you have any data points supporting your claim?
About the performance of programs that involve a cache miss on every memory access, or internal consistency?
I think I lost some implication here. Did I get you right - you claim that per-object locking, in the case where threads share objects, is very expensive, is that correct? If not, I completely misunderstood you and my question makes no sense, please explain. If yes, why does it mean a cache miss on every read/write? Cheers, fijal
On 29 July 2010 17:40, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 9:32 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 17:27, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 7:18 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propogated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
-- William Leslie
Hi.
Do you have any data points supporting your claim?
About the performance of programs that involve a cache miss on every memory access, or internal consistency?
I think I lost some implication here. Did I get you right - you claim that per-object locking in case threads share obejcts are very expensive, is that correct? If not, I completely misunderstood you and my question makes no sense, please explain. If yes, why does it mean a cache miss on every read/write?
I claim that there are two alternatives in the face of one thread mutating an object and the other observing:

0. You can give up consistency and do fine-grained locking, which is reasonably fast but error prone, or
1. Expect python to handle all of this for you, effectively not making a change to the memory model. You could do this with implicit per-object locks, which might be reasonably fast in the absence of contention, but not when several threads are trying to use the object.

Queues already are in a sense your per-object-lock, one-thread-mutating, but usually one thread has acquire semantics and one has release semantics, and that combination actually works. It's when you expect to have a full memory barrier that is the problem. Come to think of it, you might be right Kevin: as long as only one thread mutates the object, the mutating thread never /needs/ to acquire, as it knows that it has the latest revision. Have I missed something? -- William Leslie
On Thu, Jul 29, 2010 at 9:57 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 17:40, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 9:32 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 17:27, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 7:18 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
When an object is mutable, it must be visible to at most one thread. This means it can participate in return values, arguments and queues, but the sender cannot keep a reference to an object it sends, because if the receiver mutates the object, this will need to be reflected in the sender's thread to ensure internal consistency. Well, you could ignore internal consistency, require explicit locking, and have it segfault when the change to the length of your list has propogated but not the element you have added, but that wouldn't be much fun. The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower. So you really can't have two tasks sharing mutable objects, ever.
-- William Leslie
Hi.
Do you have any data points supporting your claim?
About the performance of programs that involve a cache miss on every memory access, or internal consistency?
I think I lost some implication here. Did I get you right - you claim that per-object locking in case threads share obejcts are very expensive, is that correct? If not, I completely misunderstood you and my question makes no sense, please explain. If yes, why does it mean a cache miss on every read/write?
I claim that there are two alternatives in the face of one thread mutating an object and the other observing:
0. You can give up consistency and do fine-grained locking, which is reasonably fast but error prone, or 1. Expect python to handle all of this for you, effectively not making a change to the memory model. You could do this with implicit per-object locks which might be reasonably fast in the absence of contention, but not when several threads are trying to use the object.
Queues already are in a sense your per-object-lock, one-thread-mutating, but usually one thread has acquire semantics and one has release semantics, and that combination actually works. It's when you expect to have a full memory barrier that is the problem.
Come to think of it, you might be right Kevin: as long as only one thread mutates the object, the mutating thread never /needs/ to acquire, as it knows that it has the latest revision.
Have I missed something?
-- William Leslie
So my question is why you think 1. is really expensive (can you find evidence?). I don't see what it has to do with cache misses. Besides, in python you cannot guarantee much about mutability of objects. So you don't know if an object passed in a queue is mutable or not, unless you restrict yourself to some very simple types (in which case there is no shared memory, since you only pass immutable objects). Cheers, fijal
On 29 July 2010 18:02, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 9:57 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
I claim that there are two alternatives in the face of one thread mutating an object and the other observing:
0. You can give up consistency and do fine-grained locking, which is reasonably fast but error prone, or 1. Expect python to handle all of this for you, effectively not making a change to the memory model. You could do this with implicit per-object locks which might be reasonably fast in the absence of contention, but not when several threads are trying to use the object.
Queues already are in a sense your per-object-lock, one-thread-mutating, but usually one thread has acquire semantics and one has release semantics, and that combination actually works. It's when you expect to have a full memory barrier that is the problem.
Come to think of it, you might be right Kevin: as long as only one thread mutates the object, the mutating thread never /needs/ to acquire, as it knows that it has the latest revision.
Have I missed something?
-- William Leslie
So my question is why you think 1. is really expensive (can you find evidence). I don't see what is has to do with cache misses. Besides, in python you cannot guarantee much about mutability of objects. So you don't know if object passed in a queue is mutable or not, unless you restrict yourself to some very simlpe types (in which case there is no shared memory, since you only pass immutable objects).
If task X expects that task Y will mutate some object it has, it needs to go back to the source for every read. This means that if you do use mutation of some shared object for communication, it needs to be synchronised before every access. What this means for us is that every read from a possibly mutable object requires an acquire, and every write requires a release. It's as if every reference in the program is implemented with a volatile pointer. Even if the object is never mutated, there can be a lot of unnecessary bus chatter waiting for MESI to tell us so. -- William Leslie
On Thu, Jul 29, 2010 at 10:50 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 18:02, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 9:57 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
I claim that there are two alternatives in the face of one thread mutating an object and the other observing:
0. You can give up consistency and do fine-grained locking, which is reasonably fast but error prone, or 1. Expect python to handle all of this for you, effectively not making a change to the memory model. You could do this with implicit per-object locks which might be reasonably fast in the absence of contention, but not when several threads are trying to use the object.
Queues already are in a sense your per-object-lock, one-thread-mutating, but usually one thread has acquire semantics and one has release semantics, and that combination actually works. It's when you expect to have a full memory barrier that is the problem.
Come to think of it, you might be right Kevin: as long as only one thread mutates the object, the mutating thread never /needs/ to acquire, as it knows that it has the latest revision.
Have I missed something?
-- William Leslie
So my question is why you think 1. is really expensive (can you find evidence). I don't see what is has to do with cache misses. Besides, in python you cannot guarantee much about mutability of objects. So you don't know if object passed in a queue is mutable or not, unless you restrict yourself to some very simlpe types (in which case there is no shared memory, since you only pass immutable objects).
If task X expects that task Y will mutate some object it has, it needs to go back to the source for every read. This means that if you do use mutation of some shared object for communication, it needs to be synchronised before every access. What this means for us is that every read from a possibly mutable object requires an acquire, and every write requires a release. It's as if every reference in the program is implemented with a volatile pointer. Even if the object is never mutated, there can be a lot of unnecessary bus chatter waiting for MESI to tell us so.
I do agree there is an overhead. Can you provide some data on how much this overhead is? Python is not a very simple language and a lot of things are complex and time consuming, so I wonder how it compares to locking per object.
On 29 July 2010 18:55, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 10:50 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
If task X expects that task Y will mutate some object it has, it needs to go back to the source for every read. This means that if you do use mutation of some shared object for communication, it needs to be synchronised before every access. What this means for us is that every read from a possibly mutable object requires an acquire, and every write requires a release. It's as if every reference in the program is implemented with a volatile pointer. Even if the object is never mutated, there can be a lot of unnecessary bus chatter waiting for MESI to tell us so.
I do agree there is an overhead. Can you provide some data how much this overhead is? Python is not a very simple language and a lot of things are complex and time consuming, so I wonder how it compares to locking per object.
It *is* locking per object, but you also spend time looking for the data if someone else has invalidated your cache line. Come to think of it, that isn't as bad as it first seemed to me. If the sender never mutates the object, it will Just Work on any machine with a fairly flat cache architecture. Sorry. Carry on. -- William Leslie
On Thu, Jul 29, 2010 at 15:15, William Leslie <william.leslie.ttg@gmail.com> wrote:
On 29 July 2010 18:55, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Thu, Jul 29, 2010 at 10:50 AM, William Leslie <william.leslie.ttg@gmail.com> wrote:
If task X expects that task Y will mutate some object it has, it needs to go back to the source for every read. This means that if you do use mutation of some shared object for communication, it needs to be synchronised before every access. What this means for us is that every read from a possibly mutable object requires an acquire, and every write requires a release. It's as if every reference in the program is implemented with a volatile pointer. Even if the object is never mutated, there can be a lot of unnecessary bus chatter waiting for MESI to tell us so.
I do agree there is an overhead. Can you provide some data how much this overhead is? Python is not a very simple language and a lot of things are complex and time consuming, so I wonder how it compares to locking per object.
Below I try to prove that locking is still too expensive, even for an interpreter. Also, for many things the clever optimizations you do allow making those costs small, at least for the average case / fast path. I have been taught to consider clever optimizations as required.

With JIT compilation, specialization and shadow classes, are method calls much more expensive than a guard and (if no inlining is done, as might happen in PyPy in the worst case for big functions) an assembler 'call' opcode, and possibly stack shuffling? How many cycles is that? How much more expensive is that than optimized JavaScript (which is not far from C, the only difference being the guard)? You can assume the case of plain calls without keyword arguments and so on (and with inlining, keyword arguments should pay no runtime cost).

Also, the free threading patches which tried removing the GIL gave an unacceptable (IIRC 2x) slowdown to CPython in the old days of CPython 1.5. And I don't think they tried to lock every object, just what you need to lock (which included refcounts).
It *is* locking per object, but you also spend time looking for the data if someone else has invalidated your cache line.
That overhead is already there in locking per object, I think (locking can be much more expensive than a cache miss, see below). However, locking per object does not prevent race conditions unless you make atomic regions as big as actually needed (locking per statement does not work); it just prevents data races (a conflict between a write and a memory operation which are not synchronized with each other). And you can't extend atomic regions indefinitely, as that implies starvation. Even software transactional memory requires the programmer to mark which regions have to be atomic. Given the additional cost (discussed elsewhere in this mail), and given that there is not much benefit, I think locking-per-object is not worth it (but I'd still love to know more about why the effort on python-safethread was halted).
Come to think of it, that isn't as bad as it first seemed to me. If the sender never mutates the object, it will Just Work on any machine with a fairly flat cache architecture.
You first wrote: "The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower." This is not "locking per object", it is just semantically close to it, and becomes equivalent if only one thread has a reference at any time. They are very different though performance-wise, and each of them is better for some usages. In the Linux kernel (which I consider quite authoritative here, on what you can do in C) both are used for valid performance reasons, and a JIT compiler could choose between them. Here, first I describe the two alternatives mentioned; finally, I go to the combination for the "unshared case".

- What you first described (memory barriers or uncached R/Ws) can be faster for small updates, depending on the access pattern. An uncached memory area does not disturb other memory traffic, unlike memory barriers which are global, but I don't think an unprivileged process is allowed to obtain one (by modifying MSRs or PATs, for x86). Cost: each memory op goes to main memory and is thus as slow as a cache miss (hundreds of clock cycles). When naively reading a Python field, many such reads may be needed, but a JIT compiler can bring it down to the equivalent of a C access with shadow classes and specialization, and this would pay even more here (V8 does it for JavaScript and I think PyPy already does most or all of it).

- Locking per object (monitors): slow upfront, but you can do each r/w out of your cache, so if the object is kept locked for some time, this is more efficient. How slow? A system call to perform locking can cost tens of thousands of cycles. But Java locks, and nowadays even Linux futexes (and Windows locks), perform everything in userspace in as many cases as possible (the slow path is when there is actually contention on the lock, but that's uncommon with locking-per-object). I won't sum up here the literature on this.

- Since no contention is expected here, a simple couple of memory barriers is needed on send/receive (a write barrier for send, a read one for receive, IIRC). Allowing read-only access to another thread already brings us back to a mixture of the above two solutions. However, in the 1st solution, using memory barriers, you'd need a write barrier for every write, but you could save on read barriers.

-- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
![](https://secure.gravatar.com/avatar/79679073d76b29e22a54de31f845ea6d.jpg?s=120&d=mm&r=g)
On 30 July 2010 06:53, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
Come to think of it, that isn't as bad as it first seemed to me. If the sender never mutates the object, it will Just Work on any machine with a fairly flat cache architecture.
You first wrote: "The alternative, implicitly writing updates back to memory as soon as possible and reading them out of memory every time, can be hundreds or more times slower." This is not "locking per object", it is just semantically close to it, and becomes equivalent if only one thread has a reference at any time.
Yes, direct memory access was misdirection (sorry), as the cache hardware already handles consistency, even in NUMA systems of the size that sits on most desktops today, and most significantly you still need to lock objects in many cases, such as looking up an entry in a dict, which can change size while probing. Not only are uncached accesses needlessly slow in the typical case, but they are not sufficient to ensure consistency of some resizable rpython data structures. -- William Leslie
![](https://secure.gravatar.com/avatar/8b7229d275bd0e27e3e256ef4f317761.jpg?s=120&d=mm&r=g)
I claim that there are two alternatives in the face of one thread mutating an object and the other observing: Well, I did consider the possibility of one thread being able to change while the others observe, but I have no idea if that is too complicated, as you are suggesting. However, that is not even necessary. An even more limited form would work fine (at least for me):
Two possible modes:

Read/write from 1 thread:
* ONLY one thread can change and observe (read) the object -- no other threads have access of any kind or even know of its existence until you transfer control to another thread (then only the thread you transferred control to has access).

(Optional) read only from all threads:
* Optionally, you could have objects that are in read-only mode, which all threads can observe.

To make things easier, maybe special GIL-free threads could be added. (They would still be OS-level threads, but with special properties in Python.) These threads would have the property that they could ONLY access data stored in the special object store to which they have read/write privilege. They can't access other objects not in the special store. As a result, these special threads would be free of the GIL and could run in parallel.
Queues already are in a sense your per-object-lock, one-thread-mutating, but usually one thread has acquire semantics and one has release semantics, and that combination actually works. It's when you expect to have a full memory barrier that is the problem.
Now you brought up something interesting: queues. To be honest, something like queues and pipes would be good enough for my purposes -- if they used shared memory. Currently, the implementation of queues and pipes in the multiprocessing module seems rather costly, as they use processes and require copying data back and forth. In particular, what would be useful:

* A queue that holds self-contained Python objects (with no pointers/references to other data not in the queue, so as to prevent threading issues)
* The queue can be accessed by all special threads simultaneously (in parallel). You would only need locks around queue operations, but that is pretty easy to do -- unless there is some hidden interpreter problem that would make this easy task hard.
* Streaming buffers -- like a file buffer or something similar, so you can send data from one thread to another as it comes in (when you don't know when it will end, or it may never end). Only two threads have access: one to put data in, the other to extract it.
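For what it's worth, within a single process the standard library already gives the shape of the API described here; a minimal sketch (my own illustration, assuming ordinary CPython threads, so it shares memory but still runs under the GIL rather than in parallel):

```python
import queue
import threading

def producer(out_q):
    for i in range(5):
        out_q.put({"payload": i})   # hand the whole object over; producer stops using it
    out_q.put(None)                 # sentinel marking the end of the stream

def consumer(in_q):
    while True:
        item = in_q.get()
        if item is None:
            break
        print("got", item)

q = queue.Queue()                   # thread-safe: locking around queue operations is built in
threading.Thread(target=producer, args=(q,)).start()
worker = threading.Thread(target=consumer, args=(q,))
worker.start()
worker.join()
```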
0. You can give up consistency and do fine-grained locking, which is reasonably fast but error prone, or
1. Expect python to handle all of this for you, effectively not making a change to the memory model. You could do this with implicit per-object locks which might be reasonably fast in the absence of contention, but not when several threads are trying to use the object.
...
Come to think of it, you might be right Kevin: as long as only one thread mutates the object, the mutating thread never /needs/ to acquire, as it knows that it has the latest revision.
Have I missed something?
I'm afraid I don't know enough about Python's interpreter to say much. The only way would be for me to do some studying on interpreters/compilers and get digging into the codebase -- and I'm not sure how much time I have to do that right now. :) Perhaps the part about only one thread having read & write access changes the situation? One possible implementation might be similar to how POSH does it. (I'm not suggesting this because I know enough to say it is possible; I'm just putting something out there that might work.) Create a special virtual memory address or lookup table for each thread. When you assign a read+write object to a thread, it gets added to that thread's virtual address/memory table. Optionally, it could be up to the programmer to make sure they don't try to access data from a thread that does not have ownership/control of that object. If a programmer does try to access it, it would fail, as the memory address would point to nowhere/bad data/etc.... Of course, there are probably other, better ways to do it that are not as fickle as this... but I don't know if the limitations of the Python interpreter and GIL would allow better methods.
![](https://secure.gravatar.com/avatar/aa2d0ea46169b401151a00b8b97de928.jpg?s=120&d=mm&r=g)
Would comments from a project using this approach in real systems be of interest/use/help? Whilst I didn't know about Morrison's FBP (Balzer's work predates him btw - don't listen to hype), I had heard of (and played with) Occam among other more influential things, and Kamaelia is a real tool. Also, there is already a pre-existing FBP tool for Stackless, and then historically there's also MASCOT & friends. It just looks to me that you're tying yourself up in knots over things that aren't problems, when there are some things which could be useful (in practice) & interesting in this space. Oh, incidentally, Mini Axon is a toy/teaching/testing system - as the name suggests. The main Axon is more complete -- in the areas we've needed - it's been driven by real system needs. (for those who don't know me, Kamaelia is my project, I don't bite, but I do sometimes talk or type fast) Regards, Michael Sparks -- http://www.kamaelia.org/PragmaticConcurrency.html http://yeoldeclue.com/blog On 7/29/10, Kevin Ar18 <kevinar18@hotmail.com> wrote:
As a followup to my earlier post: "pre-emptive micro-threads utilizing shared memory message passing?"
I am actually finding that the biggest hurdle to accomplishing what I want is the lack of ANY type of shared memory -- even if it is limited. I wonder if I might ask a question:
Would the following be a possible way to offer a limited type of shared memory:
Summary: create a system very, very similar to POSH, but with differences:
In detail, here's what I mean: * unlike POSH, utilize OS threads and shared memory (not processes) * Create a special shared memory location where you can place Python objects * Each Python object you place into this location can only be accessed (modified) by 1 thread. * You must manually assign ownership of an object to a particular thread. * The thread that "owns" the object is the only one that can modify it. * You can transfer ownership to another thread (but, as always only the owner can modify it).
* There is no GIL when a thread interacts with these special objects. You can have true thread parallelism if your code uses a lot of these special objects. * The GIL remains in place for all other data access. * If your code has a mixture of access to the special objects and regular data, then once you hit a point where a thread starts to interact with data not in the special storage, then that thread must follow GIL rules.
Granted, there might be some difficulty with the GIL part... but I thought I might ask anyways. :)
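A hypothetical sketch of the ownership rules described above, expressed in today's Python purely as a convention checker (`OwnedBox`, `get()` and `transfer()` are invented names; this does not remove the GIL or change the memory model, it only makes accidental access from a non-owning thread fail loudly):

```python
import threading

class OwnedBox:
    def __init__(self, obj):
        self._obj = obj
        self._owner = threading.get_ident()    # the creating thread owns the object

    def _check(self):
        if threading.get_ident() != self._owner:
            raise RuntimeError("this thread does not own the object")

    def get(self):
        self._check()
        return self._obj

    def transfer(self, thread_id):
        self._check()                           # only the current owner may hand it off
        self._owner = thread_id
```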
Date: Wed, 28 Jul 2010 22:54:39 +1000 Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? From: william.leslie.ttg@gmail.com To: kevinar18@hotmail.com CC: pypy-dev@codespeak.net
On 28 July 2010 04:20, Kevin Ar18 <kevinar18@hotmail.com> wrote:
I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf) There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages.
Does anyone know if there is a central resource for incompatible python memory model proposals? I know of Jython, Python-Safethread, and Mont-E.
I do like the idea of MiniAxon, but let me mention a topic that has slowly been bubbling to the front of my mind for the last few months.
Concurrency in the face of shared mutable state is hard. It makes it trivial to introduce bugs all over the place. Nondeterminacy-related bugs are so much harder to test, diagnose, and fix than anything else that I would almost mandate static verification (via optional typing, probably) of task noninterference if I was moving to a concurrent environment with shared mutable state. There might be a reasonable middle ground where, if a task attempts to violate the required static semantics, it fails dynamically. At least then, latent bugs make plenty of noise. An example for MiniAxon (as I understand it, which is not very well) would be verification that a "task" (including functions that the task calls) never closes over and yields the same mutable objects, and never mutates globally reachable objects.
I wonder if you could close such tasks off with a clever subclass of the proxy object space that detects and rejects such memory model violations, allowing only semantics that make the program deterministic?
The moral equivalent would be cooperating processes with a large global (or not) shared memory store for immutable objects, queues for communication, and the additional semantic that objects in a queue are either immutable or the queue holds their only reference. The trouble is that it is so hard to work out what immutable really means. Non-optional annotations would not be very pythonic.
-- William Leslie
![](https://secure.gravatar.com/avatar/a3a676c96a88feb813010e67af012ca0.jpg?s=120&d=mm&r=g)
On Tue, Jul 27, 2010 at 20:20, Kevin Ar18 <kevinar18@hotmail.com> wrote:
I won't even bother giving individual replies. It's going to take me some time to go through all that information on the GIL, so I guess there's not much of a reply I can give anyways. :) Let me explain what this is all about in greater detail.
BTW, if there are more links on the GIL, feel free to post.
Anonymous memory-mapped regions would work, with a suitable data abstraction. Or even memory-mapped files, which aren't really all that different on most systems anymore. I considered that... however, that would mean writing a significant library to convert Python data types to C/machine types, and I wasn't looking forward to that prospect... although after some experimenting, maybe I will find that it won't be that big a deal for my particular situation.
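For reference, the anonymous region itself is easy to get at from Python; the hard part is exactly the conversion layer mentioned, since the region only holds raw bytes. A minimal POSIX-only sketch (it relies on fork, and struct stands in for whatever serialization layer such a library would provide):

```python
import mmap, os, struct

buf = mmap.mmap(-1, 8)                     # anonymous mapping, shared with forked children
pid = os.fork()
if pid == 0:
    struct.pack_into("q", buf, 0, 42)      # child writes a 64-bit int into the shared region
    os._exit(0)
os.waitpid(pid, 0)
print(struct.unpack_from("q", buf, 0)[0])  # parent reads back 42
```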
I am attempting to experiment with FBP - Flow Based Programming (http://www.jpaulmorrison.com/fbp/ and book: http://www.jpaulmorrison.com/fbp/book.pdf) There is something very similar in Python: http://www.kamaelia.org/MiniAxon.html Also, there are some similarities to Erlang - the share nothing memory model... and on some very broad levels, there are similarities that can be found in functional languages. Except for the "visual programming" part, the general idea you describe stems from CSP (Communicating Sequential Processes) and is also found at least in the Scala actor library and in Google's Go with goroutines.
In both languages you can easily pretend that no memory is shared by avoiding sharing any pointers (unlike C, even buggy code can't modify a pointer which wasn't shared), and Go recommends programming this way. The difference is that this is only a convention. For the "visual programming", it looks like a particular case of what the Eclipse Modeling Framework is doing (they allow you to define types of diagrams, called metamodels, and a way to convert them to code, and generate a diagram editor and other support stuff. I'm not an expert on that).
From what you describe, FBP seems to give nothing new, except the combination of "visual programming" with this idea. Disclaimer: I did not read the book.
Consider p74 and p75 of the FBP book (http://www.jpaulmorrison.com/fbp/book.pdf). Programs essentially consist of many "black boxes" connected together. A box receives data, processes it and passes it along to another box or to output, or drops/deletes it. Each box is like a mini-program written in a traditional programming language (like C++ or Python).
The process of connecting the boxes together was actually designed to be programmed visually, as you can see from the examples in the book (I have no idea if it works well, as I am merely starting to experiment with it).
Each box, being a self contained "program," only has access to 3 parts: (1) its own internal variables (2) The "in ports" These are connections from other boxes allowing the box to receive data to be processed (very similar to the arguments in a function call) (3) The "out ports" After processing the data, the box sends results to various "out ports" (which, in turn, go to another box's "in port" or to system output). There is no "return" like in functions... and a box can continually generate many pieces of data on the "out ports", unlike a function which only generates one return.
------------------------ At this point, my understanding of the FBP concept is extremely limited. Unfortunately, the author does not have very detailed documentation on the implementation details. So, I am going to try exploring the concept on my own and see if I can actually use it in some production code.
Implementation of FBP requires a custom scheduler for several reasons: (1) A box can only run if it has actual data on the "in port(s)" Thus, the scheduler would only schedule boxes to run when they can actually process some data. (2) In theory, it may be possible to end up with hundreds or thousands of these light weight boxes. Using heavy-weight OS threads or processes for every one is out of the question.
The Kamaelia website describes a simplistic single-threaded way to write a scheduler in Python that would work for the FBP concept (even though they never heard of FBP when they designed Kamaelia). Based on that, it seems like writing a simple scheduler would be rather easy:
In a perfect world, here's what I might do: * Assume a quad core cpu (1) Spawn 1 process (2) Spawn 4 threads & assign each thread to only 1 core -- in other words, don't let the OS handle moving threads around to different cores (3) Inside each thread, have a mini scheduler that switches back and forth between the many micro-threads (or "boxes") -- note that the OS should not handle any of the switching between micro-threads/boxes as it does it all wrong (and too heavyweight) for this situation. (4) Using a shared memory queue, each of the 4 schedulers can get the next box to run... or add more boxes to the schedule queue.
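A rough single-threaded sketch of the kind of mini-scheduler described above, in the Kamaelia/MiniAxon style: each "box" is a generator that yields control back, and the scheduler round-robins over the runnable ones. The names are invented; the full design would run one such loop per worker thread and feed it from a shared queue.

```python
from collections import deque

def box(name, count):
    for i in range(count):
        print(name, "step", i)
        yield                       # give control back to the scheduler

def run(boxes):
    runnable = deque(boxes)
    while runnable:
        current = runnable.popleft()
        try:
            next(current)           # run the box until its next yield
            runnable.append(current)
        except StopIteration:
            pass                    # box finished, drop it

run([box("A", 3), box("B", 2)])
```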
Most of this is usual or standard - even if some people won't set thread-CPU affinity, possibly because they don't know about the syscall to do it, i.e. sched_setaffinity. IIRC, this was not mentioned in the paper I read about the Scala actor library. Look for 'N:M threading library' (without quotes) on Google.
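On Linux, that syscall is exposed directly by the standard library in later Python versions (os.sched_setaffinity, Python 3.3+); a tiny sketch of pinning the calling process/thread to core 0:

```python
import os

os.sched_setaffinity(0, {0})    # pid 0 = the calling process/thread; {0} = CPU core 0
print(os.sched_getaffinity(0))  # confirm the new affinity mask
```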
(5) Each box has access to its "in ports" and "out ports" only -- and nothing else. These can be implemented as shared memory for speed.
Some notes: Garbage Collection - I noticed that one of the issues mentioned about the GIL was garbage collection. Within the FBP concept, this MIGHT be easily solved: (a) only 1 running piece of code (1 box) can access a piece of data at a time, so there are no worries about whether there are dangling pointers to the var/object somewhere, etc...
(b) data must be manually "dropped" inside a box to get rid of it; thus, there is no need to go checking for data that is not used anymore
A "piece of data" can point to other objects, and the pointer can be modified. So you need GC anyway: having that, requiring data to be dropped explicitly seems just an annoyance (there might be deeper reasons, however).
Threading protection - In theory, there are significantly fewer threading issues since: (a) only one box can control/access data at a time (b) the only place where there is contention is when you push/pop from the in/out ports ... and that is trivial to protect against. Agreed. -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
![](https://secure.gravatar.com/avatar/bfc96d2a02d9113edb992eb96c205c5a.jpg?s=120&d=mm&r=g)
On Tue, Jul 27, 2010 at 4:09 AM, Kevin Ar18 <kevinar18@hotmail.com> wrote:
Might as well warn you: This is going to be a rather long post. I'm not sure if this is appropriate to post here or if would fit right in with the mailing list. Sorry, if it is the wrong place to post about this.
This is a relevant list for some of questions below. I'll try to answer them.
Quick Question: Do queues from the multiprocessing module use shared memory? If the answer is YES, you can just skip this section, because that would solve this particular problem.
PyPy has no multiprocessing module so far (besides, I think it's an ugly hack, but that's another issue).
Does PyPy have any other options for me?
Right now, no. But there are ways in which you can experiment. Truly concurrent threads (depends on implicit vs explicit shared memory) might require a truly concurrent GC to achieve performance. This is work (although not as big as removing refcounting from CPython for example).
True Pre-emptive scheduling?
----------------------------
Any way to get pre-emptive micro-threads? Stackless (the real Stackless, not the one in PyPy) has the ability to suspend them after a certain number of interpreter instructions; however, this is prone to problems because it can run much longer than expected. Ideally, I would like to have true pre-emptive scheduling using hardware interrupts based on timing or CPU cycles (like the OS does for real threads).
I am currently not aware of any way to achieve this in CPython, PyPy, Unladen Swallow, Stackless, etc....
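One partial approximation that is possible at the pure-Python level (not true hardware pre-emption): a periodic timer signal sets a flag, and the scheduler switches micro-threads at the next point where it checks that flag. Since CPython only runs signal handlers between bytecodes, a long-running C call still cannot be interrupted this way. A hedged sketch (PREEMPT and _tick are invented names; Unix-only because of setitimer):

```python
import signal

PREEMPT = False

def _tick(signum, frame):
    global PREEMPT
    PREEMPT = True                  # the scheduler checks and clears this flag

signal.signal(signal.SIGALRM, _tick)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)   # fire every 10 ms
```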
Sounds relatively easy, but you would need to write this part in RPython (however, that does not mean you get rid of the GIL).
Are there detailed docs on why the Python GIL exists? ----------------------------------------------------- I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it". It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?
The short answer is "yes". The long answer is that it's much easier to write an interpreter assuming the GIL is around. For fine-grained locking to work and be efficient, you would need:
* Some sort of concurrent GC (not specifically running in a separate thread, but having different pools of memory to allocate from)
* Possibly a JIT optimization that would remove some locking.
* The aforementioned locking, to ensure that it's not that easy to screw things up.
So, in short, "work".
![](https://secure.gravatar.com/avatar/a3a676c96a88feb813010e67af012ca0.jpg?s=120&d=mm&r=g)
Hi all! I am possibly interested in doing work on this, even if not in the immediate future. On Tue, Jul 27, 2010 at 11:48, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Tue, Jul 27, 2010 at 4:09 AM, Kevin Ar18 <kevinar18@hotmail.com> wrote:
Truly concurrent threads (depends on implicit vs explicit shared memory) might require a truly concurrent GC to achieve performance. This is work (although not as big as removing refcounting from CPython for example).
Are there detailed docs on why the Python GIL exists? ----------------------------------------------------- I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it". It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?
The short answer is "yes". The long answer is that it's much easier to write interpreter assuming GIL is around. For fine-grained locking to work and be efficient, you would need:
* The aforementioned locking, to ensure that it's not that easy to screw things up. I've wondered about the guarantees we need to offer to the programmer, and my guess was that Jython's memory model is similar. I've been concentrating on the dictionary of objects, on the assumption that lists and most other built-in structures should be locked by the programmer in case of concurrent modifications.
However, we don't want to require locking to support something like: Thread 1: obj.newmember=1; Thread 2: a = obj.oldmember; Looking for Jython memory model on Google produces some garbage and then this document from Unladen Swallow: http://code.google.com/p/unladen-swallow/wiki/MemoryModel It implicitly agrees on what's above (since Jython and IronPython both use thread-safe dictionaries), and then delves into issues about allowed reorderings. However, it requires that even racy code does not make the interpreter crash.
* Possibly a JIT optimization that would remove some locking. Any more specific ideas on this? * Some sort of concurrent GC (not specifically running in a separate thread, but having different pools of memory to allocate from)
Among all points, this seems the easiest design-wise. Having per-thread pools is nowadays standard, so it's _just_ work (as opposed to 'complicated design'). Parallel GCs become important just when lots of garbage must be reclaimed. A GC is called concurrent, rather than parallel, when it runs concurrently with the mutator, and this usually reduces both pause times and throughput, so you probably don't want this as default (it is useful for particular programs, such as heavily interactive programs or videogames, I guess), do you? More details are here: http://www.ibm.com/developerworks/java/library/j-jtp11253/ The trick used in the (mostly) concurrent collector of Hotspot seems interesting: it uses two short-stop-the-world phases and lets the program run in between. I think I'll look for a paper on it. Cheers, -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
![](https://secure.gravatar.com/avatar/7f1a0b5a9424e7a2293b96dd89477cda.jpg?s=120&d=mm&r=g)
On Tue, Jul 27, 2010 at 17:07 +0200, Maciej Fijalkowski wrote:
On Tue, Jul 27, 2010 at 4:36 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
Hi all!
I am possibly interested in doing work on this, even if not in the immediate future.
Well, talk is cheap. Would be great to see some work done of course.
Well, I think it can be useful to state intentions and interest. At least for my projects I feel a difference when people express interest (even through negative feedback or broken code) versus when they are indifferent, not saying or doing anything. best, holger
![](https://secure.gravatar.com/avatar/e2ebb4713ff59479aa115031c104a324.jpg?s=120&d=mm&r=g)
A much shorter version of the Jython memory model can be found in my book: http://jythonpodcast.hostjava.net/jythonbook/en/1.0/Concurrency.html#python-... In general, I would think the coroutine mechanism being implemented by Lukas Stadler for the MLVM version of the hotspot JVM might be a good option; you can directly control the scheduling, although I don't think you can change the mapping from one hardware thread to another. (That's probably not interesting.) There are good results with JRuby; it would be nice to replicate them with Jython - and it should be really straightforward to do that. See http://classparser.blogspot.com/ - Jim On Tue, Jul 27, 2010 at 10:05 AM, holger krekel <holger@merlinux.eu> wrote:
On Tue, Jul 27, 2010 at 17:07 +0200, Maciej Fijalkowski wrote:
On Tue, Jul 27, 2010 at 4:36 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
Hi all!
I am possibly interested in doing work on this, even if not in the immediate future.
Well, talk is cheap. Would be great to see some work done of course.
Well, I think it can be useful to state intentions and interest. At least for my projects i feel a difference if people express interest (even through negative feedback or broken code) or if they are indifferent, not saying or doing anything.
best, holger _______________________________________________ pypy-dev@codespeak.net http://codespeak.net/mailman/listinfo/pypy-dev
![](https://secure.gravatar.com/avatar/bfc96d2a02d9113edb992eb96c205c5a.jpg?s=120&d=mm&r=g)
Truly concurrent threads (depends on implicit vs explicit shared memory) might require a truly concurrent GC to achieve performance. This is work (although not as big as removing refcounting from CPython for example).
Are there detailed docs on why the Python GIL exists? ----------------------------------------------------- I don't mean trivial statements like "because of C extensions" or "because the interpreter can't handle it". It may be possible that my particular usage would not require the GIL. However, I won't know this until I can understand what threading problems the Python interpreter has that the GIL was meant to protect against. Is there detailed documentation about this anywhere that covers all the threading issues that the GIL was meant to solve?
The short answer is "yes". The long answer is that it's much easier to write interpreter assuming GIL is around. For fine-grained locking to work and be efficient, you would need:
* The forementioned locking, to ensure that it's not that easy to screw things up. I've wondered around the guarantees we need to offer to the programmer, and my guess was that Jython's memory model is similar. I've been concentrating on the dictionary of objects, on the assumption that lists and most other built-in structures should be locked by the programmer in case of concurrent modifications.
However, we don't want to require locking to support something like: Thread 1: obj.newmember=1; Thread 2: a = obj.oldmember;
Looking for Jython memory model on Google produces some garbage and then this document from Unladen Swallow: http://code.google.com/p/unladen-swallow/wiki/MemoryModel It implicitly agrees on what's above (since Jython and IronPython both use thread-safe dictionaries), and then delves into issues about allowed reorderings. However, it requires that even racy code does not make the interpreter crash.
I guess the main constraint is "interpreter should not crash" indeed.
* Possibly a JIT optimization that would remove some locking. Any more specific ideas on this?
Well, yes. Determining when an object is thread-local so you don't need to do any locking, even though it escapes (this is also "just work", since it has been done before).
* Some sort of concurrent GC (not specifically running in a separate thread, but having different pools of memory to allocate from)
Among all points, this seems the easiest design-wise. Having per-thread pools is nowadays standard, so it's _just_ work (as opposed to 'complicated design'). Parallel GCs become important just when lots of garbage must be reclaimed. A GC is called concurrent, rather than parallel, when it runs concurrently with the mutator, and this usually reduces both pause times and throughput, so you probably don't want this as default (it is useful for particular programs, such as heavily interactive programs or videogames, I guess), do you?
I guess I meant parallel then.
More details are here: http://www.ibm.com/developerworks/java/library/j-jtp11253/
The trick used in the (mostly) concurrent collector of Hotspot seems interesting: it uses two short-stop-the-world phases and lets the program run in between. I think I'll look for a paper on it.
Would be interested in that.
Cheers, -- Paolo Giarrusso - Ph.D. Student http://www.informatik.uni-marburg.de/~pgiarrusso/
![](https://secure.gravatar.com/avatar/a38348763429b04b47da75e9dbdd9298.jpg?s=120&d=mm&r=g)
On 07/27 11:48, Maciej Fijalkowski wrote:
Right now, no. But there are ways in which you can experiment. Truly concurrent threads (depends on implicit vs explicit shared memory) might require a truly concurrent GC to achieve performance. This is work (although not as big as removing refcounting from CPython for example).
Would starting to remove the GIL then be a useful project for someone (like me, for example) to undertake? It might be a good start to experimentation with other kinds of concurrency. I've been interested in Software Transactional Memory (http://en.wikipedia.org/wiki/Software_transactional_memory). -- Evan Cofsky <evan@theunixman.com>
![](https://secure.gravatar.com/avatar/bfc96d2a02d9113edb992eb96c205c5a.jpg?s=120&d=mm&r=g)
On Fri, Jul 30, 2010 at 9:36 PM, Evan Cofsky <evan@theunixman.com> wrote:
On 07/27 11:48, Maciej Fijalkowski wrote:
Right now, no. But there are ways in which you can experiment. Truly concurrent threads (depends on implicit vs explicit shared memory) might require a truly concurrent GC to achieve performance. This is work (although not as big as removing refcounting from CPython for example).
Would starting to remove the GIL then be a useful project for someone (like me, for example) to undertake? It might be a good start to experimentation with other kinds of concurrency. I've been interested in Software Transactional Memory (http://en.wikipedia.org/wiki/Software_transactional_memory).
-- Evan Cofsky <evan@theunixman.com>
I think removing the GIL is not a good place to start. It's far too complex without knowing the codebase (it's fairly complex even when you know the codebase). There are many related projects which are smaller in size and eventually might lead to having some idea of how to remove the GIL. If you're interested, come to #pypy on IRC to discuss. Cheers, fijal
![](https://secure.gravatar.com/avatar/a38348763429b04b47da75e9dbdd9298.jpg?s=120&d=mm&r=g)
On 07/30 21:40, Maciej Fijalkowski wrote:
If you're interested, come to #pypy on IRC to discuss.
Sounds reasonable enough. I'll hang out on #pypy and see what happens. Thanks -- Evan Cofsky <evan@tunixman.com>
![](https://secure.gravatar.com/avatar/9e08987176fbb4776b8665c17e735cb8.jpg?s=120&d=mm&r=g)
Hello Kevin, I don't know if it can be a solution to your problem, but for my Master's Thesis I'm working on making Stackless Python distributed. What I did is working but not complete, and I'm right now in the process of writing the thesis (in French, unfortunately). My code currently works with PyPy's "stackless" module only and uses some PyPy-specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter.
- Each tasklet has its own global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way.
- Distributed channels: all requests are known by all nodes using the channel.
- Distributed objects: when a reference is sent to a remote node, the object is not copied; a reference is created using PyPy's proxy object space.
- Automated dependency recovery when an object or a tasklet is loaded on another interpreter.
With a proper scheduler, many tasklets could be automatically spread across multiple interpreters to use multiple cores or multiple computers. A bit like the N:M threading model, where N lightweight threads/coroutines can be executed on M threads. The API is described here in French, but it's pretty straightforward: https://w3.mutehq.net/wiki/maitrise/API_DStackless The code is available here (just click on the Download link next to the trunk folder): https://w3.mutehq.net/websvn/wildchild/dstackless/trunk/ You need pypy-c built with --stackless. The code is a bit buggy right now though...
![](https://secure.gravatar.com/avatar/8b7229d275bd0e27e3e256ef4f317761.jpg?s=120&d=mm&r=g)
I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed.
It might be of use. Thanks for the heads up. I do have several questions: 1) Is it PyPy's stackless module or Stackless Python (stackless.com)? Or are they the same module? 2) Do you have a non-https version of the site or one with a publically signed certificate? P.S. You can send your reply over private email if you want, so as to not bother the list. :) Date: Wed, 28 Jul 2010 15:32:38 -0400 Subject: Re: [pypy-dev] pre-emptive micro-threads utilizing shared memory message passing? From: glavoie@gmail.com To: kevinar18@hotmail.com CC: pypy-dev@codespeak.net Hello Kevin, I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed. What I did is working but not complete and I'm right now in the process of writing the thesis (in french unfortunately). My code currently works with PyPy's "stackless" module onlyis and use some PyPy specific things. Here's what I added to Stackless: - Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter.- Each tasklet has its global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way. - Distributed channels: All requests are known by all nodes using the channel. - Distributed objets: When a reference is sent to a remote node, the object is not copied, a reference is created using PyPy's proxy object space. - Automated dependency recovery when an object or a tasklet is loaded on another interpreter With a proper scheduler, many tasklets could be automatically spread in multiple interpreters to use multiple cores or on multiple computers. A bit like the N:M threading model where N lightweight threads/coroutines can be executed on M threads. The API is described here in french but it's pretty straightforward:https://w3.mutehq.net/wiki/maitrise/API_DStackless The code is available here (Just click on the Download link next to the trunk folder):https://w3.mutehq.net/websvn/wildchild/dstackless/trunk/ You need pypy-c built with --stackless. The code is a bit buggy right now though...
![](https://secure.gravatar.com/avatar/8b7229d275bd0e27e3e256ef4f317761.jpg?s=120&d=mm&r=g)
Hello Kevin, I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed. What I did is working but not complete and I'm right now in the process of writing the thesis (in french unfortunately). My code currently works with PyPy's "stackless" module onlyis and use some PyPy specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter. - Each tasklet has its global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way. - Distributed channels: All requests are known by all nodes using the channel. - Distributed objets: When a reference is sent to a remote node, the object is not copied, a reference is created using PyPy's proxy object space. - Automated dependency recovery when an object or a tasklet is loaded on another interpreter
With a proper scheduler, many tasklets could be automatically spread in multiple interpreters to use multiple cores or on multiple computers. A bit like the N:M threading model where N lightweight threads/coroutines can be executed on M threads.
Was able to have a look at the API... If others don't mind my asking this on the mailing list:
* .send() and .receive() What type of data can you send and receive between the tasklets? Can you pass entire Python objects?
* .send() and .receive() memory model When you send data between tasklets (pass messages) or whatever you want to call it, how is this implemented under the hood? Does it use shared memory under the hood or does it involve a more costly copying of the data? I realize that if it is on another machine you have to copy the data, but what about between two threads? You mentioned PyPy's proxy object.... guess I'll need to read up on that.
![](https://secure.gravatar.com/avatar/9e08987176fbb4776b8665c17e735cb8.jpg?s=120&d=mm&r=g)
Sorry for the late answer, I was unavailable in the last few days. About send() and receive(), it depends on whether the communication is local or not. For a local communication, anything can be passed, since only the reference is sent. This is the base model for Stackless channels. For a remote communication (between two interpreters), any picklable object can be passed (a copy will then be made), and that includes channels and tasklets (for which a reference will automatically be created). The use of the PyPy proxy object space is to make remote communication more Stackless-like by passing objects by reference. If a ref_object is made, only a reference will be passed when a tasklet is moved or the object is sent on a channel. The object always resides where it was created. A move() operation will also be implemented on those objects so they can be moved around like tasklets. I hope it helps, Gabriel 2010/7/29 Kevin Ar18 <kevinar18@hotmail.com>
Hello Kevin, I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed. What I did is working but not complete and I'm right now in the process of writing the thesis (in french unfortunately). My code currently works with PyPy's "stackless" module onlyis and use some PyPy specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter. - Each tasklet has its global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way. - Distributed channels: All requests are known by all nodes using the channel. - Distributed objets: When a reference is sent to a remote node, the object is not copied, a reference is created using PyPy's proxy object space. - Automated dependency recovery when an object or a tasklet is loaded on another interpreter
With a proper scheduler, many tasklets could be automatically spread in multiple interpreters to use multiple cores or on multiple computers. A bit like the N:M threading model where N lightweight threads/coroutines can be executed on M threads.
Was able to have a look at the API... If others don't mind my asking this on the mailing list:
* .send() and .receive() What type of data can you send and receive between the tasklets? Can you pass entire Python objects?
* .send() and .receive() memory model When you send data between tasklets (pass messages) or whateve you want to call it, how is this implemented under the hood? Does it use shared memory under the hood or does it involve a more costly copying of the data? I realize that if it is on another machine you have to copy the data, but what about between two threads? You mentioned PyPy's proxy object.... guess I'll need to read up on that. _______________________________________________ pypy-dev@codespeak.net http://codespeak.net/mailman/listinfo/pypy-dev
-- Gabriel Lavoie glavoie@gmail.com
![](https://secure.gravatar.com/avatar/8b7229d275bd0e27e3e256ef4f317761.jpg?s=120&d=mm&r=g)
Note: Gabriel, do you think we should discuss this on another mailing list (or in private), as I'm not sure this is related to PyPy dev anymore?

Anyways, what are your future plans for the project? Is it just an experiment for school ... maybe in the hopes that others would maintain it if it was found to be interesting? ... are you planning actual future development, maintenance, and promotion of it yourself?

-----------

On a personal note... the concept has a lot of similarities to what I am exploring. However, I would have to make so many additional modifications. Perhaps you can give some thoughts on whether it would take me a long time to add such things? Some examples:

* Two additional message passing styles (in addition to your own)
Queues - multiple tasklets can push onto the queue, only one tasklet can pop.... multiple tasklets can access the property to find out if there is any data in the queue. Queues can be set to an infinite size or set with a max # of entries allowed.
Streams - I'm not sure of the exact name, but kind of like an infinite stream/buffer ... useful for passing infinite amounts of data. Only one tasklet can write/add data. Only one tasklet can read/extract data.

* Message passing
When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and whether they extract data from them or write to them (they can only either extract or write to it as noted above). The tasklet's global namespace has access to these queues or streams and can extract or add data to them. In my case, I look at message passing from the perspective of the tasklet. A tasklet can either be assigned a certain number of "in ports" and a certain number of "out ports." In this case the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() part of a queue or stream.

* Scheduler
For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler.

------------

On another note, I am curious how you handled the issue of "nested" objects. Consider send() and receive() that you use to pass objects around in your project. Am I correct in that these objects cannot contain references outside of themselves? Also, how do you handle extracting out of the tree and making sure there are no references outside the object? For example, consider the following object, where "->" means it has a reference to that object:

Object 1 -> Object 2
Object 2 -> Object 3
Object 2 -> Object 4
Object 4 -> Object 2

Now, let's say I have a tasklet like the following:

.... -> incoming data = pointer/reference to Object 1

1. read incoming data (get Object 1 reference)
2. remove Object 3
3. send Object 3 to tasklet B
4. send Object 1 to tasklet C

Result:
tasklet B now has this object: pointer/reference to Object 1, which contains the following tree:
Object 1 -> Object 2
Object 2 -> Object 4
Object 4 -> Object 2
tasklet C now has this object: pointer/reference to Object 3, which contains the following tree:
Object 3

On the other hand, consider the following scenario:
1. read incoming data (get Object 1 reference)
2. remove Object 4
ERROR: this would not be possible, as it refers to Object 2
Sorry for the late answer, I was unavailable in the last few days.
About send() and receive(), it depends on if the communication is local or not. For a local communication, anything can be passed since only the reference is sent. This is the base model for Stackless channels. For a remote communication (between two interpreters), any picklable object (a copy will then be made) and it includes channels and tasklets (for which a reference will automatically be created).
The use of the PyPy proxy object space is to make remote communication more Stackless like by passing object by reference. If a ref_object is made, only a reference will be passed when a tasklet is moved or the object is sent on a channel. The object always resides where it was created. A move() operation will also be implemented on those objects so they can be moved around like tasklets.
I hope it helps,
Gabriel
2010/7/29 Kevin Ar18>
Hello Kevin, I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed. What I did is working but not complete and I'm right now in the process of writing the thesis (in french unfortunately). My code currently works with PyPy's "stackless" module onlyis and use some PyPy specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter. - Each tasklet has its global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way. - Distributed channels: All requests are known by all nodes using the channel. - Distributed objets: When a reference is sent to a remote node, the object is not copied, a reference is created using PyPy's proxy object space. - Automated dependency recovery when an object or a tasklet is loaded on another interpreter
With a proper scheduler, many tasklets could be automatically spread in multiple interpreters to use multiple cores or on multiple computers. A bit like the N:M threading model where N lightweight threads/coroutines can be executed on M threads.
Was able to have a look at the API... If others don't mind my asking this on the mailing list:
* .send() and .receive() What type of data can you send and receive between the tasklets? Can you pass entire Python objects?
* .send() and .receive() memory model When you send data between tasklets (pass messages) or whateve you want to call it, how is this implemented under the hood? Does it use shared memory under the hood or does it involve a more costly copying of the data? I realize that if it is on another machine you have to copy the data, but what about between two threads? You mentioned PyPy's proxy object.... guess I'll need to read up on that. _______________________________________________ pypy-dev@codespeak.net http://codespeak.net/mailman/listinfo/pypy-dev
-- Gabriel Lavoie glavoie@gmail.com
![](https://secure.gravatar.com/avatar/9e08987176fbb4776b8665c17e735cb8.jpg?s=120&d=mm&r=g)
I don't mind replying on the mailing list, unless it annoys someone? Maybe some people could be interested in this discussion. You have a lot of questions! :) My answers are inline. 2010/8/5 Kevin Ar18 <kevinar18@hotmail.com>
Note: Gabriel, do you think we should discuss this on another mailing list (or in private) as I'm not sure this related to PyPy dev anymore?
Anywyas, what are your future plans for the project? Is it just an experiment for school ... maybe in the hopes that others would maintaining it if it was found to be interesting? ... are you planning actual future development, maintenance, promotion of it yourself?
Based on the interest and time I and other people will have, I plan to debug this as much as possible. If people are interested in joining in after my thesis, I'll be more than open to welcoming them into the project. Right now, I'm writing my report and I'm also looking for a job. I won't have much time to touch the code again before next month, when I'll prepare it for my presentation, along with a lot of examples and use cases.
-----------
On a personal note... the concept has a lot of similarities to what I am exploring. However, I would have to make so many additional modifications. Perhaps you can give some thoughts on whether it would take me a long time to add such things?
Alright, my plan was to make all the needed lower-level constructs that can be used to build more complex things. For example, a mix of tasklets and sync channels could be wrapped in an API to create async channels. I know this is far from complete, and I have a few ideas on how it could be improved in the future, but it's currently not needed for my project. For now, the idea was to stay as close as possible to standard Stackless Python and only add the needed APIs and functionalities to support distributing tasklets between multiple interpreters.
Some examples:
* Two additional message passing styles (in addition to your own) Queues - multiple tasklets can push onto queue, only one tasklet can pop.... multiple tasklets can access the property to find out if there is any data in the queue. Queues can be set to an infite size or set with a max # of entries allowed.
This could easily be implemented using a standard channel and by starting multiple tasklets to send data. With some helper methods on a channel it could be possible to know how many tasklets are waiting to send their data. A channel already has a built-in queue for send/receive requests. This queue contains a list of all tasklets waiting for a send/receive operation. Tasklets are supposed to be lightweight enough to support something like this.
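A rough sketch of that suggestion, assuming the stackless module API (channel(), tasklet(), run()) as found in Stackless Python and PyPy's stackless module: several producer tasklets send into one channel while a single consumer pops, which is roughly the queue behaviour being asked for.

```python
import stackless

ch = stackless.channel()

def producer(n):
    ch.send(("item", n))            # blocks until a receiver takes the value

def consumer(count):
    for _ in range(count):
        print(ch.receive())

for i in range(3):
    stackless.tasklet(producer)(i)  # three tasklets queue up on the channel
stackless.tasklet(consumer)(3)
stackless.run()
```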
Streams - I'm not sure of the exact name, but kind of like an infinite stream/buffer ... useful for passing infinite amounts of data. Only one tasklet can write/add data. Only one tasklet can read/extract data.
Like a UNIX pipe()? Async? Again, some code wrapping standard channels could be used for this.
* Message passing When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and whether they extract data from them or write to them (they can only either extract or write to it as noted above). The tasklet's global namespace has access to these queues or streams and can extract or add data to them.
In my case, I look at message passing from the perspective of the tasklet. A tasklet can either be assigned a certain number of "in ports" and a certain number of "out ports." In this case the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() part of a queue or stream.
Sorry, I don't really understand what you're trying to explain here. Maybe an example could be helpful? :)
* Scheduler For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler.
Couldn't all those ports (channels) be read one at a time, then the processing could be done? I don't exactly see the need to play with the scheduler. Channels are blocking. A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available.
------------ On another note, I am curious how you handled the issue of "nested" objects. Consider send() and receive() that you use to pass objects around in your project. Am I correct in that these objects cannot contain references outside of themselves? Also, how do you handle extracting out of the tree and making sure there are not references outside the object?
Right now, I did not really dig too far into this problem. With a local communication, a reference to the object is sent through a channel. The receiver tasklet will have the same access to the object and all its sub-objects as the sender tasklet. For remote communications, pickling is involved. The object to send must be picklable. This excludes any I/O object unless the programmer creates their own pickling protocol for those. A copy of the whole object tree will then be made. Sometimes it's good (small objects), sometimes it's bad (really complex, big objects, I/O objects, etc.). This is why I added the concept of ref_object() using PyPy's proxy object space. For such objects, a proxy can be made and only a reference object will be sent to the remote side. This object will have the same type as the original object, but all operations will be forwarded to the host node. All replies will also be wrapped by proxies when sent back to the remote reference object. The only case where a proxy object is not created is with atomic types (string, int, float, etc.). It's useless for those because they are immutable anyway; a remote access to those would introduce useless latency. With ref_object(), the object tree always stays on the initial node. A move() operation will also be added to those ref_object()s to be able to move them between interpreters if needed.
For example, consider the following object, where "->" means it has a reference to that object
Object 1 -> Object 2
Object 2 -> Object 3
Object 2 -> Object 4
Object 4 -> Object 2
Now, let's say I have a tasklet like the following:
.... -> incoming data = pointer/reference to Object 1
1. read incoming data (get Object 1 reference) 2. remove Object 3 3. send Object 3 to tasklet B 4. send Object 1 to tasklet C
Result: tasklet B now has this object: pointer/reference to Object 1, which contains the following tree:
Object 1 -> Object 2
Object 2 -> Object 4 Object 4 -> Object 2
tasklet C now has this object: pointer/reference to Object 3, which contains the following tree: Object 3
I think you swapped tasklet B and tasklet C for the end result! ;)
On the other hand, consider the following scenario:
1. read incoming data (get Object 1 reference) 2. remove Object 4 ERROR: this would not be possible, as it refers to Object 2
Why isn't it possible? By removing "Object 4" I guess you mean removing this link: Object 2 -> Object 4? This is the only way Object 4 could be removed.
Sorry for the late answer, I was unavailable in the last few days.
About send() and receive(), it depends on if the communication is local or not. For a local communication, anything can be passed since only the reference is sent. This is the base model for Stackless channels. For a remote communication (between two interpreters), any picklable object (a copy will then be made) and it includes channels and tasklets (for which a reference will automatically be created).
The use of the PyPy proxy object space is to make remote communication more Stackless like by passing object by reference. If a ref_object is made, only a reference will be passed when a tasklet is moved or the object is sent on a channel. The object always resides where it was created. A move() operation will also be implemented on those objects so they can be moved around like tasklets.
I hope it helps,
Gabriel
2010/7/29 Kevin Ar18>
Hello Kevin, I don't know if it can be a solution to your problem but for my Master Thesis I'm working on making Stackless Python distributed. What I did is working but not complete and I'm right now in the process of writing the thesis (in french unfortunately). My code currently works with PyPy's "stackless" module onlyis and use some PyPy specific things. Here's what I added to Stackless:
- Possibility to move tasklets easily (ref_tasklet.move(node_id)). A node is an instance of an interpreter. - Each tasklet has its global namespace (to avoid sharing of data). The state is also easier to move to another interpreter this way. - Distributed channels: All requests are known by all nodes using the channel. - Distributed objets: When a reference is sent to a remote node, the object is not copied, a reference is created using PyPy's proxy object space. - Automated dependency recovery when an object or a tasklet is loaded on another interpreter
With a proper scheduler, many tasklets could be automatically spread in multiple interpreters to use multiple cores or on multiple computers. A bit like the N:M threading model where N lightweight threads/coroutines can be executed on M threads.
Was able to have a look at the API... If others don't mind my asking this on the mailing list:
* .send() and .receive() What type of data can you send and receive between the tasklets? Can you pass entire Python objects?
* .send() and .receive() memory model When you send data between tasklets (pass messages) or whateve you want to call it, how is this implemented under the hood? Does it use shared memory under the hood or does it involve a more costly copying of the data? I realize that if it is on another machine you have to copy the data, but what about between two threads? You mentioned PyPy's proxy object.... guess I'll need to read up on that. _______________________________________________ pypy-dev@codespeak.net http://codespeak.net/mailman/listinfo/pypy-dev
-- Gabriel Lavoie glavoie@gmail.com
By the way, if you come to #pypy on FreeNode, I'm WildChild! I'm always there though not always available. I'm in the EST timezone (UTC-5). See ya, Gabriel -- Gabriel Lavoie glavoie@gmail.com
![](https://secure.gravatar.com/avatar/8b7229d275bd0e27e3e256ef4f317761.jpg?s=120&d=mm&r=g)
Sorry for not getting back to you sooner. I don't mind replying to the mailing list unless it annoys someone? Maybe some people could be interested by this discussion. You have a lot of questions! :) My answers are inline. * Message passing When you create a tasklet, you assign a set number of queues or streams to it (it can have many) and whether they extract data from them or write to them (they can only either extract or write to it as noted above). The tasklet's global namespace has access to these queues or streams and can extract or add data to them. In my case, I look at message passing from the perspective of the tasklet. A tasklet can either be assigned a certain number of "in ports" and a certain number of "out ports." In this case the "in ports" are the .read() end of a queue or stream and the "out ports" are the .send() part of a queue or stream. Sorry, I don't really understand what you're trying to explain here. Maybe an example could be helpful? :) * Scheduler For the scheduler, I would need to control when a tasklet runs. Currently, I am thinking that I would look at all the "in ports" that a tasklet has and make sure each one has some data. Only then would the tasklet be scheduled to run by the scheduler. Couldn't all those ports (channels) be read one at a time, then the processing could be done? I don't exactly see the need to play with the scheduler. Channels are blocking. A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available. http://www.jpaulmorrison.com/fbp/concepts.htm Figure 3.6 and Figure 3.7 are a good example. Let's say Figure 3.7 is the tasklet (the one with a local namespace and no access to global memory or memory in other tasklets). IN, ACC, REJ are pointers to a shared memory location (from an implementation standpoint). IN, ACC, REJ are either a queue or buffer/pipe/stream (from the perspective of the programmer). The tasklet can only read/extract data from IN. The tasklet can only write to ACC and REJ.
Couldn't all those ports (channels) be read one at a time, then the processing could be done? Not sure exactly what you mean, but as shown in Figure 3.7, different parts of the code will read or write to different ports at different times. A tasklet will be anyway unscheduled when it tries to read on a channel in which no data is available. Good idea. If there's no data to read, the tasklet can yield. ... but I need to know when the tasklet can be put back into the scheduler queue
Then again, I don't know how I will want to do the scheduler... and would like the low level primitives to explore different styles. Anyways, at this point, I guess this whole discussion is not that important. I should probably make something simpler for now just to try things out. Then maybe I'll know if I want to even bother working on something better. However, if you would like me to keep you up to date, I can contact you via email a few months from now. (Let me know and I'll give you a different email to use).
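A small illustrative sketch of the Figure 3.6/3.7 style box discussed above, with invented names and plain queue.Queue objects standing in for the IN/ACC/REJ ports: the box reads records from IN and routes each one to ACC or REJ, and a None sentinel marks the end of the stream.

```python
import queue

def filter_box(IN, ACC, REJ, predicate):
    while True:
        record = IN.get()           # blocks until the upstream box sends something
        if record is None:          # end-of-stream sentinel
            ACC.put(None)
            REJ.put(None)
            break
        (ACC if predicate(record) else REJ).put(record)
```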
participants (10)
- Alex Gaynor
- Evan Cofsky
- Gabriel Lavoie
- holger krekel
- Jim Baker
- Kevin Ar18
- Maciej Fijalkowski
- Michael Sparks
- Paolo Giarrusso
- William Leslie