FW: Would the following shared memory model be possible?
Would comments from a project using this approach in real systems be of interest/use/help? Whilst I didn't know about Morrison's FBP (Balzer's work predates him btw - don't listen to hype) I had heard of (and played with) Occam among other more influential things, and Kamaelia is a real tool. Also there is already a pre-existing FBP tool for Stackless, and then historically there's also MASCOT & friends. It
You brought up a lot of topics. I went ahead and sent you a private email. There's always lots of interesting things I can add to my list of things to learn about. :)
just looks to me that you're tying yourself up in knots over things that aren't problems, when there are some things which could be useful (in practice) & interesting in this space. The particular issue in this situation is that there is no way to make Kamaelia, FBP, or other concurrency concepts run in parallel (unless you are willing to accept lots of overhead like with the multiprocessing queues).
Since you have worked with Kamaelia code a lot... you understand a lot more about implementation details. Do you think the previous shared memory concept or something like it would let you make Kamaelia parallel? If not, can you think of any method that would let you make Kamaelia parallel?
On Thu, Jul 29, 2010 at 6:44 PM, Kevin Ar18 <kevinar18@hotmail.com> wrote:
You brought up a lot of topics. I went ahead and sent you a private email. There's always lots of interesting things I can add to my list of things to learn about. :)
Yes, there are lots of interesting things. I have a limited amount of time however (I should be in bed, it's very late here, but I do /try/ to reply to on-list mails), so cannot spoon-feed you.

Mailing me directly rather than a (relevant) list precludes you getting answers from someone other than me. Not being on lists also precludes you getting answers to questions by chance. Changing emails and names in email headers also makes keeping track of people hard...

(For example you asked off list last year about Kamaelia's license from a different email address. Since it wasn't searchable I completely forgot. You also asked all sorts of questions but didn't want the answers public, so I didn't reply. If instead you'd subscribed to the list, and asked there, you'd've found out that Kamaelia's license changed - to the Apache Software License v2 ...)

If I mention something you find interesting, please Google first and then ask publicly somewhere relevant. (The answer and question are then googleable, and you're doing the community a service IMO if you ask q's that way - if your question is somewhere relevant and shows you've already googled prior work as far as you can... People are always willing to help people who show willing to help themselves in my experience.)
just looks to me that you're tying yourself up in knots over things that aren't problems, when there are some things which could be useful (in practice) & interesting in this space. The particular issue in this situation is that there is no way to make Kamaelia, FBP, or other concurrency concepts run in parallel (unless you are willing to accept lots of overhead like with the multiprocessing queues).
Since you have worked with Kamaelia code a lot... you understand a lot more about implementation details. Do you think the previous shared memory concept or something like it would let you make Kamaelia parallel? If not, can you think of any method that would let you make Kamaelia parallel?
Kamaelia already CAN run components in parallel in different processes (has been able to do so for quite some time) or on different processors. Indeed, all you do is use a ProcessPipeline or ProcessGraphline rather than Pipeline or Graphline, and the components in the top level are spread across processes. I still view the code as experimental, but it does work, and when needed is very useful.

Kamaelia running on IronPython can run on separate processors sharing data efficiently (due to lack of GIL there) happily too. Threaded components there do that naturally - I don't use IronPython, but it does run on IronPython. On Windows this is easiest, though Mono works just as well.

I believe Jython also is GIL free, and Kamaelia's Axon runs there cleanly too. As a result, because Kamaelia is pure Python, it runs truly in parallel there too (based on hearing from people using Kamaelia on Jython).

CPython is the exception (and a rather big one at that). (PyPy has a choice IIUC)

Personally, I think if PyPy worked with generators better (which is why I keep an eye on PyPy) and cpyext was improved, it'd provide a really compelling platform for me. (I was rather gutted at EuroPython to hear that PyPy's generator support was still ... problematic)

Regarding the *efficiency* and *enforcement* of the approach taken, I feel you're barking up the wrong tree, but let's go there.

What approach does baseline (non-IronPython running) Kamaelia take for multi-process work?

For historical reasons, it builds on top of pprocess rather than the multiprocessing module. This means for interprocess communications objects are pickled before being sent over operating system pipes. This provides an obvious communications overhead - and this isn't really Kamaelia specific at this point.

However, shifting data from one CPU to another is expensive, and only worth doing in some circumstances.
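
To make the pickling overhead mentioned above concrete, here is a minimal sketch using only the Python standard library. It is not pprocess or Kamaelia code; the helper names (send_obj, recv_obj) and the length-prefix framing are assumptions made for illustration. It shows that every message is serialised to bytes, pushed through an OS pipe, and rebuilt as a brand new object on the other side.

```python
import os
import pickle
import struct

def send_obj(fd, obj):
    """Pickle obj and write it to the pipe with a 4-byte length prefix."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    os.write(fd, struct.pack("!I", len(payload)) + payload)
    return len(payload)

def _read_exact(fd, n):
    """Read exactly n bytes from the pipe (os.read may return fewer)."""
    chunks = []
    while n > 0:
        chunk = os.read(fd, n)
        if not chunk:
            raise EOFError("pipe closed early")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def recv_obj(fd):
    """Read one length-prefixed pickle from the pipe and rebuild the object."""
    (size,) = struct.unpack("!I", _read_exact(fd, 4))
    return pickle.loads(_read_exact(fd, size))

# Both ends live in one process here purely to keep the sketch short;
# in real use the write end would belong to another process.
read_fd, write_fd = os.pipe()
message = {"frame": 42, "tags": ("small", "few")}
sent_bytes = send_obj(write_fd, message)
received = recv_obj(read_fd)
os.close(read_fd)
os.close(write_fd)
```

The receiver ends up with an equal but distinct copy of the message, so the cost per message grows with the size of the pickled payload. The message is deliberately small so that a single os.write is enough.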
(Consider a machine with several physical CPUs - each has a local CPU cache, and the data needs to be transferred from one to another, which is partly why people worry about thread/CPU affinity etc)

Basically, if you can manage it, you don't want to shift data between CPUs, you want to partition the processing. ie you may want to start caring about the size of messages and number of messages going between processes. Sending small and few between processes is going to be preferable to sending large and many for throughput purposes.

In the case of small and few, the approach of pickling and sending across OS pipes isn't such a bad idea. It works.

If you do want to share data between CPUs, and it sounds like you do, then most OSs already provide a means of doing that - threads. The conventions people use for using threads are where things come unstuck, but as a mechanism, threads do generally work, and work well.

As well as channels/boxes, you can use an STM approach, such as that in Axon.STM ...

   * http://www.kamaelia.org/STM.html
   * http://code.google.com/p/kamaelia/source/browse/trunk/Code/Python/Bindings/S...

...which is logically very similar to version control for variables. A downside of STM (at least with this approach) however, is that for it to work, you need either copy on write semantics for objects, or full copying of objects or similar.

Personally I use a biological metaphor here, in that channels/boxes and components, and similar perform a similar function to axons and neurons in the body, and that STM is akin to the hormonal system for maintaining and controlling system state. (I modelled biological tree growth many moons ago)

Anyhow, coming back to threads, that brings us back to Python, and implementations with a GIL, and those without.

For implementations with a GIL, you then have a choice: do I choose to try and implement a memory model that _enforces_ data locality?
That is, if a piece of data is in use inside a single "process" or "thread" (from hereon I'll use "task" as a generic phrase), then trying to use it inside another causes a problem for the task attempting to breach the model.

In order to enforce this, I personally believe you'd need to use multiple processes, and only share data through dedicated code managing shared memory. You could of course do this outside user code. To do this you'd need an abstraction that made sense, and something like Stackless' channels or Kamaelia's (in/out) box model makes sense there. (The CELL API uses a mailbox metaphor as well for reference)

In that case, you have a choice. You either copy the data into shared memory, or you share the data in situ. The former gives you back precisely the same overhead previously described, while the latter fragments your memory (since you can no longer access it). You could also have compaction.

However, personally, I think any possible benefits here are outweighed by the costs and complexity.

The alternative is to _encourage_ data locality. That is, encourage the usage and sharing of data such that, whilst you could share data between tasks and cause corruption, the common way of using the system discourages such actions. In essence that's what I try to do in Kamaelia, and it seems to work. Specifically, the model says:

   * If I take a piece of data from an inbox, I own it and can do anything with it that I like. If you think of a physical piece of paper and I take it from an intray, then that really is the case.

   * If I put a piece of data in an outbox, I no longer own it and should not attempt to do anything more with it. Again, using a physical metaphor, and naming scheme helps here. In particular, if I put a piece of paper in the post, I can no longer modify it. How it gets to its recipient is not my concern either.

In practice this does actually work.
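
The two ownership rules above can be sketched in plain Python. This is an illustration of the convention only, not Kamaelia's or Axon's actual API: the Box class, the producer and annotator generators, and the tiny round-robin loop are all hypothetical names invented for this sketch. Nothing here enforces ownership; the code simply follows the convention.

```python
from collections import deque

class Box:
    """A one-way mailbox (hypothetical stand-in for an inbox/outbox link)."""
    def __init__(self):
        self._items = deque()

    def put(self, item):
        self._items.append(item)

    def take(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)

def producer(outbox, count):
    for n in range(count):
        page = {"page": n, "notes": []}  # created here, so owned here
        outbox.put(page)                 # posted: ownership ends at this point
        del page                         # by convention, never touch it again
        yield

def annotator(inbox, outbox):
    while True:
        if len(inbox):
            page = inbox.take()           # taken from the in-tray: now owned here
            page["notes"].append("seen")  # so mutating it is fine
            outbox.put(page)              # and now it belongs to whoever takes it
        yield

# A deliberately naive round-robin scheduler over the generators.
link, results = Box(), Box()
tasks = [producer(link, 3), annotator(link, results)]
for _ in range(10):
    for task in list(tasks):
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)

delivered = [results.take() for _ in range(len(results))]
```

Because each page is only ever referenced by one task at a time, the mutation in annotator cannot race with anything, even though no copy is made on send.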
If you add in immutable tuples, and immutable strings then it becomes a lot clearer how this can work.

Is there a risk here of accidental modification? Yes. However, the size and general simplicity of components tends to lead to such problems being picked up early. It also enables component level acceptance tests. (We tend to build small examples of usage, which in turn effectively form acceptance tests)

[ An alternative is to make the "send" primitive make a copy on send. That would be quite an overhead, and also limit the types of data you can send. ]

In practical terms, it works. (Stackless proves this as well IMO, since despite some differences, there's also lots of similarities)

The other question that arises, is "isn't the GIL a problem with threads?". Well, the answer to that really depends on what you're doing. David Beazley's talk on what happens on mixing different sorts of threads shows that it isn't ideal, and if you're hitting that behaviour, then actually switching to real processes makes sense. However if you're doing CPU intensive work inside a C extension which releases the GIL (eg numpy), then it's less of an issue in practice. Custom extensions can do the same.

So, for example, picking something which I know colleagues [1] at work do, you can use a DVS broadcast capture card to capture video frames, pass those between threads which are doing processing on them, and inside those threads use C extensions to process the data efficiently (since image processing does take time...), and those release the GIL boosting throughput.

[1] On this project : http://www.bbc.co.uk/rd/projects/2009/10/i3dlive.shtml

So, that makes it all sound great - ie things can, after various fashions, run in parallel on various versions of Python, to practical benefit. But obviously it could be improved. Personally, I think the project most likely to make a difference here is actually PyPy.
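
As an aside on the capture-then-process arrangement described above, here is a rough standard-library sketch of the shape of it. It is not the code used on the project mentioned in [1]: there is no capture card here, the "frames" are synthetic byte strings, and zlib.compress stands in for the image-processing C extension (as far as I know, CPython's zlib releases the GIL while compressing, but treat that as an assumption). The function names and the STOP sentinel are made up for the example.

```python
import queue
import threading
import zlib

STOP = object()  # sentinel marking the end of the stream

def capture(outbox, frames):
    """Stand-in for a capture thread: emits synthetic 'video frames'."""
    for i in range(frames):
        outbox.put(bytes([i % 256]) * 200_000)
    outbox.put(STOP)

def process(inbox, outbox):
    """Stand-in for a processing thread doing heavy work in a C extension."""
    while True:
        frame = inbox.get()
        if frame is STOP:
            outbox.put(STOP)
            return
        # The expensive part happens inside C code, not in the interpreter loop.
        outbox.put(len(zlib.compress(frame)))

raw = queue.Queue(maxsize=4)  # bounded, so capture cannot run far ahead
done = queue.Queue()
workers = [
    threading.Thread(target=capture, args=(raw, 8)),
    threading.Thread(target=process, args=(raw, done)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

sizes = []
while True:
    item = done.get()
    if item is STOP:
        break
    sizes.append(item)
```

The frames are handed over by reference through the queues, so nothing is pickled or copied between threads; the only requirement is that the capture side stops touching a frame once it has been put on the queue.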
Now, talk is very cheap, and easy, and I'm not likely to implement this, so I'll aim to be brief. Execution is hard.

In particular, what I think is most likely to be beneficial is something _like_ this:

Assume PyPy runs without a GIL. Then allow the creation of a green process. A green process is implemented using threads, but with data created on the heap such that it defaults to being marked private to the thread (ie ala thread local storage, but perhaps implemented slightly differently - via references from the thread local storage into the heap) rather than shared. Data shared between green processes (for channels or boxes) would "simply" be detagged as being owned by one thread, and passed to another.

In particular this would mean that you need a mechanism for doing this. Simply attempting to call another green process (or thread) from another with mutable data types would be sufficient to raise the equivalent of a segmentation fault.

Secondly, improve cpyext to the extent that each CPython extension gets its own version of the GIL. (ie each extension runs with its own logical runtime, and thinks that it has its own GIL which it can lock and release. In practice it's faked by the PyPy runtime.) This is essentially similar conceptually to creating green processes.

It's worth considering that the Linux kernel went through similar changes, in that in the 2.0 days there was a large single big lock, which was replaced by ever more granular locks. I personally think that since there are so many extensions that rely on the existence of the GIL simply waving a wand to get rid of it isn't likely. However logically providing a GIL per C-Extension may be plausible, and _may_ be sufficient. However, I don't know - it might well not - I've not looked at the code, and talk is cheap - execution is hard.

Hopefully the above (cheap :) comments are in some small way useful.

Regards,

Michael.