[pypy-dev] FW: Would the following shared memory model be possible?

Michael Sparks sparks.m at gmail.com
Sat Jul 31 03:08:49 CEST 2010


On Thu, Jul 29, 2010 at 6:44 PM, Kevin Ar18 <kevinar18 at hotmail.com> wrote:
> You brought up a lot of topics.  I went ahead and sent you a private email.
> There's always lots of interesting things I can add to my list of things to
> learn about. :)

Yes, there are lots of interesting things. However, I have a limited
amount of time (I should be in bed, it's very late here, but I do /try/
to reply to on-list mails), so I cannot spoon-feed you. Mailing me
directly rather than a (relevant) list prevents you from getting answers
from anyone other than me. Not being on lists also prevents you from
getting answers to questions by chance. Changing email addresses and
names in email headers also makes keeping track of people hard...

(For example you asked off list last year about Kamaelia's license
from a different email address. Since it wasn't searchable I
completely forgot. You also asked all sorts of questions but didn't
want the answers public, so I didn't reply. If instead you'd
subscribed to the list, and asked there, you'd've found out that
Kamaelia's license changed - to the Apache Software License v2 ...)

If I mention something you find interesting, please Google first and
then ask publicly somewhere relevant. (The question and answer are
then googleable, and you're doing the community a service IMO if you
ask questions that way - if your question is posted somewhere relevant
and shows you've already googled prior work as far as you can... In my
experience people are always willing to help those who show willing to
help themselves.)

>> just looks to me that you're tieing yourself up in knots over things
>> that aren't problems, when there are some things which could be useful
>> (in practice) & interesting in this space.
> The particular issue in this situation is that there is no way to make
> Kamaelia, FBP, or other concurrency concepts run in parallel (unless you are
> willing to accept lots of overhead like with the multiprocessing queues).
>
> Since you have worked with Kamaelia code a lot... you understand a lot more
> about implementation details.  Do you think the previous shared memory
> concept or something like it would let you make Kamaelia parallel?
> If not, can you think of any method that would let you make Kamaelia
> parallel?

Kamaelia already CAN run components in parallel in different processes
(has been able to do so for quite some time) or on different
processors. Indeed, all you do is use a ProcessPipeline or
ProcessGraphline rather than Pipeline or Graphline, and the components
in the top level are spread across processes. I still view that code as
experimental, but it does work, and when needed it is very useful.
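
Roughly, it looks like this (a minimal sketch - the import paths are
from memory and the multiprocess chassis has lived in experimental
namespaces, so check your Kamaelia checkout):

    # Minimal sketch. Import paths are from memory and may differ in your
    # Kamaelia version; the point is that ProcessPipeline is (near enough)
    # a drop-in replacement for Pipeline at the top level.
    from Kamaelia.Chassis.Pipeline import Pipeline
    from Kamaelia.Util.Console import ConsoleReader, ConsoleEchoer

    # Single process, cooperative scheduling:
    Pipeline(
        ConsoleReader(),
        ConsoleEchoer(),
    ).run()

    # Spread the same top-level components across OS processes instead
    # (path hedged - it has lived under experimental modules):
    # from Axon.experimental.Process import ProcessPipeline
    # ProcessPipeline(
    #     ConsoleReader(),
    #     ConsoleEchoer(),
    # ).run()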

Kamaelia running on IronPython can run on separate processors, sharing
data efficiently (due to the lack of a GIL there). Threaded components
there do that naturally - I don't use IronPython myself, but Kamaelia
does run on it. On Windows this is easiest, though Mono works just as
well.

I believe Jython is also GIL-free, and Kamaelia's Axon runs there
cleanly too. Because Kamaelia is pure Python, it therefore runs truly
in parallel there as well (based on reports from people using Kamaelia
on Jython). CPython is the exception (and a rather big one at that).
(PyPy has a choice, IIUC.)

Personally, I think that if PyPy handled generators better (which is
why I keep an eye on PyPy) and cpyext were improved, it'd provide a
really compelling platform for me. (I was rather gutted at EuroPython
to hear that PyPy's generator support was still ... problematic.)

Regarding the *efficiency* and *enforcement* of the approach taken, I
feel you're barking up the wrong tree, but let's go there.

What approach does baseline (non-IronPython) Kamaelia take for
multi-process work?

For historical reasons, it builds on top of pprocess rather than the
multiprocessing module. This means that for interprocess communication,
objects are pickled before being sent over operating system pipes.

This introduces an obvious communications overhead - and it isn't
really Kamaelia-specific at this point.

However, shifting data from one CPU to another is expensive, and only
worth doing in some circumstances. (Consider a machine with several
physical CPUs - each has a local CPU cache, and the data needs to be
transferred from one to another, which is partly why people worry
about thread/CPU affinity, etc.)

Basically, if you can manage it, you don't want to shift data between
CPUs, you want to partition the processing.

That is, you may want to start caring about the size and number of
messages going between processes. Sending few, small messages between
processes is going to be preferable to sending many, large ones for
throughput purposes.

In the case of few, small messages, the approach of pickling and sending
across OS pipes isn't such a bad idea. It works.
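
To make the mechanism concrete (this is illustrative only - it is not
pprocess's actual API, just the thing it sits on top of):

    # Illustrative only: pickle the object, push the bytes through an OS
    # pipe, and unpickle on the far side. (Unix-only, because of os.fork.)
    import os
    import pickle

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                                   # child process
        os.close(write_fd)
        with os.fdopen(read_fd, "rb") as pipe_in:
            message = pickle.load(pipe_in)
        print("child received:", message)
        os._exit(0)
    else:                                          # parent process
        os.close(read_fd)
        with os.fdopen(write_fd, "wb") as pipe_out:
            pickle.dump({"seq": 1, "payload": "x" * 1024}, pipe_out)
        os.waitpid(pid, 0)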

If you do want to share data between CPUs, and it sounds like you do,
then most OSs already provide a means of doing that: threads. The
conventions people use for threads are where things become unpicked,
but as a mechanism, threads do generally work, and work well.

As well as channels/boxes, you can use an STM approach, such as the one
in Axon.STM ...
    * http://www.kamaelia.org/STM.html
    * http://code.google.com/p/kamaelia/source/browse/trunk/Code/Python/Bindings/STM/

...which is logically very similar to version control for variables. A
downside of STM (at least with this approach), however, is that for it
to work you need either copy-on-write semantics for objects, full
copying of objects, or similar. Personally I use a biological metaphor
here: channels/boxes and components perform a role similar to axons and
neurons in the body, and STM is akin to the hormonal system for
maintaining and controlling system state. (I modelled biological tree
growth many moons ago.)
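
For flavour, the usage pattern looks roughly like this (written from
memory of the examples on the STM page above, so treat the exact names
as approximate and check that page):

    # Rough sketch from memory of the Axon.STM examples - method names
    # are approximate; see http://www.kamaelia.org/STM.html
    from Axon.STM import Store

    store = Store()
    account = store.usevar("account")    # "check out" a versioned value
    account.set(100)                     # update our local copy
    store.set("account", account)        # "check in"; conceptually this
                                         # fails if someone else committed
                                         # a newer version in the meantime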

Anyhow, coming back to threads, that brings us back to python, and
implementations with a GIL, and those without.

For implementations with a GIL, you then have a choice: do I choose to
try and implement a memory model that _enforces_ data locality? That
is, if a piece of data is in use inside a single "process" or "thread"
(from here on I'll use "task" as a generic term), then trying to use it
inside another causes a problem for the task attempting to breach the
model.

In order to enforce this, I personally believe you'd need to use
multiple processes, and only share data through dedicated code
managing shared memory. You could of course do this outside user code.
To do this you'd need an abstraction that made sense, and something
like Stackless's channels or Kamaelia's (in/out) box model makes sense
there. (For reference, the Cell API uses a mailbox metaphor as well.)

In that case, you have a choice: you either copy the data into shared
memory, or you share the data in situ. The former gives you back
precisely the same overhead previously described, while the latter
fragments your memory (since you can no longer access it). You could
also add compaction.

However, personally, I think any possible benefits here are outweighed
by the costs and complexity.

The alternative is to _encourage_ data locality. That is, encourage the
usage and sharing of data such that, whilst you could share data
between tasks and cause corruption, the common way of using the
system discourages such actions. In essence that's what I try to do in
Kamaelia, and it seems to work. Specifically, the model says:

    * If I take a piece of data from an inbox, I own it and can do anything
      with it that I like. If you think of a physical piece of paper and
      I take it from an intray, then that really is the case.

    * If I put a piece of data in an outbox, I no longer own it and should
      not attempt to do anything more with it. Again, using a physical
      metaphor and naming scheme helps here. In particular, if I put a
      piece of paper in the post, I can no longer modify it. How it gets
      to its recipient is not my concern either.

In practice this does actually work. If you add in immutable tuples
and immutable strings, then it becomes a lot clearer how this can work.
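
Concretely, a typical component follows that discipline along these
lines (a minimal sketch using Axon's component API):

    # Minimal sketch of the ownership discipline using Axon's component API.
    from Axon.Component import component

    class Uppercaser(component):
        def main(self):
            while True:
                while self.dataReady("inbox"):
                    msg = self.recv("inbox")      # taken from the in-tray: we own it
                    result = msg.upper()          # ...so we can do what we like with it
                    self.send(result, "outbox")   # posted: ownership handed over,
                                                  # we don't touch result again
                yield 1                           # hand control back to the scheduler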

Is there a risk here of accidental modification? Yes. However, the
size and general simplicity of components tends to lead to such
problems being picked up early. It also enables component-level
acceptance tests. (We tend to build small examples of usage, which in
turn effectively form acceptance tests.)

[ An alternative is to make the "send" primitive make a copy on send.
That would be quite an overhead, and also limit the types of data you
can send. ]

In practical terms, it works. (Stackless proves this as well IMO,
since despite some differences, there are also lots of similarities.)

The other question that arises is "isn't the GIL a problem with
threads?". Well, the answer to that really depends on what you're
doing. David Beazley's talk on what happens when mixing different sorts
of threads shows that it isn't ideal, and if you're hitting that
behaviour, then actually switching to real processes makes sense.
However, if you're doing CPU-intensive work inside a C extension which
releases the GIL (e.g. numpy), then it's less of an issue in practice.
Custom extensions can do the same.
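
As a rough sketch of that last point (assuming numpy is installed; many
of its heavy operations drop the GIL while the underlying C/BLAS code
runs):

    # Sketch: plain threads plus a GIL-releasing extension (numpy) can keep
    # several cores busy under CPython, because the GIL is dropped while the
    # underlying C/BLAS code crunches the numbers.
    import threading
    import numpy

    def crunch(seed):
        rng = numpy.random.RandomState(seed)
        a = rng.rand(1000, 1000)
        b = rng.rand(1000, 1000)
        print("sum:", numpy.dot(a, b).sum())   # GIL released inside dot()

    threads = [threading.Thread(target=crunch, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()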

So, for example, picking something which I know colleagues [1] at work
do: you can use a DVS broadcast capture card to capture video frames,
pass those between threads which are doing processing on them, and
inside those threads use C extensions to process the data efficiently
(since image processing does take time...), and those extensions
release the GIL, boosting throughput.

   [1] On this project : http://www.bbc.co.uk/rd/projects/2009/10/i3dlive.shtml

So, that makes it all sound great - i.e. things can, after various
fashions, run in parallel on various versions of Python, to practical
benefit. But obviously it could be improved.

Personally, I think the project most likely to make a difference here
is actually PyPy. Now, talk is very cheap and easy, and I'm not
likely to implement this myself, so I'll aim to be brief. Execution is
hard.

In particular, what I think is most likely to be beneficial is
something _like_ this:

Assume PyPy runs without a GIL. Then allow the creation of a green
process. A green process is implemented using threads, but with data
created on the heap such that it defaults to being marked private to
the thread (i.e. a la thread-local storage, but perhaps implemented
slightly differently - via references from the thread-local storage
into the heap) rather than shared. Sharing between green processes
(for channels or boxes) would "simply" mean data being detagged as
owned by one thread, and passed to another.

In particular, this would mean that you need a mechanism for doing this
hand-over. Simply attempting to call another green process (or thread)
from another with mutable data types would be sufficient to raise the
equivalent of a segmentation fault.
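
None of that exists today, but the semantics can be emulated (clumsily,
in user code) to make the idea concrete - the names below are mine, not
a PyPy API:

    # Clumsy user-space emulation of the proposed semantics, purely to
    # make the idea concrete - this is not a PyPy feature or API.
    import threading
    import queue

    class Owned(object):
        """Wraps a mutable value and tags it with an owning task (thread)."""
        def __init__(self, value):
            self.value = value
            self.owner = threading.get_ident()

        def get(self):
            if threading.get_ident() != self.owner:
                # the "equivalent of a segmentation fault"
                raise RuntimeError("data accessed from a non-owning task")
            return self.value

    class Channel(object):
        """Hand-over point: sending detags the data, receiving retags it."""
        def __init__(self):
            self._q = queue.Queue()

        def send(self, owned):
            owned.owner = None                    # sender gives up ownership
            self._q.put(owned)

        def receive(self):
            owned = self._q.get()
            owned.owner = threading.get_ident()   # receiver takes ownership
            return owned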

Secondly, improve cpyext to the extent that each CPython extension
gets its own version of the GIL. (i.e. each extension runs with its own
logical runtime, and thinks that it has its own GIL which it can lock
and release. In practice it's faked by the PyPy runtime.) This is
conceptually similar to creating green processes.

It's worth considering that the Linux kernel went through similar
changes: in the 2.0 days there was a single big kernel lock, which was
gradually replaced by ever more granular locks. Personally, I think
that since so many extensions rely on the existence of the GIL, simply
waving a wand to get rid of it isn't likely. However, logically
providing a GIL per C extension may be plausible, and _may_ be
sufficient.

However, I don't know - it might well not - I've not looked at the
code, and talk is cheap - execution is hard.

Hopefully the above (cheap :) comments are in some small way useful.

Regards,


Michael.


