[Python-ideas] multiprocessing IPC
Sturla Molden
sturla at molden.no
Sat Feb 11 00:36:15 CET 2012
On 10.02.2012 22:15, Mike Meyer wrote:
> In what way does the mmap module fail to provide your binary file
> interface?
The short answer is that BSD mmap creates an anonymous kernel object.
After working with multiprocessing for a while, one comes to the
conclusion that we really need named kernel objects.
Here are two simple fail cases for anonymous kernel objects:
- Process A spawns/forks process B.
- Process B creates an object, one of the attributes is a lock.
- Fail: This object cannot be communicated back to process A. B inherits
from A, A does not inherit from B.
- Process A spawns/forks a process pool.
- Process A creates an object, one of the attributes is a lock.
- Fail: This object cannot be communicated to the pool. The pool
workers do not inherit new handles from A after they have started.
All of multiprocessing's IPC classes suffer from this!
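A minimal sketch of the second fail case (the class name here is just
for illustration). Every pipe and queue in multiprocessing pickles its
payload, and pickling an anonymous lock outside of process creation is
refused:

import multiprocessing as mp
import pickle

class SharedThing(object):
    def __init__(self):
        self.lock = mp.Lock()   # anonymous kernel object

if __name__ == "__main__":
    thing = SharedThing()
    # RuntimeError: Lock objects should only be shared between
    # processes through inheritance
    pickle.dumps(thing)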
Solution:
Use named kernel objects for IPC, pickle the name.
I made a shared memory array for NumPy that works like this --
implemented by memory mapping from the paging file on Windows, System V
IPC on Linux. Underneath is an extension class that allocates a shared
memory buffer. When pickled it encodes the kernel name, not its
contents, and unpickling opens the object given its name.
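The core of the idea, as a rough sketch, looks like this -- using POSIX
shared memory on /dev/shm instead of System V IPC, with error handling
and reference counting (for unlinking the segment) left out:

import mmap
import os

class NamedSharedMemory(object):
    """Shared memory segment identified by a kernel name."""

    def __init__(self, name, size):
        self.name, self.size = name, size
        # A file under /dev/shm is a named shared memory segment on
        # Linux; any process that knows the name can open it.
        fd = os.open("/dev/shm/" + name, os.O_CREAT | os.O_RDWR)
        os.ftruncate(fd, size)
        self.buf = mmap.mmap(fd, size)
        os.close(fd)

    def __reduce__(self):
        # Pickle only (name, size) -- not the buffer contents.
        # Unpickling in another process re-opens the same segment.
        return (NamedSharedMemory, (self.name, self.size))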
There is another drawback too:
The speed of pickle. For example, sharing NumPy arrays with pickle is
no faster through shared memory: the overhead from pickle completely
dominates the time needed for the IPC. That is why I want a
type-specialized or a binary channel. Making one from the named shared
memory class I already have is a no-brainer.
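To see the scale of the problem, compare the time pickle spends
serializing a large array with the single raw copy a shared memory
transfer actually needs (a back-of-the-envelope sketch; the numbers
will vary with machine and Python version):

import pickle
import timeit

import numpy as np

a = np.zeros(10**7)   # ~80 MB of float64

# Serializing for a pipe or queue: pickle touches every byte, plus
# framing and allocation overhead.
t_pickle = timeit.timeit(lambda: pickle.dumps(a, -1), number=10) / 10

# What a shared memory transfer needs: at most one raw copy.
t_copy = timeit.timeit(lambda: a.copy(), number=10) / 10

print("pickle: %.4f s   raw copy: %.4f s" % (t_pickle, t_copy))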
So that is my other objection against multiprocessing.
That is:
1. Object sharing by handle inheritance fails when kernel objects must
be passed back to the parent process or to a process pool. We need IPC
objects that have a name in the kernel, so they can be created and
shared in retrospect.
2. IPC with multiprocessing is too slow due to pickle. We need something
that does not use pickle. (E.g. shared memory, but not by means of
mmap.) It might be that the pipe or socket in multiprocessing will do
this (I have not looked at it carefully enough), but they still don't
have a name in the kernel.
Proof of concept:
http://dl.dropbox.com/u/12464039/sharedmem-feb12-2009.zip
The dependency on Cython and NumPy should probably be removed, but
never mind that for now. The important parts are these:
sharedmemory_sysv.pyx (Linux)
sharedmemory_win.pyx and ntqueryobject.c (Windows)
Finally, I'd like to say that I think Python's standard lib should
support high-performance asynchronous I/O for concurrency. That is not
poll/select (on Windows it does not even work properly). Rather, I want
IOCP on Windows, epoll on Linux, and kqueue on Mac. (Yes, I know about
Twisted.) There should also be a requirement that it works with
multiprocessing. E.g. if we open a process pool, the processes should be
able to use the same IOCP. In other words some highly scalable
asynchronous I/O that works with multiprocessing.
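Some of the building blocks exist already; e.g. Linux epoll has been
exposed in the select module since Python 2.6. A minimal single-process
echo server, just to show the flavour (IOCP and kqueue need their own
code paths, and nothing here cooperates with a process pool yet):

import select
import socket

srv = socket.socket()
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 8888))
srv.listen(16)
srv.setblocking(False)

ep = select.epoll()
ep.register(srv.fileno(), select.EPOLLIN)
conns = {}

try:
    while True:
        for fd, events in ep.poll(1.0):
            if fd == srv.fileno():
                conn, _ = srv.accept()
                conn.setblocking(False)
                ep.register(conn.fileno(), select.EPOLLIN)
                conns[conn.fileno()] = conn
            elif events & select.EPOLLIN:
                data = conns[fd].recv(4096)
                if data:
                    conns[fd].send(data)   # echo; partial writes ignored
                else:
                    ep.unregister(fd)
                    conns.pop(fd).close()
finally:
    ep.close()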
So ... as far as I am concerned, the only things worth keeping in
multiprocessing are multiprocessing.Process and multiprocessing.Pool.
The rest doesn't do what we want.
Sturla