Parallel processing with Python

About a year ago, I posted a scheme to comp.lang.python describing how to use isolated interpreters to circumvent the GIL on SMPs: http://groups.google.no/group/comp.lang.python/msg/0351c532aad97c5e?hl=no&dmode=source

In the following, an "appdomain" will be defined as a thread associated with a unique embedded Python interpreter. One interpreter per thread is how Tcl works. Erlang also uses isolated threads that only communicate through messages (as opposed to shared objects). Appdomains are also available in the .NET framework, and in Java as "Java isolates". They are potentially very useful as multicore CPUs become abundant: they allow one process to run an independent Python interpreter on each available CPU core.

In Python, "appdomains" can be created by embedding the Python interpreter multiple times in a process. For this to work, we have to make multiple copies of the Python DLL and rename them (e.g. Python25-0.dll, Python25-1.dll, Python25-2.dll, etc.); otherwise the dynamic loader will just return a handle to the already imported DLL. Since DLLs can be accessed with ctypes, we don't even have to write a line of C to do this: we can start up a Python interpreter and use ctypes to embed more interpreters into it, associating each interpreter with its own thread. ctypes takes care of releasing the GIL in the parent interpreter, so calls into these sub-interpreters become asynchronous. I had a mock-up of this scheme working.

Martin Löwis replied that he doubted this would work, and pointed out that Python extension libraries (.pyd files) are DLLs as well. They would only be imported once, and their global state would be shared between interpreters, producing havoc: http://groups.google.no/group/comp.lang.python/msg/0a7a22910c1d5bf5?hl=no&dmode=source

He was right, of course, but also wrong. In fact, I had already proven him wrong by importing a DLL multiple times. If it can be done for Python25.dll, it can be done for any other DLL - including .pyd files - in exactly the same way. What remains, then, is to change Python's dynamic loader to use the same "copy and import" scheme. This can be done either by changing Python's C code, or (at least on Windows) by redirecting the LoadLibrary API call from kernel32.dll to a custom DLL. Both are quite easy and require minimal C coding.

Thus it is quite easy to make multiple, independent Python interpreters live isolated lives in the same process. As opposed to multiple processes, they can communicate without involving any IPC. It would also be possible to design proxy objects allowing one interpreter access to an object in another. Immutable objects such as strings would be particularly easy to share.

This very simple scheme should allow parallel processing with Python similar to how it's done in Erlang, without the GIL getting in our way. At least on Windows this can be done without touching the CPython source at all. I am not sure about Linux, though; it may be necessary to patch the CPython source to make it work there.

Sturla Molden
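[Editor's note: a minimal, period-appropriate sketch of the "copy and import" trick described above, assuming Python 2.5 on Windows and that Python25.dll has been copied to Python25-0.dll next to the script. The file name and the embedded code string are illustrative, not from the original post.]

    # Hypothetical sketch: one "appdomain" = one thread + one renamed
    # copy of the Python DLL.  Assumes Python 2.5 on Windows.
    import ctypes
    import threading

    def run_in_appdomain(dll_name, code):
        # Loading the *renamed* copy gives a fresh, isolated interpreter;
        # loading Python25.dll itself would just return the handle of the
        # DLL this script is already running in.
        interp = ctypes.CDLL(dll_name)
        interp.Py_Initialize()
        # ctypes releases the parent's GIL around these calls, so the
        # appdomain runs concurrently with the parent interpreter.
        interp.PyRun_SimpleString(code)
        interp.Py_Finalize()

    t = threading.Thread(target=run_in_appdomain,
                         args=("Python25-0.dll",
                               "print 'hello from appdomain 0'"))
    t.start()
    t.join()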

On Wed, Feb 18, 2009 at 4:34 PM, Sturla Molden <sturla@molden.no> wrote:
To clarify:

* Erlang's modules/classes/functions are not first-class objects, so it doesn't need a copy of them. Python does, so each interpreter would have a memory footprint about the same as a true process.

* Any communication requires a serialize/copy/deserialize sequence. You don't need a full context switch, but it's still not cheap.

* It's probably not worth sharing even str objects. You'd need atomic refcounting and a hack in Py_TYPE to always give the local type, both of which would slow everything down.

The real use case here is when you have a large, existing library that you're not willing to modify to use a custom (shared memory) allocator. That library must have a large data set (too large to duplicate in each process, and not in an external database), must be multithreaded in a scalable way, yet be too performance sensitive for real IPC. Also, any data shared between interpreters must be part of that large, existing library, rather than Python objects. Finally, since each interpreter uses as much memory as a process, you must only need a small, fixed number of interpreters, preferably long running (or at least a thread pool).

If that describes your use case then I'm happy for you; go ahead and use this DLL trick.

-- Adam Olsen, aka Rhamphoryncus
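[Editor's note: for concreteness, this is the serialize/copy/deserialize cost Adam refers to; a small sketch of what every message between true processes (as in multiprocessing) entails. The payload is just an example.]

    import pickle

    # A message crossing a process boundary is flattened to bytes and
    # rebuilt on the other side, so it is copied at least twice even
    # when both sides live on the same machine.
    message = {"rows": list(range(100000))}   # illustrative payload
    wire = pickle.dumps(message)              # serialize + copy in the sender
    received = pickle.loads(wire)             # deserialize + copy in the receiver
    assert received == message and received is not message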

On 2/19/2009 3:34 AM, Adam Olsen wrote:
Yes, each interpreter will have the memory footprint of an interpreter. So the memory use would be about the same as with multiprocessing.
> * Any communication requires a serialize/copy/deserialize sequence.
No, it does not, and this is why embedded interpreters are better than multiple processes (cf. multiprocessing). Since the interpreters share virtual memory, many objects can be shared without any serialization. That is, C pointers will be valid in both interpreters, so it should in many cases be possible to pass a PyObject* from one interpreter to another. This kind of communication would be easiest to achieve with immutable objects. Another advantage is that there will be just one process to kill, whenever that is required. S.M.
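[Editor's note: a small illustration of the shared-address-space claim. This only shows that a raw pointer stays valid anywhere inside one process; the two-interpreter case is assumed to behave the same way, since both would map the same virtual memory.]

    import ctypes

    # A raw pointer is just an integer that is meaningful anywhere inside
    # one process.  Embedded interpreters share the process's virtual
    # memory, so an address produced in one denotes the same bytes in the
    # other -- unlike separate processes, where addresses are private.
    buf = ctypes.create_string_buffer(b"shared payload")
    addr = ctypes.addressof(buf)   # bare integer address, no serialization

    # "Receiving" side: rebuild a view of the same memory from the address.
    assert ctypes.string_at(addr) == b"shared payload"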

On Thu, Feb 19, 2009 at 2:53 AM, Sturla Molden <sturla@molden.no> wrote:
Only if you have an approach to GC that does not require locking. The current reference counting macros are not thread-safe so every INCREF and DECREF would have to be somehow protected by a lock or turned into an atomic operation. Recall that many frequently used objects, like None, small integers, and interned strings for common identifiers, are shared and constantly INCREFed and DECREFed. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
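[Editor's note: Guido's point, illustrated with an ordinary non-atomic counter standing in for ob_refcnt. An unprotected increment is a load/add/store sequence, so two truly concurrent threads can interleave the steps and lose updates, which for a refcount means leaks or premature frees. Behavior varies by CPython version; the numbers are illustrative.]

    import threading

    # Stand-in for an unprotected ob_refcnt: "count += 1" is not atomic.
    count = 0

    def incref_many(n):
        global count
        for _ in range(n):
            count += 1   # racy without a lock or an atomic instruction

    threads = [threading.Thread(target=incref_many, args=(100000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Typically prints less than 400000: lost INCREFs, i.e. objects
    # that would be freed while still in use.
    print(count, "of", 400000)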

On Thu, Feb 19, 2009 at 2:53 AM, Sturla Molden <sturla@molden.no> wrote:
Thanks for the info. I think this would be just a minor inconvenience. Sending a message in the form of PyObject *x from A to B could perhaps be solved like this:

Interpreter A:
  increfs immutable pyobj x
  acquires the GIL of interpreter B
  messages pyobject x to interpreter B
  releases the GIL of interpreter B

Interpreter B:
  creates a proxy object p for reading attributes of x
  increfs & decrefs p (refcounts for x or its attributes are not touched by p)
  when p is collected:
    acquires the GIL of interpreter A
    decrefs x
    releases the GIL of interpreter A

Synchronization of reference counting is obviously needed (and the GIL is great for that). But serialization of the whole object would be avoided. This would depend on immutability of the message object.

S.M.
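[Editor's note: a sketch of the handoff Sturla outlines, modeled inside one ordinary interpreter. The two Lock objects stand in for the two per-interpreter GILs, and the proxy's release step plays the role of the cross-interpreter DECREF. All names are illustrative.]

    import threading

    class FakeInterpreter:
        """Stand-in for an embedded interpreter: a 'GIL' plus a mailbox."""
        def __init__(self, name):
            self.name = name
            self.gil = threading.Lock()
            self.mailbox = []

    class Proxy:
        """Read-only handle to a message owned by another interpreter."""
        def __init__(self, obj, owner):
            self._obj = obj        # the foreign (immutable) object
            self._owner = owner    # interpreter whose GIL guards its refcount

        def read(self):
            return self._obj       # attribute reads only; never mutate

        def release(self):
            # "when p is collected": take the owner's GIL, then DECREF there.
            with self._owner.gil:
                self._obj = None   # stands in for Py_DECREF(x) under A's GIL

    A = FakeInterpreter("A")
    B = FakeInterpreter("B")

    def send(sender, receiver, obj):
        # Interpreter A: INCREF x, acquire B's GIL, deliver, release B's GIL.
        with receiver.gil:
            receiver.mailbox.append(Proxy(obj, sender))

    send(A, B, "immutable message")
    p = B.mailbox.pop()
    print(p.read())
    p.release()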

Sturla Molden wrote:
> Interpreter B: creates a proxy object p for reading attributes of x
This is not sufficient. When code running in interpreter B reads an attribute of x via the proxy, it will get a reference to some other object belonging to interpreter A.
> This would depend on immutability of the message object.
Immutability doesn't save you. Even immutable objects get their refcounts adjusted just like any other object. -- Greg

On Thu, Feb 19, 2009 at 3:53 AM, Sturla Molden <sturla@molden.no> wrote:
If you could directly use another interpreter's PyObject in the current interpreter then they wouldn't be separate interpreters. You'd need to apply it to the type objects too, and if you start sharing those you'll kill any performance advantage of this whole scheme. The realistic scenario is that you treat each interpreter as a monitor: you can call into another interpreter quite cheaply (release your GIL, set your current interpreter to be them, acquire their GIL). However, since you are only in one at any given point in time, you need to copy anything you want to transmit. To put it another way, your proxy objects can hold pointers to the other interpreter's objects, but you can't use them until you go back into that other interpreter. -- Adam Olsen, aka Rhamphoryncus
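[Editor's note: the monitor discipline Adam describes, sketched with the same lock-per-interpreter stand-ins as above. The names are hypothetical; real code would use the C API's GIL functions.]

    import threading
    from contextlib import contextmanager

    # Each interpreter is a monitor guarded by its own GIL.
    my_gil = threading.Lock()
    their_gil = threading.Lock()

    @contextmanager
    def call_into(other_gil, own_gil):
        own_gil.release()      # release your GIL...
        other_gil.acquire()    # ...and enter the other interpreter
        try:
            yield              # you are "inside" the other interpreter here
        finally:
            other_gil.release()
            own_gil.acquire()  # come back home before touching your own objects

    my_gil.acquire()           # we start out inside our own interpreter
    with call_into(their_gil, my_gil):
        # Only here may the other interpreter's objects be used; anything
        # we want to keep must be copied before leaving.
        pass
    my_gil.release()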
