PoC: Subinterpreters 4x faster than sequential execution or threads on a CPU-bound workload
Hi, I wrote a "per-interpreter GIL" proof-of-concept: each interpreter gets its own GIL. I chose to benchmark a factorial function in pure Python to simulate a CPU-bound workload. I wrote the simplest possible function just to be able to run a benchmark, to check if the PEP 554 would be relevant. The proof-of-concept proves that subinterpreters can make a CPU-bound workload faster than sequential execution or threads and that they have the same speed than multiprocessing. The performance scales well with the number of CPUs. Performance =========== Factorial: n = 50_000 fact = 1 for i in range(1, n + 1): fact = fact * i 2 CPUs: Sequential: 1.00 sec +- 0.01 sec Threads: 1.08 sec +- 0.01 sec Multiprocessing: 529 ms +- 6 ms Subinterpreters: 553 ms +- 6 ms 4 CPUs: Sequential: 1.99 sec +- 0.01 sec Threads: 3.15 sec +- 0.97 sec Multiprocessing: 560 ms +- 12 ms Subinterpreters: 583 ms +- 7 ms 8 CPUs: Sequential: 4.01 sec +- 0.02 sec Threads: 9.91 sec +- 0.54 sec Multiprocessing: 1.02 sec +- 0.01 sec Subinterpreters: 1.10 sec +- 0.00 sec Benchmarks run on my laptop which has 8 logical CPUs (4 physical CPU cores with Hyper Threading). Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than sequential execution. Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER than sequential execution. Subinterpreters and multiprocessing have basically the same speed on this benchmark. See demo-pyperf.py attached to https://bugs.python.org/issue40512 for the code of the benchmark. Implementation ============== See https://bugs.python.org/issue40512 and related issues for the implementation. I already merged changes, but most code is disabled by default: a new special undocumented --with-experimental-isolated-subinterpreters build mode is required to test it. To reproduce the benchmark, use:: # up to date checkout of Python master branch ./configure \ --with-experimental-isolated-subinterpreters \ --enable-optimizations \ --with-lto make ./python demo-pyperf.py Limits of subinterpreters design ================================ Subinterpreters have a few design limits: * A Python object must not be shared between two interpreters. * Each interpreter has a minimum memory footprint, since Python internal states and modules are duplicated. * Others that I forgot :-) Incomplete implementation ========================= My proof-of-concept is just good enough to compute factorial with the code that I wrote above :-) Any other code is very likely to crash in various funny ways. I added a few "#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS" for the proof-of-concept. Most are temporary workarounds until some parts of the code are modified to become compatible with subinterpreters, like tuple free lists or Unicode interned strings. Right now, there are still some states which are shared between subinterpreters: like None and True singletons, but also statically allocated types. Avoid shared states should enhance performances. See https://bugs.python.org/issue40512 for the current status and a list of tasks. Most of these tasks are already tracked in Eric Snow's "Multi Core Python" project: https://github.com/ericsnowcurrently/multi-core-python/issues Victor -- Night gathers, and now my watch begins. It shall not end until my death.
Just to be clear, this is executing the **same** workload in parallel, **not** trying to parallelize factorial. E.g. the 8-CPU calculation is calculating 50,000! 8 separate times, not calculating 50,000! once by spreading the work across 8 CPUs. This measurement still shows parallel work, but now I'm really curious to see the other kind of measurement: how much faster a single calculation gets thanks to sub-interpreters. :) I also realize this is not optimized in any way, so being this close to multiprocessing already is very encouraging!
This sounds like a significant milestone!

Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
--
--Guido van Rossum (python.org/~guido)
On Tue, May 5, 2020 at 3:47 PM Guido van Rossum <guido@python.org> wrote:
This sounds like a significant milestone!
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
As far as I understand it, the subinterpreter folks have given up on optimized passing of objects, and are only hoping to do optimized (zero-copy) passing of raw memory buffers.

On my laptop, some rough measurements [1] suggest that simply piping bytes between processes goes at ~2.8 gigabytes/second, and that pickle/unpickle is ~10x slower than that. So that would suggest that once subinterpreters are fully optimized, they might provide a maximum ~10% speedup vs multiprocessing, for a program that's doing nothing except passing pickled objects back and forth.

Of course, any real program that's spawning parallel workers will presumably be designed so its workers spend most of their time doing work on that data, not just passing it back and forth. That makes a 10% speedup highly unrealistic; in real-world programs it will be much smaller.

So IIUC, subinterpreter communication is currently about the same speed as multiprocessing communication, and the plan is to keep it that way.

-n

[1] Of course there are a lot of assumptions in my quick back-of-the-envelope calculation: pickle speed depends on the details of the objects being pickled, there are other serialization formats, there are other IPC methods that might be faster but are more complicated (shared memory), the stdlib 'multiprocessing' library might not be as good as it could be (the above measurements are for an ideal multiprocessing library, I haven't tested the one we currently have in the stdlib), etc. So maybe there's some situation where subinterpreters look better. But I've been pointing out this issue to Eric et al for years and they haven't disputed it, so I guess they haven't found one yet.

--
Nathaniel J. Smith -- https://vorpus.org
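For reference, a rough sketch of that kind of back-of-the-envelope comparison (this is not Nathaniel's actual measurement script; the payload size, chunking and test object are arbitrary choices, and the numbers vary a lot between machines):

    import multiprocessing as mp
    import pickle
    import time

    PAYLOAD = b"x" * (64 * 1024 * 1024)   # 64 MiB of raw bytes
    CHUNK = 1024 * 1024

    def reader(conn, nbytes):
        # Drain nbytes from the pipe, one message at a time.
        got = 0
        while got < nbytes:
            got += len(conn.recv_bytes())

    def bench_pipe():
        parent, child = mp.Pipe()
        p = mp.Process(target=reader, args=(child, len(PAYLOAD)))
        p.start()
        view = memoryview(PAYLOAD)
        t0 = time.perf_counter()
        for off in range(0, len(PAYLOAD), CHUNK):
            parent.send_bytes(view[off:off + CHUNK])
        p.join()
        dt = time.perf_counter() - t0
        print(f"raw bytes over a pipe: {len(PAYLOAD) / dt / 1e9:.2f} GB/s")

    def bench_pickle():
        # Pickle + unpickle a boring structured object and report the
        # effective bandwidth relative to the pickled size.
        obj = list(range(1_000_000))
        t0 = time.perf_counter()
        data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.loads(data)
        dt = time.perf_counter() - t0
        print(f"pickle + unpickle: {len(data) / dt / 1e9:.2f} GB/s "
              f"({len(data) / 1e6:.1f} MB pickled)")

    if __name__ == "__main__":
        bench_pipe()
        bench_pickle()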
Main memory bus or cache contention? Integer execution ports full? Throttling? VTune is useful to find out where the bottleneck is; things like that tend to happen when you start loading every logical core.

On Tue, May 5, 2020 at 4:45 PM Joseph Jenne via Python-Dev <python-dev@python.org> wrote:
I'm seeing a drop in performance of both multiprocess and subinterpreter based runs in the 8-CPU case, where performance drops by about half despite having enough logical CPUs, while the other cases scale quite well. Is there some issue with python multiprocessing/subinterpreters on the same logical core?
Hi Nathaniel,

On Wed, May 6, 2020 at 04:00, Nathaniel Smith <njs@pobox.com> wrote:
As far as I understand it, the subinterpreter folks have given up on optimized passing of objects, and are only hoping to do optimized (zero-copy) passing of raw memory buffers.
I think that you misunderstood PEP 554. It's a bare minimum API, and the idea is to *extend* it later to have an efficient implementation of "shared objects".

--

IMO it should be easy to share *data* (an object's "content") between subinterpreters, but each interpreter should have its own PyObject which exposes the data at the Python level. See the PyObject as a proxy to the data.

It would badly hurt performance if a PyObject were shared by two interpreters: it would require locking or atomic variables for PyObject members and PyGC_Head members.

It seems like right now PEP 554 doesn't support sharing data, so this still has to be designed and implemented later.

Who owns the data? When can we release memory? Which interpreter releases the memory? I read somewhere that data is owned by the interpreter which allocates the memory, and that its memory would be released in the same interpreter.

How do we track data lifetime? I imagine a reference counter. When it reaches zero, the interpreter which allocated the data can release it "later" (it doesn't have to be done "immediately").

How do we lock the whole data, or a portion of it, to prevent data races? If the data doesn't contain any PyObject, it may be safe to allow concurrent writes, but readers should be prepared for inconsistencies depending on the access pattern. If two interpreters access separate parts of the data, we may allow lock-free access.

I don't think that we have to reinvent the wheel. threading, multiprocessing and asyncio already designed such APIs. We should design similar APIs and even simply reuse code.

My hope is that "synchronization" (in general; locks in particular) will be more efficient within the same process than synchronization between multiple processes.

--

I would be interested in a generic implementation of a "remote object": an empty proxy object which forwards all operations to a different interpreter. It will likely be inefficient, but it may be convenient for a start. If a method returns an object, a new proxy should be created. Simple scalar types like int and short strings may be serialized (copied).

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Okay, a picture is emerging. It sounds like GIL-free subinterpreters may one day shine because IPC is faster and simpler within one process than between multiple processes. This is not exactly what I got from PEP 554, but it is sufficient for me to have confidence in the project.
--
--Guido van Rossum (python.org/~guido)
On Wed, May 6, 2020 at 5:41 AM Victor Stinner <vstinner@python.org> wrote:
Hi Nathaniel,
On Wed, May 6, 2020 at 04:00, Nathaniel Smith <njs@pobox.com> wrote:
As far as I understand it, the subinterpreter folks have given up on optimized passing of objects, and are only hoping to do optimized (zero-copy) passing of raw memory buffers.
I think that you misunderstood the PEP 554. It's a bare minimum API, and the idea is to *extend* it later to have an efficient implementation of "shared objects".
No, I get this part :-)
IMO it should easy to share *data* (object "content") between subinterpreters, but each interpreter should have its own PyObject which exposes the data at the Python level. See the PyObject has a proxy to data.
So when you say "shared object" you mean that you're sharing a raw memory buffer, and then you're writing a Python object that stores its data inside that memory buffer instead of inside its __dict__: class MySharedObject: def __init__(self, shared_memview, shared_lock): self._shared_memview = shared_memview self._shared_lock = shared_lock @property def my_attr(self): with self._shared_lock: return struct.unpack_from(MY_ATTR_FORMAT, self._shared_memview, MY_ATTR_OFFSET)[0] @my_attr.setter def my_attr(self, new_value): with self._shared_lock: struct.pack_into(MY_ATTR_FORMAT, self._shared_memview, MY_ATTR_OFFSET, new_value) This is an interesting idea, but I think when most people say "sharing objects between subinterpreters", they mean being able to pass some pre-existing object between subinterpreters cheaply, while this requires defining custom objects with custom locking. So we should probably use different terms for them to avoid confusion :-). This is an interesting idea, and it's true that it's not considered in my post you're responding to. I was focusing on copying objects, not sharing objects on an ongoing basis. You can't implement this kind of "shared object" using a pipe/socket, because those create two independent copies of the data. But... if this is what you want, you can do the exact same thing with subprocesses too. OSes provide inter-process shared memory and inter-process locks. 'MySharedObject' above would work exactly the same. So I think the conclusion still holds: there aren't any plans to make IPC between subinterpreters meaningfully faster than IPC between subprocesses.
I don't think that we have to reinvent the wheel. threading, multiprocessing and asyncio already designed such APIs. We should to design similar APIs and even simply reuse code.
Or, we could simply *use* the code instead of using subinterpreters :-). (Or write new and better code, I feel like there's a lot of room for a modern 'multiprocessing' competitor.) The question I'm trying to figure out is what advantage subinterpreters give us over these proven technologies, and I'm still not seeing it.
My hope is that "synchronization" (in general, locks in specific) will be more efficient in the same process, than synchronization between multiple processes.
Hmm, I would be surprised by that – the locks in modern OSes are highly optimized, and designed to work across processes. For example, on Linux, futexes work across processes. Have you done any benchmarks?

Also btw, note that if you want to use async within your subinterpreters, then that rules out a lot of tools like regular locks, because they can't be integrated into an event loop. If your subinterpreters are using async, then you pretty much *have* to use full-fledged sockets or equivalent for synchronization.
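The uncontended half of that question is at least trivial to check with the stdlib; a crude sketch (this measures only single-process acquire/release overhead and says nothing about contended, cross-interpreter locking, which is the interesting case):

    import multiprocessing
    import threading
    import time

    def bench(lock, n=1_000_000):
        # Acquire and release an uncontended lock n times.
        t0 = time.perf_counter()
        for _ in range(n):
            with lock:
                pass
        return time.perf_counter() - t0

    if __name__ == "__main__":
        print("threading.Lock:       %.3f s" % bench(threading.Lock()))
        print("multiprocessing.Lock: %.3f s" % bench(multiprocessing.Lock()))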
I would be interested to have a generic implementation of "remote object": a empty proxy object which forward all operations to a different interpreter. It will likely be inefficient, but it may be convenient for a start. If a method returns an object, a new proxy should be created. Simple scalar types like int and short strings may be serialized (copied).
How would this be different from https://docs.python.org/3/library/multiprocessing.html#proxy-objects ?

How would you handle input arguments -- would those get proxied as well? Also, does this mean the other subinterpreter has to be running an event loop to process these incoming requests? Or is the idea that the other subinterpreter would process these inside a traditional Python thread, so users are exposed to all the classic shared-everything locking issues?

-n

--
Nathaniel J. Smith -- https://vorpus.org
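For context, the multiprocessing proxy objects linked above look like this in use: a Manager runs a server process, and every operation on a proxy is forwarded to it over a connection, with arguments pickled along the way.

    from multiprocessing import Manager

    if __name__ == "__main__":
        with Manager() as manager:
            shared = manager.list()   # proxy; the real list lives in the manager process
            shared.append(42)         # each call is a round trip to the server
            counts = manager.dict(hits=0)
            counts["hits"] += 1       # __getitem__ then __setitem__: two round trips, not atomic
            print(shared[:], counts.copy())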
On Tue, 5 May 2020 18:59:34 -0700 Nathaniel Smith <njs@pobox.com> wrote:
On Tue, May 5, 2020 at 3:47 PM Guido van Rossum <guido@python.org> wrote:
This sounds like a significant milestone!
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
As far as I understand it, the subinterpreter folks have given up on optimized passing of objects, and are only hoping to do optimized (zero-copy) passing of raw memory buffers.
Which would be useful already, especially with pickle out-of-band buffers.

Regards

Antoine.
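For readers who haven't used them: "out-of-band buffers" refers to the pickle protocol 5 feature from PEP 574. A minimal sketch of the idea (the payload and field names are arbitrary):

    import pickle

    # With protocol 5, large buffers can travel "out of band": instead of being
    # copied into the pickle byte stream, each buffer is handed to
    # buffer_callback, and the receiver supplies the buffers back to loads().
    payload = bytearray(b"x" * (16 * 1024 * 1024))
    obj = {"header": "frame 1", "data": pickle.PickleBuffer(payload)}

    buffers = []
    data = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    print(f"pickle stream: {len(data)} bytes; out-of-band buffers: {len(buffers)}")

    # A transport (pipe, shared memory, subinterpreter channel, ...) can move the
    # buffers however it likes -- zero-copy if it can -- and then reattach them:
    obj2 = pickle.loads(data, buffers=buffers)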
On Wed, May 6, 2020 at 10:03 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 May 2020 18:59:34 -0700 Nathaniel Smith <njs@pobox.com> wrote:
On Tue, May 5, 2020 at 3:47 PM Guido van Rossum <guido@python.org> wrote:
This sounds like a significant milestone!
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
As far as I understand it, the subinterpreter folks have given up on optimized passing of objects, and are only hoping to do optimized (zero-copy) passing of raw memory buffers.
Which would be useful already, especially with pickle out-of-band buffers.
Sure, zero cost is always better than some cost, I'm not denying that :-). What I'm trying to understand is whether the difference is meaningful enough to justify subinterpreters' increased complexity, fragility, and ecosystem breakage.

If your data is in large raw memory buffers to start with (like numpy arrays or arrow dataframes), then yeah, serialization costs are a smaller proportion of IPC costs. And out-of-band buffers are an elegant way of letting pickle users take advantage of that speedup while still using the familiar pickle API. Thanks for writing that PEP :-).

But when you're in the regime where you're working with large raw memory buffers, that's also the regime where inter-process shared memory becomes really efficient. Hence projects like Ray/Plasma [1], which exist today, and even work for sharing data across languages and across multi-machine clusters. And the pickle out-of-band buffer API is general enough to work with shared memory too.

And even if you can't quite manage zero-copy, and have to settle for one-copy... optimized raw data copying is just *really fast*, similar to memory access speeds. And CPU-bound, big-data-crunching apps are by definition going to access that memory and do stuff with it that's much more expensive than a single memcpy. So I still have trouble figuring out how skipping a single memcpy will make subinterpreters significantly faster than subprocesses in any real-world scenario.

-n

[1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
    https://github.com/ray-project/ray

--
Nathaniel J. Smith -- https://vorpus.org
On Wed, May 6, 2020 at 12:36 PM Nathaniel Smith <njs@pobox.com> wrote:
Sure, zero cost is always better than some cost, I'm not denying that :-). What I'm trying to understand is whether the difference is meaningful enough to justify subinterpreters' increased complexity, fragility, and ecosystem breakage.
If your data is in large raw memory buffers to start with (like numpy arrays or arrow dataframes), then yeah, serialization costs are smaller proportion of IPC costs. And out-of-band buffers are an elegant way of letting pickle users take advantage of that speedup while still using the familiar pickle API. Thanks for writing that PEP :-).
But when you're in the regime where you're working with large raw memory buffers, then that's also the regime where inter-process shared-memory becomes really efficient. Hence projects like Ray/Plasma [1], which exist today, and even work for sharing data across languages and across multi-machine clusters. And the pickle out-of-band buffer API is general enough to work with shared memory too.
And even if you can't quite manage zero-copy, and have to settle for one-copy... optimized raw data copying is just *really fast*, similar to memory access speeds. And CPU-bound, big-data-crunching apps are by definition going to access that memory and do stuff with it that's much more expensive than a single memcpy. So I still have trouble figuring out how skipping a single memcpy will make subinterpreters significantly faster that subprocesses in any real-world scenario.
While large object copies are fairly fast -- I wouldn't say trivial; a gigabyte copy will introduce noticeable lag when you're processing enough of them -- the flip side of having large objects is that you want to avoid having so many copies that you run into memory pressure and the dreaded swapping.

A multiprocessing engine that's fully parallel, where every fork takes chunks of data and does everything needed to them, won't gain much from zero-copy as long as memory limits aren't hit. But a pipeline of processing would involve many copies, especially if you have a central dispatch thread that passes things from stage to stage. This is a big deal where stages may take longer or run slower at any time, especially in low-latency applications like video conferencing, where dispatch needs the flexibility to skip steps or add extra workers to shove a frame out the door, and using signals to tell separate processes to do so adds latency and overhead.

Not that I'm recommending someone go out and make a pure Python videoconferencing unit right now, but it's a use case I'm familiar with. (Since I use Python to test new ideas before converting them into C++.)
On Thu, May 7, 2020 at 2:50 AM Emily Bowman <silverbacknet@gmail.com> wrote:
While large object copies are fairly fast -- I wouldn't say trivial, a gigabyte copy will introduce noticeable lag when processing enough of them -- the flip side of having large objects is that you want to avoid having so many copies that you run into memory pressure and the dreaded swapping. A multiprocessing engine that's fully parallel, every fork takes chunks of data and does everything needed to them won't gain much from zero-copy as long as memory limits aren't hit. But a pipeline of processing would involve many copies, especially if you have a central dispatch thread that passes things from stage to stage. This is a big deal where stages may take longer or slower at any time, especially in low-latency applications, like video conferencing, where dispatch needs the flexibility to skip steps or add extra workers to shove a frame out the door, and using signals to interact with separate processes to tell them to do so is more latency and overhead.
Not that I'm recommending someone go out and make a pure Python videoconferencing unit right now, but it's a use case I'm familiar with. (Since I use Python to test new ideas before converting them into C++.)
Thanks for the insight, Emily (and everyone else). It's really helpful to get many different expert perspectives on the matter. I am definitely not an expert on big-data/high-performance use cases so, personally, I rely on folks like Nathaniel, Travis Oliphant, and yourself. The more, the better. :) Again, thanks! -eric
On 5 May 2020, at 23:40, Guido van Rossum <guido@python.org> wrote:
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
I had already concluded that this would not be useful for the use cases I have at work. The running out of memory and the hard crashes are what would stop me using this in production.

For my day job I work on a service that forks slave processes to handle I/O transactions. There is a monitor process that manages the total memory of all slaves and shuts down and replaces slaves when they use too much memory. Typically there are 60 to 100 slaves, with a core each to play with. The service runs 24x365.

Barry
Guido van Rossum <guido@python.org> wrote:
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
The use case that I have in mind with subinterpreters is Windows. With its lack of fork() and the way it spawns a fresh interpreter process it always feels a bit weird to use multiprocessing on Windows. Would it be faster and/or cleaner to start a new in-process subinterpreter instead?
On Fri, May 8, 2020 at 12:30 AM Sebastian Krause <sebastian@realpath.org> wrote:
Guido van Rossum <guido@python.org> wrote:
Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)
The use case that I have in mind with subinterpreters is Windows. With its lack of fork() and the way it spawns a fresh interpreter process it always feels a bit weird to use multiprocessing on Windows. Would it be faster and/or cleaner to start a new in-process subinterpreter instead?
Subinterpreters don't support fork() either -- they can't share any objects, so each one has to start from a blank slate and go through the Python startup sequence, re-import all modules from scratch, etc. Subinterpreters do get to skip the OS process spawn overhead, but most of the startup costs are the same.

-n

--
Nathaniel J. Smith -- https://vorpus.org
I'm seeing a drop in performance of both multiprocess and subinterpreter based runs in the 8-CPU case, where performance drops by about half despite having enough logical CPUs, while the other cases scale quite well. Is there some issue with python multiprocessing/subinterpreters on the same logical core?
On Tue, May 5, 2020 at 6:44 PM Joseph Jenne via Python-Dev <python-dev@python.org> wrote:
I'm seeing a drop in performance of both multiprocess and subinterpreter based runs in the 8-CPU case, where performance drops by about half despite having enough logical CPUs, while the other cases scale quite well. Is there some issue with python multiprocessing/subinterpreters on the same logical core?
This is not a Python issue at all, but a limitation of logical cores: they share the same physical resources, so they end up contending for the same execution units. Actually, it would probably be bad if Python *didn't* scale this way, because that would indicate that a Python process that should be running full-blast isn't actually utilizing all the physical resources of a CPU!

-Cody
On 06.05.20 00:46, Victor Stinner wrote:
Subinterpreters and multiprocessing have basically the same speed on this benchmark.
It does not look like subinterpreters have any advantage over multiprocessing.

I am wondering how much 3.9 will be slower than 3.8 in single-thread, single-interpreter mode after getting rid of all process-wide singletons and caches (Py_None, Py_True, Py_NotImplemented, small integers, strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc). Not to mention breaking binary compatibility.
On 06May2020 23:05, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 06.05.20 00:46, Victor Stinner wrote:
Subinterpreters and multiprocessing have basically the same speed on this benchmark.
It does not look like subinterpreters have any advantage over multiprocessing.
Maybe I'm missing something, but the example that comes to my mind is embedding a Python interpreter in an existing non-Python programme. My pet one-day-in-the-future example is mutt, whose macro language is... crude. And mutt is single threaded.

However, it is easy to envisage a monolithic multithreaded programme which has use for Python subinterpreters to work on the larger programme's in-memory data structures.

I haven't a real world example to hand, but that is the architectural situation where I'd consider multiprocessing to be inappropriate or infeasible, because the target data are all in the one memory space.

Cheers, Cameron Simpson <cs@cskk.id.au>
On Thu, 7 May 2020 at 01:34, Cameron Simpson <cs@cskk.id.au> wrote:
Maybe I'm missing something, but the example that comes to my mind is embedding a Python interpreter in an existing nonPython programme.
My pet one-day-in-the-future example is mutt, whose macro language is... crude. And mutt is single threaded.
However, it is easy to envisage a monolithic multithreaded programme which has use for Python subinterpreters to work on the larger programme's in-memory data structures.
I haven't a real world example to hand, but that is the architectural situation where I'd consider multiprocessing to be inappropriate or infeasible because the target data are all in the one memory space.
Vim would be a very good example of this. Vim has Python interpreter support, but multiprocessing would not be viable as you say. And from my recollection, experiments with threading didn't end well when I tried them :-) Paul
On Wed, May 6, 2020 at 1:14 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
On 06.05.20 00:46, Victor Stinner wrote:
Subinterpreters and multiprocessing have basically the same speed on this benchmark.
It does not look like subinterpreters have any advantage over multiprocessing.
There is not an implementation worthy of comparison at this point, no. I don't believe meaningful conclusions of that comparative nature can be drawn from the current work. We shouldn't be blocking any decision on reducing our existing tech debt around subinterpreters on a viable multi-core solution existing. There are benchmarks I could propose that I predict would show a different result even today, but I'm refraining because I believe such things to be a distraction.

I am wondering how much 3.9 will be slower than 3.8 in single-thread, single-interpreter mode after getting rid of all process-wide singletons and caches (Py_None, Py_True, Py_NotImplemented, small integers, strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc). Not to mention breaking binary compatibility.
I'm not worried, because it won't happen in 3.9. :) Nobody is seriously proposing that it be done in that manner. The existing example work Victor did here (thanks!) was a rapid prototype where the easiest approach to getting _something_ running in parallel as a demo was just to disable a bunch of shared global things, instead of also doing the much larger work to make those per-interpreter. That isn't how we'd likely ever actually land this kind of change.

Longer term we need to aim to get rid of process-global state by moving it into per-interpreter state. No matter what. This isn't something only needed by subinterpreters. Corralling everything into a per-interpreter state with proper initialization and finalization everywhere allows other nice things, like multiple independent interpreters in a process. Even sequentially (spin up, tear down, spin up, tear down, repeat...). We cannot reliably do that today without side effects such as duplicate initializations and resulting resource leaks, or worse.

Even if such per-interpreter state isolation, instead of per-process state, is never used for parallel execution, I still want to see it happen. Python already loses out to Lua because of this. Lua is easily embedded in a self-contained fashion. CPython has never been. This kind of work helps open up that world, instead of relegating us to only the single life-of-the-process long-lived language VM uses that we can serve today.

-gps
On Wed, May 6, 2020 at 22:10, Serhiy Storchaka <storchaka@gmail.com> wrote:
I am wondering how much 3.9 will be slower than 3.8 in single-thread, single-interpreter mode after getting rid of all process-wide singletons and caches (Py_None, Py_True, Py_NotImplemented, small integers, strings, tuples, _Py_IDENTIFIER, _PyArg_Parser, etc). Not to mention breaking binary compatibility.
There is no plan to remove caches like small integers, _Py_IDENTIFIER or _PyArg_Parser. The plan is to make these caches "per-interpreter". I already modified small integers to make them per-interpreter.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
participants (15)
- Antoine Pitrou
- Barry Scott
- Brett Cannon
- Cameron Simpson
- Cody Piersall
- Emily Bowman
- Eric Snow
- Gregory P. Smith
- Guido van Rossum
- Joseph Jenne
- Nathaniel Smith
- Paul Moore
- Sebastian Krause
- Serhiy Storchaka
- Victor Stinner