This sounds like a significant milestone!

Is there some kind of optimized communication possible yet between subinterpreters? (Otherwise I still worry that it's no better than subprocesses -- and it could be worse because when one subinterpreter experiences a hard crash or runs out of memory, all others have to die with it.)

On Tue, May 5, 2020 at 2:54 PM Victor Stinner <> wrote:

I wrote a "per-interpreter GIL" proof-of-concept: each interpreter
gets its own GIL. I chose to benchmark a factorial function in pure
Python to simulate a CPU-bound workload. I wrote the simplest possible
function just to be able to run a benchmark, to check if the PEP 554
would be relevant.

The proof-of-concept proves that subinterpreters can make a CPU-bound
workload faster than sequential execution or threads, and that they
run at the same speed as multiprocessing. The performance scales well
with the number of CPUs.

The benchmarked function:

    n = 50_000
    fact = 1
    for i in range(1, n + 1):
        fact = fact * i

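The multiprocessing numbers below can be approximated with a short driver. This is a hedged sketch, not Victor's attached script: the `fact` and `bench` helpers and the worker count are my own, assuming one factorial per worker run in parallel.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def fact(n: int) -> int:
    # Same pure-Python factorial loop as above.
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result


def bench(nworkers: int, n: int = 50_000) -> float:
    # Run one factorial per worker in parallel and time the whole batch.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        list(pool.map(fact, [n] * nworkers))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"4 workers: {bench(4):.2f} sec")
```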
2 CPUs:

    Sequential: 1.00 sec +- 0.01 sec
    Threads: 1.08 sec +- 0.01 sec
    Multiprocessing: 529 ms +- 6 ms
    Subinterpreters: 553 ms +- 6 ms

4 CPUs:

    Sequential: 1.99 sec +- 0.01 sec
    Threads: 3.15 sec +- 0.97 sec
    Multiprocessing: 560 ms +- 12 ms
    Subinterpreters: 583 ms +- 7 ms

8 CPUs:

    Sequential: 4.01 sec +- 0.02 sec
    Threads: 9.91 sec +- 0.54 sec
    Multiprocessing: 1.02 sec +- 0.01 sec
    Subinterpreters: 1.10 sec +- 0.00 sec

Benchmarks run on my laptop which has 8 logical CPUs (4 physical CPU
cores with Hyper Threading).

Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than
sequential execution.

Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER
than sequential execution.
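The speedup figures follow directly from the 8-CPU table above:

```python
# 8-CPU timings from the table (seconds)
sequential = 4.01
threads = 9.91
subinterpreters = 1.10

assert round(threads / sequential, 1) == 2.5          # threads ~2.5x slower
assert round(sequential / subinterpreters, 1) == 3.6  # subinterpreters ~3.6x faster
```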

Subinterpreters and multiprocessing have basically the same speed on
this benchmark.

See the attachment for the code of the benchmark.


See the tracker issue and related issues for the
implementation. I already merged changes, but most code is disabled by
default: a new special undocumented
--with-experimental-isolated-subinterpreters build mode is required to
test it.

To reproduce the benchmark, use::

    # up to date checkout of Python master branch
    ./configure \
        --with-experimental-isolated-subinterpreters \
        --enable-optimizations
Limits of the subinterpreters design

Subinterpreters have a few design limits:

* A Python object must not be shared between two interpreters.
* Each interpreter has a minimum memory footprint, since Python
internal states and modules are duplicated.
* Others that I forgot :-)
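One consequence of the no-sharing rule: an object crossing an interpreter boundary must be copied or serialized, much as multiprocessing pickles arguments. A minimal illustration using plain pickle (an analogy, not the actual PEP 554 channel mechanism):

```python
import pickle

obj = {"n": 50_000, "fact": 1}
payload = pickle.dumps(obj)       # serialize: what would cross the boundary
restored = pickle.loads(payload)  # the receiving side gets a copy
assert restored == obj            # equal value...
assert restored is not obj        # ...but never the same Python object
```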

Incomplete implementation

My proof-of-concept is just good enough to compute factorial with the
code that I wrote above :-) Any other code is very likely to crash in
various funny ways.

Many changes in the proof-of-concept are temporary workarounds until
some parts of the code are modified to become compatible with
subinterpreters, like tuple free lists or Unicode interned strings.

Right now, some state is still shared between subinterpreters: the
None and True singletons, for example, but also statically allocated
types. Avoiding these shared states should further improve performance.
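The shared singletons are easy to observe in current CPython: every `None` in the process is literally the same object, which is exactly what a per-interpreter design would have to duplicate.

```python
# 'is' checks object identity, not equality.
a = None
b = None
assert a is b                    # both names point at the single shared None
assert id(True) == id(bool(1))   # True is also a process-wide singleton
```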

See the tracker issue for the current status and a list of tasks.

Most of these tasks are already tracked in Eric Snow's "Multi Core
Python" project.

Night gathers, and now my watch begins. It shall not end until my death.

--Guido van Rossum
Pronouns: he/him (why is my pronoun here?)