On Tue, May 5, 2020 at 2:54 PM Victor Stinner <vstinner@python.org> wrote:

Hi,

I wrote a "per-interpreter GIL" proof-of-concept: each interpreter
gets its own GIL. I chose to benchmark a factorial function in pure
Python to simulate a CPU-bound workload. I wrote the simplest possible
function just to be able to run a benchmark, to check whether PEP 554
would be relevant.

The proof-of-concept shows that subinterpreters can run a CPU-bound
workload faster than sequential execution or threads, and about as fast
as multiprocessing. Performance scales well with the number of CPUs.

Performance
===========

Factorial:

    n = 50_000
    fact = 1
    for i in range(1, n + 1):
        fact = fact * i

2 CPUs:

Sequential: 1.00 sec +- 0.01 sec
Threads: 1.08 sec +- 0.01 sec
Multiprocessing: 529 ms +- 6 ms
Subinterpreters: 553 ms +- 6 ms

4 CPUs:

Sequential: 1.99 sec +- 0.01 sec
Threads: 3.15 sec +- 0.97 sec
Multiprocessing: 560 ms +- 12 ms
Subinterpreters: 583 ms +- 7 ms

8 CPUs:

Sequential: 4.01 sec +- 0.02 sec
Threads: 9.91 sec +- 0.54 sec
Multiprocessing: 1.02 sec +- 0.01 sec
Subinterpreters: 1.10 sec +- 0.00 sec

The benchmarks were run on my laptop, which has 8 logical CPUs (4
physical CPU cores with Hyper-Threading).

Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than
sequential execution.

Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER
than sequential execution.

Subinterpreters and multiprocessing have basically the same speed on
this benchmark.
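
The ratios quoted above can be recomputed from the timings. A small
sketch, using only the numbers listed in this message (the dictionary
layout and the ratio_table name are just for illustration):

```python
# Timings in seconds, copied from the results above (ms converted to sec).
timings = {
    2: {"sequential": 1.00, "threads": 1.08,
        "multiprocessing": 0.529, "subinterpreters": 0.553},
    4: {"sequential": 1.99, "threads": 3.15,
        "multiprocessing": 0.560, "subinterpreters": 0.583},
    8: {"sequential": 4.01, "threads": 9.91,
        "multiprocessing": 1.02, "subinterpreters": 1.10},
}


def ratio_table(timings):
    """Compute slowdown/speedup ratios relative to sequential execution."""
    rows = {}
    for cpus, t in timings.items():
        rows[cpus] = {
            # > 1.0 means threads are slower than sequential
            "threads_slowdown": t["threads"] / t["sequential"],
            # > 1.0 means faster than sequential
            "subinterpreters_speedup": t["sequential"] / t["subinterpreters"],
            "multiprocessing_speedup": t["sequential"] / t["multiprocessing"],
        }
    return rows


ratios = ratio_table(timings)
for cpus, row in ratios.items():
    print(cpus, {name: round(value, 2) for name, value in row.items()})
```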

See demo-pyperf.py attached to https://bugs.python.org/issue40512 for
the code of the benchmark.
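
For readers without a patched build, here is a rough sketch of how the
sequential, threads and multiprocessing cases could be compared with the
standard library only. It is not demo-pyperf.py: it uses plain
time.perf_counter() instead of pyperf, it leaves out subinterpreters,
and the helper names (bit_length_of_factorial, run_pool, timed) are made
up for this example. It assumes one factorial task per CPU, which is
what the sequential timings above suggest.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def factorial(n):
    """Pure-Python factorial, the CPU-bound task from this message."""
    fact = 1
    for i in range(1, n + 1):
        fact = fact * i
    return fact


def bit_length_of_factorial(n):
    # Return a small integer instead of the huge factorial, so that the
    # process-based run does not spend its time pickling the result.
    return factorial(n).bit_length()


def run_sequential(n, tasks):
    """Run all tasks one after the other in the current thread."""
    return [bit_length_of_factorial(n) for _ in range(tasks)]


def run_pool(executor_cls, n, tasks):
    """Run one task per worker using the given executor class."""
    with executor_cls(max_workers=tasks) as pool:
        return list(pool.map(bit_length_of_factorial, [n] * tasks))


def timed(func, *args):
    """Return (elapsed seconds, result) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    n, tasks = 50_000, 2  # tasks plays the role of "2 CPUs" above
    cases = [
        ("Sequential", run_sequential, (n, tasks)),
        ("Threads", run_pool, (ThreadPoolExecutor, n, tasks)),
        ("Multiprocessing", run_pool, (ProcessPoolExecutor, n, tasks)),
    ]
    for label, func, args in cases:
        elapsed, _ = timed(func, *args)
        print(f"{label}: {elapsed:.2f} sec")
```

A single timed run like this includes worker startup cost and is much
noisier than pyperf, so the absolute numbers will not match the ones
above; only the ordering of the three cases is expected to be similar.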

Implementation
==============

See https://bugs.python.org/issue40512 and related issues for the
implementation. I have already merged some changes, but most of the code
is disabled by default: a new, undocumented
--with-experimental-isolated-subinterpreters build option is required to
test it.

To reproduce the benchmark, use::

    # up to date checkout of Python master branch
    ./configure \
        --with-experimental-isolated-subinterpreters \
        --enable-optimizations \
        --with-lto
    make
    ./python demo-pyperf.py
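
Before trusting the numbers, it may be worth checking that the ./python
binary was really configured with the experimental option. A minimal
sketch, assuming a Unix build where sysconfig records the configure
arguments in CONFIG_ARGS (the configured_with helper is hypothetical):

```python
import sysconfig

FLAG = "--with-experimental-isolated-subinterpreters"


def configured_with(flag):
    """Return True if the recorded ./configure arguments contain flag."""
    # CONFIG_ARGS is recorded by Unix builds; it may be missing on other
    # platforms, in which case this conservatively reports False.
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return flag in config_args


print(configured_with(FLAG))
```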

Limits of the subinterpreter design
===================================

Subinterpreters have a few design limits:

* A Python object must not be shared between two interpreters.
* Each interpreter has a minimum memory footprint, since Python's
  internal state and modules are duplicated.
* Others that I forgot :-)
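
One consequence of the first limit is that data has to be passed between
interpreters by value rather than by reference. The sketch below only
illustrates that copy semantics: it runs in a single interpreter, uses
no subinterpreter API, and assumes pickle as the serialization format
(send_by_value is a hypothetical helper, not part of any proposal):

```python
import pickle


def send_by_value(obj):
    """Serialize obj to bytes and rebuild an independent copy from them."""
    payload = pickle.dumps(obj)   # plain bytes, no reference to obj
    return pickle.loads(payload)  # what a receiving side would rebuild


original = {"n": 50_000, "results": [1, 2, 3]}
received = send_by_value(original)

# Mutating the received copy does not affect the sender's object.
received["results"].append(4)
print(original["results"], received["results"])
```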

Incomplete implementation
=========================

My proof-of-concept is just good enough to compute factorial with the
code that I wrote above :-) Any other code is very likely to crash in
various funny ways.

I added a few "#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS" for the
proof-of-concept. Most are temporary workarounds until some parts of
the code are modified to become compatible with subinterpreters, like
tuple free lists or Unicode interned strings.

Right now, some state is still shared between subinterpreters, such as
the None and True singletons and statically allocated types. Avoiding
shared state should improve performance.

See https://bugs.python.org/issue40512 for the current status and a
list of tasks.

Most of these tasks are already tracked in Eric Snow's "Multi Core
Python" project:
https://github.com/ericsnowcurrently/multi-core-python/issues

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/S5GZZCEREZLA2PEMTVFBCDM52H4JSENR/