
I'm seeing a drop in performance of both multiprocessing- and subinterpreter-based runs in the 8-CPU case: throughput drops by about half despite there being enough logical CPUs, while the other cases scale quite well. Is there some issue with Python multiprocessing/subinterpreters running on the same logical core?

On 5/5/20 2:46 PM, Victor Stinner wrote:
Hi,
I wrote a "per-interpreter GIL" proof-of-concept: each interpreter gets its own GIL. I chose to benchmark a factorial function in pure Python to simulate a CPU-bound workload. I wrote the simplest possible function just to be able to run a benchmark, to check whether PEP 554 would be relevant.
The proof-of-concept shows that subinterpreters can make a CPU-bound workload faster than sequential execution or threads, and that they have the same speed as multiprocessing. The performance scales well with the number of CPUs.
Performance
===========
Factorial:
    n = 50_000
    fact = 1
    for i in range(1, n + 1):
        fact = fact * i
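Wrapped in a function (the name ``fact`` here is mine, not from the benchmark), the workload can be cross-checked against the stdlib:

```python
import math

def fact(n):
    # Pure-Python factorial loop: the CPU-bound workload from the benchmark.
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

# Sanity check against the C implementation in the stdlib.
assert fact(10) == math.factorial(10)
```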
2 CPUs:
Sequential:      1.00 sec +- 0.01 sec
Threads:         1.08 sec +- 0.01 sec
Multiprocessing:  529 ms +- 6 ms
Subinterpreters:  553 ms +- 6 ms
4 CPUs:
Sequential:      1.99 sec +- 0.01 sec
Threads:         3.15 sec +- 0.97 sec
Multiprocessing:  560 ms +- 12 ms
Subinterpreters:  583 ms +- 7 ms
8 CPUs:
Sequential:      4.01 sec +- 0.02 sec
Threads:         9.91 sec +- 0.54 sec
Multiprocessing: 1.02 sec +- 0.01 sec
Subinterpreters: 1.10 sec +- 0.00 sec
Benchmarks were run on my laptop, which has 8 logical CPUs (4 physical CPU cores with Hyper-Threading).
Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than sequential execution.
Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER than sequential execution.
Subinterpreters and multiprocessing have basically the same speed on this benchmark.
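The multiprocessing side of such a comparison can be sketched roughly like this (a minimal sketch, not the actual demo-pyperf.py; the worker count and workload size are placeholders):

```python
import math
from multiprocessing import Pool

def fact(n):
    # Same pure-Python CPU-bound loop as the benchmark code above.
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

if __name__ == "__main__":
    nworkers = 4  # placeholder: the benchmark used 2, 4 and 8
    with Pool(nworkers) as pool:
        # Each worker process runs the full factorial independently,
        # so the GIL of one process never blocks another.
        results = pool.map(fact, [50_000] * nworkers)
    assert all(r == math.factorial(50_000) for r in results)
```

Each process (or, with a per-interpreter GIL, each subinterpreter) runs the loop without contending for a single global lock, which is why both approaches scale with the CPU count while threads do not.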
See demo-pyperf.py attached to https://bugs.python.org/issue40512 for the code of the benchmark.
Implementation
==============
See https://bugs.python.org/issue40512 and related issues for the implementation. I already merged changes, but most code is disabled by default: a new special undocumented --with-experimental-isolated-subinterpreters build mode is required to test it.
To reproduce the benchmark, use::
    # up-to-date checkout of the Python master branch
    ./configure \
        --with-experimental-isolated-subinterpreters \
        --enable-optimizations \
        --with-lto
    make
    ./python demo-pyperf.py
Limits of subinterpreters design
================================
Subinterpreters have a few design limits:
* A Python object must not be shared between two interpreters.
* Each interpreter has a minimum memory footprint, since Python internal states and modules are duplicated.
* Others that I forgot :-)
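The first limit mirrors process isolation in multiprocessing. As an illustration (using a child process as a stand-in for a subinterpreter, since PEP 554 does not yet expose a stable Python-level API), state mutated on one side is invisible to the other:

```python
import multiprocessing as mp

counter = 0  # parent-side state

def bump():
    # Runs in the child process and mutates only the child's copy.
    global counter
    counter += 1

if __name__ == "__main__":
    p = mp.Process(target=bump)
    p.start()
    p.join()
    # The parent's counter is unchanged: no Python object was shared.
    assert counter == 0
```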
Incomplete implementation
=========================
My proof-of-concept is just good enough to compute factorial with the code that I wrote above :-) Any other code is very likely to crash in various funny ways.
I added a few "#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS" for the proof-of-concept. Most are temporary workarounds until some parts of the code are modified to become compatible with subinterpreters, like tuple free lists or Unicode interned strings.
Right now, some state is still shared between subinterpreters: the None and True singletons, for example, but also statically allocated types. Avoiding shared state should further improve performance.
See https://bugs.python.org/issue40512 for the current status and a list of tasks.
Most of these tasks are already tracked in Eric Snow's "Multi Core Python" project: https://github.com/ericsnowcurrently/multi-core-python/issues
Victor
--
Night gathers, and now my watch begins. It shall not end until my death.