
I'm seeing a drop in performance of both multiprocessing- and subinterpreter-based runs in the 8-CPU case: throughput drops by about half despite there being enough logical CPUs, while the other cases scale quite well. Is there some issue with Python multiprocessing/subinterpreters running on the same logical core?

On 5/5/20 2:46 PM, Victor Stinner wrote:
Hi,
I wrote a "per-interpreter GIL" proof-of-concept: each interpreter gets its own GIL. I chose to benchmark a factorial function in pure Python to simulate a CPU-bound workload. I wrote the simplest possible function just to be able to run a benchmark, to check whether PEP 554 would be relevant.
The proof-of-concept shows that subinterpreters can make a CPU-bound workload faster than sequential execution or threads, and that they have the same speed as multiprocessing. The performance scales well with the number of CPUs.
Performance
===========
Factorial:
    n = 50_000
    fact = 1
    for i in range(1, n + 1):
        fact = fact * i
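Wrapped in a function (the name ``fact`` here is mine, not from the benchmark), the workload can be cross-checked against the stdlib:

```python
import math

def fact(n):
    # Pure-Python factorial loop: the CPU-bound workload from the benchmark.
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

# Sanity check against the C implementation in the stdlib.
assert fact(10) == math.factorial(10)
```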
2 CPUs:
Sequential:      1.00 sec +- 0.01 sec
Threads:         1.08 sec +- 0.01 sec
Multiprocessing:  529 ms +- 6 ms
Subinterpreters:  553 ms +- 6 ms
4 CPUs:
Sequential:      1.99 sec +- 0.01 sec
Threads:         3.15 sec +- 0.97 sec
Multiprocessing:  560 ms +- 12 ms
Subinterpreters:  583 ms +- 7 ms
8 CPUs:
Sequential:      4.01 sec +- 0.02 sec
Threads:         9.91 sec +- 0.54 sec
Multiprocessing: 1.02 sec +- 0.01 sec
Subinterpreters: 1.10 sec +- 0.00 sec
Benchmarks were run on my laptop, which has 8 logical CPUs (4 physical CPU cores with Hyper-Threading).
Threads are between 1.1x (2 CPUs) and 2.5x (8 CPUs) SLOWER than sequential execution.
Subinterpreters are between 1.8x (2 CPUs) and 3.6x (8 CPUs) FASTER than sequential execution.
Subinterpreters and multiprocessing have basically the same speed on this benchmark.
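The multiprocessing side of such a comparison can be sketched roughly like this (a minimal sketch, not the actual demo-pyperf.py; the worker count and workload size are placeholders):

```python
import math
from multiprocessing import Pool

def fact(n):
    # Same pure-Python CPU-bound loop as the benchmark code above.
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

if __name__ == "__main__":
    nworkers = 4  # placeholder: the benchmark used 2, 4 and 8
    with Pool(nworkers) as pool:
        # Each worker process runs the full factorial independently,
        # so the GIL of one process never blocks another.
        results = pool.map(fact, [50_000] * nworkers)
    assert all(r == math.factorial(50_000) for r in results)
```

Each process (or, with a per-interpreter GIL, each subinterpreter) runs the loop without contending for a single global lock, which is why both approaches scale with the CPU count while threads do not.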
See demo-pyperf.py attached to https://bugs.python.org/issue40512 for the code of the benchmark.
Implementation
==============
See https://bugs.python.org/issue40512 and related issues for the implementation. I already merged changes, but most code is disabled by default: a new special undocumented --with-experimental-isolated-subinterpreters build mode is required to test it.
To reproduce the benchmark, use::
    # up-to-date checkout of the Python master branch
    ./configure \
        --with-experimental-isolated-subinterpreters \
        --enable-optimizations \
        --with-lto
    make
    ./python demo-pyperf.py
Limits of subinterpreters design
================================
Subinterpreters have a few design limits:
* A Python object must not be shared between two interpreters.
* Each interpreter has a minimum memory footprint, since Python internal states and modules are duplicated.
* Others that I forgot :-)
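The first limit mirrors process isolation in multiprocessing. As an illustration (using a child process as a stand-in for a subinterpreter, since PEP 554 does not yet expose a stable Python-level API), state mutated on one side is invisible to the other:

```python
import multiprocessing as mp

counter = 0  # parent-side state

def bump():
    # Runs in the child process and mutates only the child's copy.
    global counter
    counter += 1

if __name__ == "__main__":
    p = mp.Process(target=bump)
    p.start()
    p.join()
    # The parent's counter is unchanged: no Python object was shared.
    assert counter == 0
```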
Incomplete implementation
=========================
My proof-of-concept is just good enough to compute factorial with the code that I wrote above :-) Any other code is very likely to crash in various funny ways.
I added a few "#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS" for the proof-of-concept. Most are temporary workarounds until some parts of the code are modified to become compatible with subinterpreters, like tuple free lists or Unicode interned strings.
Right now, some state is still shared between subinterpreters: the None and True singletons, for example, but also statically allocated types. Avoiding shared state should further improve performance.
See https://bugs.python.org/issue40512 for the current status and a list of tasks.
Most of these tasks are already tracked in Eric Snow's "Multi Core Python" project: https://github.com/ericsnowcurrently/multi-core-python/issues
Victor
--
Night gathers, and now my watch begins. It shall not end until my death.