
On Thu, 7 Sep 2017 21:08:48 -0700 Nathaniel Smith <njs@pobox.com> wrote:
> Awesome, thanks for bringing numbers into my wooly-headed theorizing :-).
>
> On my laptop I actually get a worse result from your benchmark: 531 ms
> for 100 MB == ~200 MB/s round-trip, or 400 MB/s one-way. So yeah,
> transferring data between processes with multiprocessing is slow.
>
> This is odd, though, because on the same machine, using socat to send
> 1 GiB between processes using a unix domain socket runs at 2 GB/s:
When using local communication, the raw IPC cost is often minor compared to whatever Python does with the data (parsing it, dispatching tasks around, etc.), except when the data is really huge. Local communication on Linux can easily reach several GB/s (even over TCP to localhost). Here is a Python script with reduced overhead to measure it -- as opposed to e.g. a full-fledged event loop:
https://gist.github.com/pitrou/d809618359915967ffc44b1ecfc2d2ad
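For reference, a minimal sketch along those lines (this is not the gist above, just an illustration): it times a one-way transfer of 100 MiB over an AF_UNIX socketpair between two processes. It uses os.fork, so it is POSIX-only, and the numbers will of course vary per machine:

```python
# Sketch: one-way bandwidth over an AF_UNIX socketpair, child -> parent.
import os
import socket
import time

CHUNK = 1 << 20          # 1 MiB per send
TOTAL = 100 * CHUNK      # 100 MiB overall

def measure():
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: stream TOTAL raw bytes, then exit without cleanup.
        parent_sock.close()
        buf = b"x" * CHUNK
        sent = 0
        while sent < TOTAL:
            child_sock.sendall(buf)
            sent += CHUNK
        child_sock.close()
        os._exit(0)
    # Parent: drain the socket and time the transfer.
    child_sock.close()
    received = 0
    start = time.perf_counter()
    while received < TOTAL:
        data = parent_sock.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    parent_sock.close()
    os.waitpid(pid, 0)
    return received / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    print(f"{measure():.2f} GB/s one-way")
```

Note this only measures raw byte throughput; it deliberately skips pickling and object construction, which is where multiprocessing spends much of its time.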
> I don't know why multiprocessing is so slow -- maybe there's a good
> reason, maybe not.
Be careful to measure actual bandwidth, not round-trip latency, however.
> But the reason isn't that IPC is intrinsically slow, and
> subinterpreters aren't going to automatically be 5x faster because
> they can use memcpy.
What could improve performance significantly would be to share objects without any form of marshalling; but it's not obvious that this is possible in the subinterpreters model *if* it also tries to remove the GIL.

You can see the cost of marshalling readily with concurrent.futures, by comparing ThreadPoolExecutor and ProcessPoolExecutor:
>>> import concurrent.futures as cf
>>> tp = cf.ThreadPoolExecutor(4)
>>> pp = cf.ProcessPoolExecutor(4)
>>> x = b"x" * (100 * 1024**2)
>>> def identity(x):
...     return x
...
>>> y = list(tp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> y = list(pp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> %timeit y = list(tp.map(identity, [x] * 10))
638 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit y = list(pp.map(identity, [x] * 10))
1.99 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
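As a rough sanity check on where the process-pool time goes (a sketch, not part of the measurements above), you can time a plain pickle round-trip of the same 100 MB payload; the process pool additionally has to push those bytes through a pipe in each direction, so pickling alone is a lower bound on its per-task overhead:

```python
# Sketch: time a pickle round-trip of the same 100 MB bytes object
# that the ProcessPoolExecutor has to serialize for every task.
import pickle
import time

x = b"x" * (100 * 1024**2)

start = time.perf_counter()
data = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
y = pickle.loads(data)
elapsed = time.perf_counter() - start

assert y == x
print(f"pickle round-trip of 100 MB: {elapsed * 1e3:.1f} ms")
```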
On this trivial case you're really gaining a lot using a thread pool...

Regards

Antoine.