[Python-ideas] PEP 554: Stdlib Module to Support Multiple Interpreters in Python Code

Antoine Pitrou solipsis at pitrou.net
Sun Sep 10 15:14:00 EDT 2017


On Thu, 7 Sep 2017 21:08:48 -0700
Nathaniel Smith <njs at pobox.com> wrote:
> 
> Awesome, thanks for bringing numbers into my wooly-headed theorizing :-).
> 
> On my laptop I actually get a worse result from your benchmark: 531 ms
> for 100 MB == ~200 MB/s round-trip, or 400 MB/s one-way. So yeah,
> transferring data between processes with multiprocessing is slow.
> 
> This is odd, though, because on the same machine, using socat to send
> 1 GiB between processes using a unix domain socket runs at 2 GB/s:

When using local communication, the raw IPC cost is often minor
compared to whatever Python does with the data (parsing it,
dispatching tasks around, etc.), except when the data is really huge.

Local communications on Linux can easily reach several GB/s (even using
TCP to localhost).  Here is a Python script with reduced overhead to
measure it -- as opposed to e.g. a full-fledged event loop:
https://gist.github.com/pitrou/d809618359915967ffc44b1ecfc2d2ad
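To give a feel for what such a reduced-overhead measurement looks like
(this is a minimal sketch in the same spirit, not the gist itself: it
streams bytes through a Unix socketpair between two threads and
reports throughput):

```python
import socket
import threading
import time

def measure_bandwidth(total=100 * 1024**2, chunk=1 << 20):
    """Send `total` bytes through a socketpair and return MB/s."""
    a, b = socket.socketpair()
    payload = b"x" * chunk

    def sender():
        sent = 0
        while sent < total:
            a.sendall(payload)
            sent += chunk
        a.close()  # EOF tells the receiver we are done

    t = threading.Thread(target=sender)
    start = time.perf_counter()
    t.start()
    received = 0
    while True:
        data = b.recv(1 << 20)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    t.join()
    b.close()
    return received / elapsed / 1e6  # MB/s

if __name__ == "__main__":
    print(f"{measure_bandwidth():.0f} MB/s")
```

Because there is no pickling, no event loop and no syscall beyond
send/recv, this mostly measures the kernel's copy cost.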

> I don't know why multiprocessing is so slow -- maybe there's a good
> reason, maybe not.

Be careful to measure actual bandwidth, not round-trip latency, however.
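The two are easy to conflate: a ping-pong of tiny messages measures
round-trip latency (dominated by wakeup/scheduling cost), while
streaming large buffers measures bandwidth. A minimal latency sketch
(hypothetical helper names, using a multiprocessing Pipe and a
sentinel to shut the echo child down):

```python
import time
from multiprocessing import Pipe, Process

def echo(conn):
    # Echo each message back until a None sentinel arrives.
    while True:
        msg = conn.recv()
        if msg is None:
            break
        conn.send(msg)
    conn.close()

def round_trip_latency(n=1000):
    """Average round-trip time in seconds for a tiny message."""
    parent, child = Pipe()
    p = Process(target=echo, args=(child,))
    p.start()
    child.close()
    start = time.perf_counter()
    for _ in range(n):
        parent.send(b"x")
        parent.recv()
    elapsed = time.perf_counter() - start
    parent.send(None)
    parent.close()
    p.join()
    return elapsed / n

if __name__ == "__main__":
    print(f"{round_trip_latency() * 1e6:.1f} us per round trip")
```

Dividing a message size by this number would badly underestimate the
achievable bandwidth, which is the caveat above.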

> But the reason isn't that IPC is intrinsically
> slow, and subinterpreters aren't going to automatically be 5x faster
> because they can use memcpy.

What could improve performance significantly would be to share objects
without any form of marshalling; but it's not obvious that this is
possible in the subinterpreters model *if* it also tries to remove the
GIL.
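For raw bytes at least, the stdlib later grew an illustration of what
"no marshalling" looks like: multiprocessing.shared_memory (added in
Python 3.8, well after this thread). A sketch, scaled down to 1 MiB:

```python
from multiprocessing import shared_memory

# Create a named 1 MiB block; other processes can attach to it by
# name, so the payload itself is never copied or pickled.
shm = shared_memory.SharedMemory(create=True, size=1 << 20)
shm.buf[:5] = b"hello"

# A second handle attaching by name, as another process would:
other = shared_memory.SharedMemory(name=shm.name)
assert bytes(other.buf[:5]) == b"hello"

other.close()
shm.close()
shm.unlink()  # free the backing segment
```

Only the (short) name crosses the process boundary; the data does not.
Sharing arbitrary Python objects this way is the hard part, since
their reference counts and internal pointers are not position- or
interpreter-independent.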

You can see it readily with concurrent.futures, when comparing
ThreadPoolExecutor and ProcessPoolExecutor:

>>> import concurrent.futures as cf
>>> tp = cf.ThreadPoolExecutor(4)
>>> pp = cf.ProcessPoolExecutor(4)
>>> x = b"x" * (100 * 1024**2)
>>> def identity(x):
...     return x
...
>>> y = list(tp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> y = list(pp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> %timeit y = list(tp.map(identity, [x] * 10))
638 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit y = list(pp.map(identity, [x] * 10))
1.99 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this trivial case you really gain a lot by using a thread pool...
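The session above relies on IPython's %timeit magic; the same
comparison can be reproduced as a plain script (a sketch with
hypothetical helper names, scaled down to 10 MB payloads to keep the
process-pool run short):

```python
import concurrent.futures as cf
import time

def identity(x):
    # Must be a module-level function so ProcessPoolExecutor can pickle it.
    return x

def bench(executor, data, repeat=3):
    """Best-of-`repeat` wall time for mapping identity over `data`."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        list(executor.map(identity, data))
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    x = b"x" * (10 * 1024**2)
    data = [x] * 10
    with cf.ThreadPoolExecutor(4) as tp:
        print(f"threads:   {bench(tp, data):.4f} s")
    with cf.ProcessPoolExecutor(4) as pp:
        print(f"processes: {bench(pp, data):.4f} s")
```

The thread pool passes the very same object around (a pointer copy
under the GIL), while the process pool pickles and unpickles every
payload, which is where the orders-of-magnitude gap comes from.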

Regards

Antoine.
