
On Thu, 7 Sep 2017 21:08:48 -0700 Nathaniel Smith <njs@pobox.com> wrote:
> Awesome, thanks for bringing numbers into my wooly-headed theorizing :-).
>
> On my laptop I actually get a worse result from your benchmark: 531 ms
> for 100 MB == ~200 MB/s round-trip, or 400 MB/s one-way. So yeah,
> transferring data between processes with multiprocessing is slow.
>
> This is odd, though, because on the same machine, using socat to send
> 1 GiB between processes using a unix domain socket runs at 2 GB/s:
When using local communication, the raw IPC cost is often minor compared to whatever Python does with the data (parsing it, dispatching tasks around, etc.), except when the data is really huge. Local communication on Linux can easily reach several GB/s (even over TCP to localhost). Here is a Python script with reduced overhead to measure it -- as opposed to e.g. a full-fledged event loop:
https://gist.github.com/pitrou/d809618359915967ffc44b1ecfc2d2ad
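For reference, a minimal sketch along those lines (this is not the gist above, just an illustration): it times a one-way transfer of 100 MiB over an AF_UNIX socketpair between two processes. It uses os.fork, so it is POSIX-only, and the numbers will of course vary per machine:

```python
# Sketch: one-way bandwidth over an AF_UNIX socketpair, child -> parent.
import os
import socket
import time

CHUNK = 1 << 20          # 1 MiB per send
TOTAL = 100 * CHUNK      # 100 MiB overall

def measure():
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: stream TOTAL raw bytes, then exit without cleanup.
        parent_sock.close()
        buf = b"x" * CHUNK
        sent = 0
        while sent < TOTAL:
            child_sock.sendall(buf)
            sent += CHUNK
        child_sock.close()
        os._exit(0)
    # Parent: drain the socket and time the transfer.
    child_sock.close()
    received = 0
    start = time.perf_counter()
    while received < TOTAL:
        data = parent_sock.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    parent_sock.close()
    os.waitpid(pid, 0)
    return received / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    print(f"{measure():.2f} GB/s one-way")
```

Note this only measures raw byte throughput; it deliberately skips pickling and object construction, which is where multiprocessing spends much of its time.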
> I don't know why multiprocessing is so slow -- maybe there's a good
> reason, maybe not.
Be careful to measure actual bandwidth, not round-trip latency, however.
> But the reason isn't that IPC is intrinsically slow, and
> subinterpreters aren't going to automatically be 5x faster because
> they can use memcpy.
What could improve performance significantly would be to share objects without any form of marshalling; but it's not obvious that this is possible in the subinterpreters model *if* it also tries to remove the GIL.

You can see the cost of marshalling readily with concurrent.futures, by comparing ThreadPoolExecutor and ProcessPoolExecutor:
>>> import concurrent.futures as cf
>>> tp = cf.ThreadPoolExecutor(4)
>>> pp = cf.ProcessPoolExecutor(4)
>>> x = b"x" * (100 * 1024**2)
>>> def identity(x):
...     return x
...
>>> y = list(tp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> y = list(pp.map(identity, [x] * 10))  # warm up
>>> len(y)
10
>>> %timeit y = list(tp.map(identity, [x] * 10))
638 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit y = list(pp.map(identity, [x] * 10))
1.99 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
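As a rough sanity check on where the process-pool time goes (a sketch, not part of the measurements above), you can time a plain pickle round-trip of the same 100 MB payload; the process pool additionally has to push those bytes through a pipe in each direction, so pickling alone is a lower bound on its per-task overhead:

```python
# Sketch: time a pickle round-trip of the same 100 MB bytes object
# that the ProcessPoolExecutor has to serialize for every task.
import pickle
import time

x = b"x" * (100 * 1024**2)

start = time.perf_counter()
data = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
y = pickle.loads(data)
elapsed = time.perf_counter() - start

assert y == x
print(f"pickle round-trip of 100 MB: {elapsed * 1e3:.1f} ms")
```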
On this trivial case you're really gaining a lot using a thread pool...

Regards

Antoine.