
Matěj Týč <matej.tyc@gmail.com> wrote:
On 17.5.2016 14:13, Sturla Molden wrote:
> - Parallel processing of HUGE data, and
> This is mainly a Windows problem, as copy-on-write fork() will solve this
> on any other platform. ...

That sounds interesting, could you elaborate on it a bit? Does it mean that if you pass the numpy array to the child process using a Queue, no significant amount of data will flow through it? Or should I not pass it using a Queue at all and just rely on inheritance? I assume that passing it as an argument to the Process class is the worst option, because it will be pickled and unpickled. Or maybe you are referring to modules such as joblib that use this functionality and expose only a nice interface? Finally, COW means that returning large arrays still involves moving data between processes, whereas the shared-memory approach has the workaround that the parent process can preallocate the result array, which the worker process can then write into.
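For concreteness, here is a minimal sketch of the three ways of handing a large array to a worker discussed above. It assumes a Unix-like system where multiprocessing uses the fork start method; the array size and the worker functions are made up for illustration.

import multiprocessing as mp
import numpy as np

big = np.random.random(10_000_000)   # ~80 MB, allocated before the workers start


def inherited_sum(k):
    # With the fork start method, 'big' is visible here through the
    # inherited (copy-on-write) address space: nothing is pickled and
    # nothing is copied as long as we only read it.
    return big[k::4].sum()


def queue_sum(inbox, outbox):
    # The array arriving here was pickled by the parent and unpickled by
    # the child, so the whole ~80 MB travelled through a pipe.
    arr = inbox.get()
    outbox.put(arr.sum())


if __name__ == "__main__":
    # 1) Rely on inheritance: only the small integer argument is pickled.
    with mp.Pool(4) as pool:
        print(sum(pool.map(inherited_sum, range(4))))

    # 2) Pass through a Queue: the full buffer is serialized and copied.
    inbox, outbox = mp.Queue(), mp.Queue()
    p = mp.Process(target=queue_sum, args=(inbox, outbox))
    p.start()
    inbox.put(big)
    print(outbox.get())
    p.join()

    # 3) Process(target=..., args=(big,)) gets pickled in the same way as
    #    the Queue variant, once per worker.

In other words, the Queue and the args variants pay the same pickling cost; it is fork inheritance that makes the scatter direction essentially free.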
> What this means is that shared memory is seldom useful for sharing huge data, even on Windows. It is only useful for this on Unix/Linux, where base addresses can stay the same. But on non-Windows platforms, COW will in 99.99% of the cases be sufficient, thus making shared memory superfluous anyway. We don't need shared memory to scatter large data on Linux, only fork.

I am actually quite comfortable with sharing numpy arrays only. It is a nice format for sharing large amounts of numbers, which is what I want and what many modules accept as input (e.g. the "shapely" module).
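For the gather direction, a sketch of the preallocation workaround mentioned above, again assuming the fork start method so that the children inherit the buffer; the RawArray/np.frombuffer pairing is just one common way to do it, and the sqrt computation is only a placeholder.

from multiprocessing import Process, sharedctypes
import numpy as np

N = 10_000_000
NWORKERS = 4

raw = sharedctypes.RawArray('d', N)              # shared, unlocked buffer of doubles
result = np.frombuffer(raw, dtype=np.float64)    # numpy view of that buffer
data = np.random.random(N)                       # large read-only input, shared via COW


def fill_chunk(k):
    # Each worker writes its slice of the result directly into the shared
    # buffer it inherited, so nothing has to be pickled on the way back.
    lo, hi = k * N // NWORKERS, (k + 1) * N // NWORKERS
    result[lo:hi] = np.sqrt(data[lo:hi])


if __name__ == "__main__":
    workers = [Process(target=fill_chunk, args=(k,)) for k in range(NWORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # The parent sees every worker's writes without any copying.
    assert np.allclose(result, np.sqrt(data))

This relies on fork preserving the mapping of the shared buffer; with the spawn start method each child would re-import the module and get its own private buffer, so the parent would never see the writes.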