
Matěj Týč <matej.tyc@gmail.com> wrote:
On 17.5.2016 14:13, Sturla Molden wrote:
> - Parallel processing of HUGE data, and
> This is mainly a Windows problem, as copy-on-write fork() will solve this
> on any other platform. ...

That sounds interesting, could you elaborate on it a bit? Does it mean that if you pass the numpy array to the child process using a Queue, no significant amount of data will flow through it? Or should I not pass it using a Queue at all and just rely on inheritance? I assume that passing it as an argument to the Process class is the worst option, because it will be pickled and unpickled. Or maybe you are referring to modules such as joblib that use this functionality and expose only a nice interface? Finally, COW means that returning large arrays still involves moving data between processes, whereas the shared-memory approach has the workaround that the parent process can preallocate the result array, which the worker process can then write into.
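For concreteness, here is a minimal sketch of the three ways of handing a large array to a worker discussed above. It assumes a Unix-like system where multiprocessing uses the fork start method; the array size and the worker functions are made up for illustration.

import multiprocessing as mp
import numpy as np

big = np.random.random(10_000_000)   # ~80 MB, allocated before the workers start


def inherited_sum(k):
    # With the fork start method, 'big' is visible here through the
    # inherited (copy-on-write) address space: nothing is pickled and
    # nothing is copied as long as we only read it.
    return big[k::4].sum()


def queue_sum(inbox, outbox):
    # The array arriving here was pickled by the parent and unpickled by
    # the child, so the whole ~80 MB travelled through a pipe.
    arr = inbox.get()
    outbox.put(arr.sum())


if __name__ == "__main__":
    # 1) Rely on inheritance: only the small integer argument is pickled.
    with mp.Pool(4) as pool:
        print(sum(pool.map(inherited_sum, range(4))))

    # 2) Pass through a Queue: the full buffer is serialized and copied.
    inbox, outbox = mp.Queue(), mp.Queue()
    p = mp.Process(target=queue_sum, args=(inbox, outbox))
    p.start()
    inbox.put(big)
    print(outbox.get())
    p.join()

    # 3) Process(target=..., args=(big,)) gets pickled in the same way as
    #    the Queue variant, once per worker.

In other words, the Queue and the args variants pay the same pickling cost; it is fork inheritance that makes the scatter direction essentially free.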
> What this means is that shared memory is seldom useful for sharing huge data, even on Windows. It is only useful for this on Unix/Linux, where base addresses can stay the same. But on non-Windows platforms, COW will in 99.99% of the cases be sufficient, thus making shared memory superfluous anyway. We don't need shared memory to scatter large data on Linux, only fork.

I am actually quite comfortable with sharing numpy arrays only. It is a nice format for sharing large amounts of numbers, which is what I want and what many modules accept as input (e.g. the "shapely" module).
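For the gather direction, a sketch of the preallocation workaround mentioned above, again assuming the fork start method so that the children inherit the buffer; the RawArray/np.frombuffer pairing is just one common way to do it, and the sqrt computation is only a placeholder.

from multiprocessing import Process, sharedctypes
import numpy as np

N = 10_000_000
NWORKERS = 4

raw = sharedctypes.RawArray('d', N)              # shared, unlocked buffer of doubles
result = np.frombuffer(raw, dtype=np.float64)    # numpy view of that buffer
data = np.random.random(N)                       # large read-only input, shared via COW


def fill_chunk(k):
    # Each worker writes its slice of the result directly into the shared
    # buffer it inherited, so nothing has to be pickled on the way back.
    lo, hi = k * N // NWORKERS, (k + 1) * N // NWORKERS
    result[lo:hi] = np.sqrt(data[lo:hi])


if __name__ == "__main__":
    workers = [Process(target=fill_chunk, args=(k,)) for k in range(NWORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # The parent sees every worker's writes without any copying.
    assert np.allclose(result, np.sqrt(data))

This relies on fork preserving the mapping of the shared buffer; with the spawn start method each child would re-import the module and get its own private buffer, so the parent would never see the writes.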