Numpy arrays shareable among related processes (PR #7533)
Dear Numpy developers,

I propose a pull request https://github.com/numpy/numpy/pull/7533 that features numpy arrays that can be shared among processes (with some effort).

Why: In CPython, multiprocessing is the only way to exploit multi-core CPUs if your parallel code can't avoid creating Python objects; in that case, CPython's GIL makes threads unusable. However, unlike with threading, sharing data among processes is non-trivial and platform-dependent. Although numpy (and certainly some other packages) implements some operations in a way that the GIL is not a concern, consider another case: you have a large amount of data in the form of a numpy array and you want to pass it to a function of an arbitrary Python module that also expects a numpy array (e.g. a list of vertex coordinates as input and an array of the corresponding polygons as output). Here it is clear that the GIL is an issue, and since you want a numpy array on both ends, you currently have to copy your numpy array to a multiprocessing.Array (to pass the data) and then convert it back to an ndarray in the worker process. This contribution streamlines that: you create an array as you are used to, pass it to the subprocess as you would a multiprocessing.Array, and the worker process can work with a numpy array right away.

How: The idea is to create a numpy array in a buffer that can be shared among processes. Python has support for this in its standard library, so the current solution creates a multiprocessing.Array and then passes it as the "buffer" to ndarray.__new__. That would be it on Unixes, but on Windows there has to be a custom pickle method, otherwise the array "forgets" that its buffer is special and the sharing doesn't work.

Some of what has been said in the pull request, and my answers:

* "... I do see some value in providing a canonical right way to construct shared memory arrays in NumPy, but I'm not very happy with this solution, ... terrible code organization (with the global variables) ..."
  * I understand that, however this is a pattern of Python multiprocessing, and everybody who wants to use the Pool with shared data either is familiar with this approach or has to become familiar with it [2, 3]. A good compromise is to have a separate module for each parallel calculation, so global variables are not a problem.
* "Can you explain why the ndarray subclass is needed? Subclasses can be rather annoying to get right, and also for other reasons."
  * The shmarray class needs the custom pickler (but only on Windows).
* "If there's some way we can paper over the boilerplate such that users can use it without understanding the arcana of multiprocessing, then yes, that would be great. But otherwise I'm not sure there's anything to be gained by putting it in a library rather than referring users to the examples on StackOverflow [1] [2]."
  * What about telling users: "You can use numpy with multiprocessing. Remember the multiprocessing.Value and multiprocessing.Array classes? numpy.shm works exactly the same way, which means that it shares their limitations. Refer to an example: <link to numpy doc>." Notice that although those SO links contain all of the information, it was very difficult for a newcomer like me a few years ago to get it up and running.
* "This needs tests and justification for custom pickling methods, which are not used in any of the current examples. ..."
  * I am sorry, but I don't fully understand that point. The custom pickling method of shmarray has to be there on Windows, but users don't have to know about it at all. As noted earlier, the global variable is the only way of using the standard Python multiprocessing.Pool with shared objects.

[1]: http://stackoverflow.com/questions/10721915/shared-memory-objects-in-python-...
[2]: http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-...
[3]: http://stackoverflow.com/questions/1675766/how-to-combine-pool-map-with-arra...
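To make the mechanism concrete, here is a minimal sketch of the pattern the PR builds on (my illustration, not the PR's actual code; the helper name `shared_ndarray` is hypothetical, and only float64 is handled):

```python
import multiprocessing as mp
import numpy as np

def shared_ndarray(shape):
    """Hypothetical helper: a float64 ndarray whose buffer is shared memory."""
    n = int(np.prod(shape))
    buf = mp.RawArray('d', n)   # 'd' == C double == np.float64
    # Wrap the shared buffer in an ndarray view; no data is copied.
    return np.frombuffer(buf, dtype=np.float64).reshape(shape)

a = shared_ndarray((100, 100))
a[:] = 1.0                      # behaves like any other numpy array
```

On Unix, a process forked after this allocation sees the same pages; on Windows, the PR additionally installs a custom pickler so that the wrapper survives being sent to a worker, as described above.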
On Mon, Apr 11, 2016 at 5:39 AM, Matěj Týč <matej.tyc@gmail.com> wrote:
OK, we can agree to disagree on this one. I still don't think I could get code using this pattern checked in at my work (for good reason).
* If there's some way we can paper over the boilerplate such that users can use it without understanding the arcana of multiprocessing,
I guess I'm still not convinced this is the best we can do with the multiprocessing library. If we're going to do this, then we definitely need to have the fully canonical example. For example, could you make the shared array a global variable and then still pass references to functions called by the processes anyway? The examples on stackoverflow that we're both looking at are varied enough that it's not obvious to me that this is as good as it gets.

* This needs tests and justification for custom pickling methods,
That sounds like a fine justification, but given that it wasn't obvious, you need a comment saying as much in the source code :). Also, it breaks pickle, which is another limitation that needs to be documented.
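For reference, the canonical global-variable pattern under debate looks roughly like this. This is a sketch, assuming the 'fork' start method (the Linux default), so that workers inherit the module-level global rather than pickling it; under 'spawn' (Windows, recent macOS defaults) it would not work as written:

```python
import multiprocessing as mp
import numpy as np

shared = None  # module-level global; forked workers inherit this binding

def work(i):
    shared[i] = 2.0 * i       # writes land in the shared buffer, no copies
    return i

if __name__ == "__main__":
    buf = mp.RawArray('d', 1000)
    shared = np.frombuffer(buf, dtype=np.float64)  # ndarray view, no copy
    with mp.Pool(4) as pool:  # children forked here already see `shared`
        pool.map(work, range(1000))
    print(shared[:4])         # the parent observes the workers' writes
```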
I did some work on this some years ago. I have more or less concluded that it was a waste of effort. But first let me explain why the suggested approach does not work: as it uses memory mapping to create shared memory (i.e. shared segments are not named), the segments must be created ahead of spawning processes. But if you really want this to work smoothly, you want named shared memory (Sys V IPC or POSIX shm_open), so that shared arrays can be created in the spawned processes and passed back.

Now for the reasons I don't care about shared memory arrays anymore, and what I am currently working on instead:

1. I have come across very few cases where threaded code cannot be used in numerical computing. In fact, multithreading nearly always happens in code where I write pure C or Fortran anyway. Most often it happens in library code that is already multithreaded (Intel MKL, Apple Accelerate Framework, OpenBLAS, etc.), which means using it requires no extra effort from my side. A multithreaded LAPACK library is not less multithreaded if I call it from Python.

2. Getting shared memory right can be difficult because of hierarchical memory and false sharing. You might not see it if you only have a multicore CPU with a shared cache, but your code might not scale up on computers with more than one physical processor. False sharing acts like the GIL, except it happens in hardware and affects your C code invisibly, without any explicit locking you can pinpoint. This is also why MPI code tends to scale much better than OpenMP code: if nothing is shared, there will be no false sharing.

3. Raw C-level IPC is cheap – very, very cheap. Even if you use pipes or sockets instead of shared memory, it is cheap. There are very few cases where the IPC tends to be a bottleneck.

4. The reason IPC appears expensive with NumPy is that multiprocessing pickles the arrays. It is pickle that is slow, not the IPC. Some would say that the pickle overhead is an integral part of the IPC overhead, but I will argue that it is not. The slowness of pickle is a separate problem altogether.

5. Shared memory does not improve on the pickle overhead, because NumPy arrays with shared memory must also be pickled. Multiprocessing can bypass pickling the RawArray object, but the rest of the NumPy array is pickled. Using shared memory arrays has no speed advantage over normal NumPy arrays when we use multiprocessing.

6. It is much easier to write concurrent code that uses queues for message passing than anything else. That is why using a Queue object has been the popular Pythonic approach to both multithreading and multiprocessing. I would like this to continue, and I am therefore focusing my effort on the multiprocessing.Queue object.

If you understand the six points listed above, you will see where this is going: what we really need is a specialized queue that has knowledge about NumPy arrays and can bypass pickle. I am therefore focusing my efforts on creating a NumPy-aware queue object. (See the sketch below.) We are not doing the users a favor by encouraging the use of shared memory arrays. They help with nothing.

Sturla Molden
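To illustrate the direction sketched in point 6 — a channel that ships the array payload without pickling it — here is a minimal sketch of my own (not Sturla's implementation): only a tiny (dtype, shape) header is pickled, while the buffer goes over the pipe as raw bytes:

```python
import multiprocessing as mp
import numpy as np

def send_array(conn, arr):
    arr = np.ascontiguousarray(arr)
    conn.send((arr.dtype.str, arr.shape))  # tiny pickled header only
    conn.send_bytes(arr)                   # raw buffer bytes: cheap IPC

def recv_array(conn):
    dtype, shape = conn.recv()
    data = conn.recv_bytes()
    # frombuffer returns a read-only view over the received bytes;
    # call .copy() if a writable array is needed.
    return np.frombuffer(data, dtype=dtype).reshape(shape)

def _producer(conn):
    send_array(conn, np.arange(6.0))

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=_producer, args=(child_end,))
    p.start()
    print(recv_array(parent_end))          # [0. 1. 2. 3. 4. 5.]
    p.join()
```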
On 05/11/2016 04:29 AM, Sturla Molden wrote:
That's interesting. I've also used multiprocessing with numpy and didn't realize that. Is this true in python3 too? In python2 it appears that multiprocessing uses pickle protocol 0, which must cause a big slowdown (a factor of 100) relative to protocol 2, and uses pickle instead of cPickle.

```
a = np.arange(40*40)

%timeit pickle.dumps(a)
1000 loops, best of 3: 1.63 ms per loop

%timeit cPickle.dumps(a)
1000 loops, best of 3: 1.56 ms per loop

%timeit cPickle.dumps(a, protocol=2)
100000 loops, best of 3: 18.9 µs per loop
```

Python 3 uses protocol 3 by default:

```
%timeit pickle.dumps(a)
10000 loops, best of 3: 20 µs per loop
```
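A quick way to see where much of that factor comes from: protocol 0 is a text protocol, so the array's buffer is escaped byte by byte, while the binary protocols embed it as a single string. A sketch (Python 3, where both protocols are still available for comparison):

```python
import pickle
import numpy as np

a = np.arange(40 * 40)
for proto in (0, 2, pickle.HIGHEST_PROTOCOL):
    print(proto, len(pickle.dumps(a, protocol=proto)))
# The protocol-0 payload is much larger, since the raw buffer is
# escaped as text; producing and parsing that text is what costs.
```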
Oftentimes, if one needs to share numpy arrays for multiprocessing, I would imagine that it is because the array is huge, right? So the pickling approach would copy that array for each process, which defeats the purpose, right?

Ben Root
Hi,

I've been thinking about and exploring this for some time. If we are to start some effort, I'd like to help. Here are my comments, mostly regarding Sturla's comments.

1. If we are talking about shared memory and copy-on-write inheritance, then we are using 'fork'. If we are free to use fork, then a large chunk of the concerns regarding the Python std library multiprocessing is no longer relevant, especially the "functions must be in a module" limitation that tends to impose a special requirement on the software design.

2. Pickling of an inherited shared memory array can be done minimally by just pickling the array_interface and the pointer address, because the child process and the parent share the same address-space layout, guaranteed by the fork call. (See the sketch after this list.)

3. The RawArray and RawValue implementations in std multiprocessing have their own memory allocator for managing small variables. It is a huge overkill (in terms of implementation) if we only care about very large memory chunks.

4. Hidden synchronization cost on multi-CPU (NUMA?) systems. A choice is to defer the responsibility of avoiding races to the developer. Simple constructs for working on slices of an array in parallel can cover a huge fraction of use cases and fully avoid this issue.

5. Whether to delegate parallelism to the underlying low-level implementation, or to implement the parallelism in Python while keeping the underlying low-level implementation sequential, probably depends on the problem. It may be convenient, given the current state of parallelism support in Python, to delegate, but will it forever be the case? For example, after the MPI FFTW binding was stuck for a long time, someone wrote a parallel Python FFT package (https://github.com/spectralDNS/mpiFFT4py) that uses FFTW for the sequential parts and writes all the parallel semantics in Python with mpi4py, and it uses a more efficient domain decomposition.

6. If we are to define a set of operations, I would recommend taking a look at OpenMP as a reference -- it has been out there for decades and is used widely. An equivalent of the 'omp parallel for' construct in Python would be a very good starting point and immediately useful.

- Yu
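Here is the sketch promised in point 2: pickle only the array_interface metadata and reconstruct the view from the raw address in the child (my illustration; Unix-only, valid only under fork, and writes from the child are not visible to the parent unless the buffer itself lives in a shared mapping):

```python
import ctypes
import multiprocessing as mp
import numpy as np

def pack(arr):
    # After fork, the pointer stays valid in the child, so metadata suffices.
    addr = arr.__array_interface__['data'][0]
    return addr, arr.dtype.str, arr.shape

def unpack(addr, dtype, shape):
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = (ctypes.c_char * nbytes).from_address(addr)
    return np.frombuffer(buf, dtype=dtype).reshape(shape)

if __name__ == "__main__":
    a = np.arange(10.0)
    meta = pack(a)

    def child():
        b = unpack(*meta)   # view on the parent's pages (copy-on-write)
        print(b.sum())      # 45.0 — read without any copying or pickling

    p = mp.get_context('fork').Process(target=child)
    p.start()
    p.join()
```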
Feng Yu <rainwoodman@gmail.com> wrote:
1. If we are talking about shared memory and copy-on-write inheritance, then we are using 'fork'.
Not available on Windows. On Unix it only allows one-way communication, from parent to child.
Again, not everyone uses Unix. And on Unix it is not trivial to pass data back from the child process. I solved that problem with Sys V IPC (pickling the name of the segment).
If you are on Unix, you can just use a context manager. Call os.fork in __enter__ and os.waitpid in __exit__. Sturla
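A sketch of that context manager (assumptions: Unix only; the with-body runs in both parent and child, so the caller must branch on a flag; error propagation from the child is omitted):

```python
import os

class forked:
    """Run a block in a forked child; the parent waits for it on exit."""
    def __enter__(self):
        self.pid = os.fork()          # Unix only
        self.is_child = (self.pid == 0)
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.is_child:
            # The child must never fall through into the parent's code.
            os._exit(0 if exc_type is None else 1)
        os.waitpid(self.pid, 0)       # parent blocks until the child exits
        return False

with forked() as f:
    if f.is_child:
        print("child:", os.getpid())
print("parent continues:", os.getpid())
```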
Niki Spahiev <niki.spahiev@gmail.com> wrote:
Apparently next Win10 will have fork as part of bash integration.
It is Interix/SUA rebranded as "Subsystem for Linux". It remains to be seen how long it will stay this time. Also, a Python built for this subsystem will not run on the Win32 subsystem, so there are no graphics. And it will not be installed by default, just like SUA.
I wonder if it is necessary to insist on being able to pass a large amount of data back from the child to the parent process. In most (half?) of the situations, the result can be written directly back via a preallocated shared array before the workers are spawned. Then there is no need to pass data back with named segments.

Here I am just doodling some possible use cases along the OpenMP line. The sample would just copy the data from s to r, in two different ways. On systems that do not support multiprocessing + fork, the semantics are still well preserved if threading is used.

```
import ...... as mp

# the access attribute of inherited variables is at least 'privatecopy',
# but with a threading backend it becomes 'shared'
s = numpy.arange(10000)

with mp.parallel(num_threads=8) as section:
    r = section.empty(10000)
    # variables defined via section.empty will always be 'shared'

    def work():
        # variables defined in the body are 'private'
        tid = section.get_thread_num()
        size = section.get_num_threads()
        sl = slice(tid * r.size // size, (tid + 1) * r.size // size)
        r[sl] = s[sl]

    status = section.run(work)
    assert not any(status.errors)

    # support for the following could be implemented with section.run
    chunksize = 1000

    def work(i):
        sl = slice(i, i + chunksize)
        r[sl] = s[sl]
        return s[sl].sum()

    status = section.loop(work, range(0, r.size, chunksize), schedule='static')
    assert not any(status.errors)
    total = sum(status.results)
```
Feng Yu <rainwoodman@gmail.com> wrote:
You can work around it in various ways, this being one of them. Personally I prefer a parallel programming style with queues – either to scatter arrays to workers and collect arrays from workers, or to chain workers together in a pipeline (without using coroutines). But exactly how you program is a matter of taste. I want to make it as inexpensive as possible to pass a NumPy array through a queue. If anyone else wants to help improve parallel programming with NumPy using a different paradigm, that is fine too. I just wanted to clarify why I stopped working on shared memory arrays. (As for the implementation, I am also experimenting with platform-dependent asynchronous I/O (IOCP, GCD or kqueue, epoll) to pass NumPy arrays through a queue as inexpensively and scalably as possible. And no, there is no public repo, as I like to experiment with my pet project undisturbed before I let it out into the wild.)

Sturla
Even though I am not very obsessed with functional style and queues, I still have to agree with you that queues tend to produce more readable and less verbose code -- if there is the right tool.
It would be wonderful if there were a way to pass numpy arrays around without a huge dependency list. After all, we know the address of the array and, in principle, we are able to find the physical pages and map them on the receiver side. Also, did you check out http://zeromq.org/blog:zero-copy ? ZeroMQ is a dependency of Jupyter, so it is quite available.

- Yu
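For reference, pyzmq can already hand a NumPy array's buffer to the transport without an extra copy on the sending side (the kernel still copies across the socket for tcp/ipc transports). A sketch, assuming pyzmq is installed and a connected pair of sockets:

```python
import zmq
import numpy as np

def send_array(sock, arr):
    arr = np.ascontiguousarray(arr)
    sock.send_json({'dtype': arr.dtype.str, 'shape': arr.shape}, zmq.SNDMORE)
    sock.send(arr, copy=False)        # zero-copy hand-off of the buffer

def recv_array(sock):
    meta = sock.recv_json()
    frame = sock.recv(copy=False)     # zmq.Frame wrapping the message buffer
    a = np.frombuffer(frame.buffer, dtype=meta['dtype'])
    return a.reshape(meta['shape'])
```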
Feng Yu <rainwoodman@gmail.com> wrote:
Also, did you check out http://zeromq.org/blog:zero-copy ? ZeroMQ is a dependency of Jupyter, so it is quite available.
ZeroMQ is great, but it lacks some crucial features. In particular, it does not support IPC on Windows. Ideally one should e.g. use Unix domain sockets on Linux and named pipes on Windows. Most MPI implementations seem to prefer shared memory over these mechanisms, though. Also, I am not sure about ZeroMQ and async I/O. I would e.g. like to use IOCP on Windows, GCD on Mac, and a thread pool plus epoll on Linux.

Sturla
Benjamin Root <ben.v.root@gmail.com> wrote:
Oftentimes, if one needs to share numpy arrays for multiprocessing, I would imagine that it is because the array is huge, right?
That is a case for shared memory, but what I was talking about is more common than this. In order for processes to cooperate, they must communicate. So we need a way to pass around NumPy arrays quickly. Sometimes we want to use shared memory because of the size of the data, but more often it is just used as a form of inexpensive IPC.
I am not sure what you mean. When I made shared memory arrays, I used named segments and made sure only the names of the segments were pickled, not the contents of the buffers.

Sturla
In python2 it appears that multiprocessing uses pickle protocol 0, which must cause a big slowdown (a factor of 100) relative to protocol 2, and uses pickle instead of cPickle.
Even on Python 2.x, multiprocessing uses protocol 2, not protocol 0. The default for the `pickle` module changed, but multiprocessing has always used a binary pickle protocol to communicate between processes. Have a look at multiprocessing's forking.py in Python 2.7. As some context here for folks that may not be aware, Sturla is referring to his earlier shared memory implementation <https://github.com/sturlamolden/sharedmem-numpy> he wrote that avoids actually pickling the data, and instead essentially pickles a pointer to an array in shared memory. As Sturla very nicely summed up, it saves memory usage, but doesn't help the deeper issues. You're far better off just communicating between processes as opposed to using shared memory.
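The trick Joe describes — pickling a handle instead of the buffer — looks roughly like this. A sketch using multiprocessing.shared_memory (Python 3.8+, which did not exist at the time of this thread) since it provides named segments portably; segment lifetime management (close/unlink) is omitted:

```python
import numpy as np
from multiprocessing import shared_memory  # Python 3.8+

class ShmArray(np.ndarray):
    """ndarray on a named segment; pickles as (name, dtype, shape) only."""
    def __reduce__(self):
        return (_attach, (self._shm.name, self.dtype.str, self.shape))

def create(shape, dtype=np.float64):
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    a = np.ndarray(shape, dtype=dtype, buffer=shm.buf).view(ShmArray)
    a._shm = shm   # keep the segment alive as long as the view lives
    return a

def _attach(name, dtype, shape):
    shm = shared_memory.SharedMemory(name=name)
    a = np.ndarray(shape, dtype=dtype, buffer=shm.buf).view(ShmArray)
    a._shm = shm
    return a
```

Unpickling such an array in another process calls `_attach`, which maps the same named segment instead of copying the data.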
On 05/11/2016 06:39 PM, Joe Kington wrote:
Are you sure? As far as I understood the code, it uses the default protocol 0. Also, the file forking.py no longer exists: https://github.com/python/cpython/tree/master/Lib/multiprocessing (see reduction.py and queue.py) http://bugs.python.org/issue23403
Allan Haldane <allanhaldane@gmail.com> wrote:
That's interesting. I've also used multiprocessing with numpy and didn't realize that. Is this true in python3 too?
I am not sure. As you have noticed, pickle is faster by two orders of magnitude on Python 3. But several microseconds is also a lot, particularly if we are going to do this often during a computation.

Sturla
On 11.5.2016 10:29, Sturla Molden wrote:
I did some work on this some years ago. ...
I am sorry, I missed this discussion when it started. There are two cases where I felt I had to use this functionality:

- Parallel processing of HUGE data, and
- using parallel processing in an application that had plug-ins which operated on one shared array (that was updated every now and then - it was a producer-consumer pattern thing).

Once everything was set up, it worked like a charm. The thing I especially like about the proposed module is the lack of external dependencies + it works if one knows how to use it. The bad thing about it is its fragility - I admit that using it as it is is not particularly intuitive. Unlike Sturla, I think that this is not a dead end, but it does feel clumsy. However, I dislike the necessity of writing Cython or C to get true multithreading, for reasons I have mentioned - what if you want to run high-level Python functions in parallel? So what I would really like to see is some kind of numpy documentation on how to approach parallel computing with numpy arrays (depending on what kind of task one wants to achieve). Maybe just using the queue is good enough, or there are those third-party modules with known limitations? Plenty of people start off with numpy, so some kind of overview should be part of the numpy docs.
Matěj Týč <matej.tyc@gmail.com> wrote:
- Parallel processing of HUGE data, and
This is mainly a Windows problem, as copy-on-write fork() solves this on any other platform. I am more in favor of asking Microsoft to fix their broken OS. Also observe that the usefulness of shared memory is very limited on Windows, as in practice we never get the same base address in a spawned process. This prevents sharing data structures that contain pointers and Python objects; anything more complex than an array cannot be shared. What this means is that shared memory is seldom useful for sharing huge data, even on Windows. It is only useful for this on Unix/Linux, where base addresses can stay the same. But on non-Windows platforms, COW will in 99.99% of cases be sufficient, making shared memory superfluous anyway. We don't need shared memory to scatter large data on Linux, only fork. As I see it, shared memory is mostly useful as a means to construct an inter-process communication (IPC) protocol.

Sturla
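A sketch of the fork-based scatter described here: a large array created before the fork is readable in every worker through copy-on-write, with no pickling of the data and no shared-memory machinery (assumes the Unix 'fork' start method; under 'spawn' the module would be re-imported and the data re-created):

```python
import multiprocessing as mp
import numpy as np

big = np.random.rand(10_000_000)   # allocated once, before the fork

def chunk_sum(bounds):
    lo, hi = bounds
    return big[lo:hi].sum()        # reads the parent's pages via COW

if __name__ == "__main__":
    ctx = mp.get_context('fork')   # Unix only
    n = 4
    step = big.size // n
    bounds = [(i * step, (i + 1) * step) for i in range(n)]
    with ctx.Pool(n) as pool:
        print(sum(pool.map(chunk_sum, bounds)))
```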
On 17.5.2016 14:13, Sturla Molden wrote: [...]

Do you mean that if you pass the numpy array to the child process using Queue, no significant amount of data will flow through it? Or I shouldn't pass it using Queue at all and just rely on inheritance? Finally, I assume that passing it as an argument to the Process class is the worst option, because it will be pickled and unpickled. Or maybe you refer to modules such as joblib that use this functionality and expose only a nice interface?

And finally, COW means that returning large arrays still involves data moving between processes, whereas the shm approach has the workaround that the parent process can preallocate the result array for the worker process to write to.
Matěj Týč <matej.tyc@gmail.com> wrote:
This is what my shared memory arrays do.
Or I shouldn't pass it using Queue at all and just rely on inheritance?
This is what David Baddeley's shared memory arrays do.
My shared memory arrays pickle only the metadata, and can be used in this way.
Or maybe you refer to modules such as joblib that use this functionality and expose only a nice interface?
Joblib creates "shared memory" by memory-mapping a temporary file, which is backed by RAM on Linux (tmpfs). It is backed by a physical file on disk on Mac and Windows. In this respect, joblib is much better on Linux than on Mac or Windows.
My shared memory arrays need no workaround for this. They also allow shared memory arrays to be returned to the parent process. No preallocation is needed.

Sturla
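A sketch of the memory-mapped-file approach joblib uses, done by hand (assumes a Linux tmpfs mount such as /dev/shm, so the file is RAM-backed; on Mac or Windows the same code would hit a real disk file, as noted above):

```python
import numpy as np

path = '/dev/shm/scratch_array.dat'   # assumption: Linux tmpfs mount

# Parent: create a RAM-backed array and fill it.
a = np.memmap(path, dtype=np.float64, mode='w+', shape=(1000,))
a[:] = np.arange(1000)
a.flush()

# Any process that opens the same path maps the same pages.
b = np.memmap(path, dtype=np.float64, mode='r+', shape=(1000,))
b[0] = 42.0
assert a[0] == 42.0                   # both views share the mapping
```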
participants (8)

- Allan Haldane
- Benjamin Root
- Feng Yu
- Joe Kington
- Matěj Týč
- Niki Spahiev
- Stephan Hoyer
- Sturla Molden