
Hi, I've been thinking about and exploring this for some time. If we are to start some effort, I'd like to help. Here are my comments, mostly in response to Sturla's:

1. If we are talking about shared memory and copy-on-write inheritance, then we are using 'fork'. If we are free to use fork, a large chunk of the concerns about the Python standard library multiprocessing is no longer relevant. In particular, the limitation that worker functions must be defined at module level, which tends to impose a special requirement on the software design, goes away.

2. Pickling an inherited shared memory array can be done minimally by pickling just the array interface and the pointer address. This works because the child process and the parent share the same address space layout, guaranteed by the fork call.

3. The RawArray and RawValue implementations in the standard multiprocessing module have their own memory allocator for managing small variables. That is huge overkill (in terms of implementation) if we only care about very large memory chunks.

4. Hidden synchronization costs on multi-CPU (NUMA?) systems. One choice is to defer the responsibility for avoiding races to the developer. Simple constructs for working on slices of an array in parallel can cover a huge fraction of use cases and avoid this issue entirely.

5. Whether to delegate parallelism to the underlying low-level implementation, or to implement the parallelism in Python while keeping the underlying low-level implementation sequential, probably depends on the problem. Given the current state of parallelism support in Python it may be convenient to delegate, but will that always be the case? For example, after the MPI FFTW binding was stuck for a long time, someone wrote a parallel Python FFT package (https://github.com/spectralDNS/mpiFFT4py) that uses FFTW for the sequential parts, writes all of the parallel semantics in Python with mpi4py, and uses a more efficient domain decomposition.

6. If we are to define a set of operations, I would recommend taking a look at OpenMP as a reference -- it has been out there for decades and is widely used. An equivalent of the 'omp parallel for' construct in Python would be a very good starting point and immediately useful (a rough sketch follows below).

- Yu
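As a rough sketch of the 'parallel for' idea in point 6 (combined with the fork-based sharing from points 1 and 4): each worker writes only to its own slice of an anonymous shared mapping that the children inherit through fork, so nothing is pickled and no locking is needed. The parallel_for helper and the chunking below are illustrative assumptions, not an existing API, and the whole thing assumes the Unix 'fork' start method:

    import mmap
    import multiprocessing as mp

    import numpy as np

    def parallel_for(body, n, nworkers=4):
        # Run body(lo, hi) on nworkers disjoint index ranges covering [0, n),
        # roughly what 'omp parallel for' with a static schedule would do.
        edges = np.linspace(0, n, nworkers + 1).astype(int)
        workers = [mp.Process(target=body, args=(lo, hi))
                   for lo, hi in zip(edges[:-1], edges[1:])]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

    n = 10 ** 7

    # Anonymous shared mapping: forked children see the same pages, so their
    # writes are visible to the parent without any pickling or copying back.
    out = np.frombuffer(mmap.mmap(-1, n * 8), dtype=np.float64)
    src = np.random.rand(n)        # read-only input, inherited copy-on-write

    def body(lo, hi):
        # Each worker touches only its own slice, so no synchronization is needed.
        out[lo:hi] = np.sqrt(src[lo:hi])

    parallel_for(body, n)
    print(np.allclose(out, np.sqrt(src)))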
On Wed, May 11, 2016 at 11:22 AM, Benjamin Root <ben.v.root@gmail.com> wrote:

Oftentimes, if one needs to share numpy arrays for multiprocessing, I would imagine that it is because the array is huge, right? So, the pickling approach would copy that array for each process, which defeats the purpose, right?
Ben Root
On Wed, May 11, 2016 at 2:01 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 05/11/2016 04:29 AM, Sturla Molden wrote:
4. The reason IPC appears expensive with NumPy is because multiprocessing pickles the arrays. It is pickle that is slow, not the IPC. Some would say that the pickle overhead is an integral part of the IPC overhead, but I will argue that it is not. The slowness of pickle is a separate problem altogether.
That's interesting. I've also used multiprocessing with numpy and didn't realize that. Is this true in python3 too?
In python2 it appears that multiprocessing uses pickle protocol 0, which must cause a big slowdown (about a factor of 100) relative to protocol 2, and it uses pickle instead of cPickle.
a = np.arange(40*40)
%timeit pickle.dumps(a)
1000 loops, best of 3: 1.63 ms per loop

%timeit cPickle.dumps(a)
1000 loops, best of 3: 1.56 ms per loop

%timeit cPickle.dumps(a, protocol=2)
100000 loops, best of 3: 18.9 µs per loop
Python 3 uses protocol 3 by default:
%timeit pickle.dumps(a)
10000 loops, best of 3: 20 µs per loop
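For reference, a self-contained version of this comparison (a rough sketch assuming Python 3, where cPickle has been folded into pickle; it uses the timeit module instead of the %timeit magic):

    import pickle
    import timeit

    import numpy as np

    a = np.arange(40 * 40)

    # Protocol 0 is the old text-based format; protocol 2 is binary;
    # DEFAULT_PROTOCOL is what Python 3 picks when none is given.
    for proto in (0, 2, pickle.DEFAULT_PROTOCOL):
        t = timeit.timeit(lambda: pickle.dumps(a, protocol=proto), number=1000)
        print("protocol %d: %.1f us per dumps() call" % (proto, t / 1000 * 1e6))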
5. Shared memory does not improve on the pickle overhead, because NumPy arrays backed by shared memory must also be pickled. Multiprocessing can bypass pickling the RawArray object, but the rest of the NumPy array is pickled. Using shared memory arrays has no speed advantage over normal NumPy arrays when we use multiprocessing.
6. It is much easier to write concurrent code that uses queues for message passing than anything else. That is why using a Queue object has been the popular Pythonic approach to both multithreading and multiprocessing. I would like this to continue.
I am therefore focusing my effort on the multiprocessing.Queue object. If you understand the six points I listed, you will see where this is going: what we really need is a specialized queue that has knowledge about NumPy arrays and can bypass pickle -- that is, a NumPy-aware queue object, which is what I am working on.
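As a rough illustration of what "bypassing pickle" for the data could look like (a sketch only, not Sturla's implementation; the ArrayQueue name and layout are made up): ship a small (shape, dtype) header through an ordinary Queue and push the raw buffer through a Pipe, so the array data itself never goes through pickle:

    import multiprocessing as mp

    import numpy as np

    class ArrayQueue(object):
        # Toy single-producer/single-consumer queue: the header travels
        # through a normal Queue, the data buffer travels as raw bytes.
        def __init__(self):
            self._meta = mp.Queue()
            self._recv, self._send = mp.Pipe(duplex=False)

        def put(self, arr):
            arr = np.ascontiguousarray(arr)
            self._meta.put((arr.shape, arr.dtype.str))
            self._send.send_bytes(arr)           # raw buffer, no pickle

        def get(self):
            shape, dtype = self._meta.get()
            data = self._recv.recv_bytes()
            # frombuffer returns a read-only view of the received bytes;
            # add .copy() if a writable array is needed.
            return np.frombuffer(data, dtype=dtype).reshape(shape)

    def producer(q):
        q.put(np.arange(10 ** 6, dtype=np.float64))

    if __name__ == '__main__':
        q = ArrayQueue()
        p = mp.Process(target=producer, args=(q,))
        p.start()
        a = q.get()        # parent receives while the child is sending
        p.join()
        print(a.shape, a.dtype)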
We are not doing the users a favor by encouraging the use of shared memory arrays. They help with nothing.
Sturla Molden