Shared memory ndarrays (update)

Here is an update for the shared memory arrays that Gaël and I wrote two years ago. They are NumPy arrays referencing shared memory, and IPC using multiprocessing.Queue is possible by monkey-patching how ndarrays are pickled. Usage:

    import numpy as np
    import sharedmem as sm
    shared_array = sm.zeros(n)

I.e. the only differences from ordinary ndarrays are that pickle.dumps and multiprocessing.Queue do not make a copy of the buffer, and that the allocated memory is shared between processes (e.g. created with os.fork, subprocess or multiprocessing). A named memory map of the paging file is used on Windows. Unix System V IPC is used on Linux/Unix (thanks to Philip Semanchuk for assistance).

Changes:

- 64-bit support.
- Memory leak on Linux/Unix should be gone (monkey patch for os._exit).
- Added a global lock, as there are callbacks to Python (the GIL is not sufficient serialization).

I need help with testing, particularly on Linux / Apple and with the most recent NumPy. I'm an idiot with build tools, hence no setup.py. Invoke Cython and then cc. Compile sharedmemory_sysv.pyx for Linux/Unix or sharedmemory_sysv.pyx and ntqueryobject.c for Windows.

Regards, Sturla
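To see what "no copy over a Queue" buys you, here is a minimal sketch of the intended usage, assuming sm.zeros behaves as described above; the worker function and array size are illustrative:

    import multiprocessing
    import numpy as np
    import sharedmem as sm

    def worker(q):
        a = q.get()        # unpickling attaches to the same shared segment; no copy
        a[:] = 42.0        # writes are visible in the parent process

    if __name__ == '__main__':
        shared_array = sm.zeros(10)
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=worker, args=(q,))
        p.start()
        q.put(shared_array)    # pickles only the segment descriptor, not the buffer
        p.join()
        print(shared_array)    # all elements are now 42.0

With an ordinary ndarray, q.put() would serialize the whole buffer and the child would write into a private copy.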

On 11.04.2011 01:20, Sturla Molden wrote:
I'm an idiot with build tools, hence no setup.py. Invoke Cython and then cc. Compile sharedmemory_sysv.pyx for Linux/Unix or sharedmemory_sysv.pyx and ntqueryobject.c for Windows.
Eh, that is sharedmemory_win.pyx and ntqueryobject.c for Windows :-) Sturla

Hey Sturla,

It's really great that you are still working on that. I'll test the code under Linux.

The scipy community has moved to github. If I create a repository under github and put the code on it, would you use it? If I find time, I'll add a setup.py.

Gaël

On Mon, Apr 11, 2011 at 7:05 AM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Hey Sturla,
It's really great that you are still working on that. I'll test the code under Linux.
The scipy community has moved to github. If I create a repository under github and put the code on it, would you use it? If I find time, I'll add a setup.py.
Gaël
Hi,

just wanted to say that I find this module, and shared memory in general, very interesting, because I work with very large image data sets.

I have a non-Python question: for Java there seems to exist a module/package/class called nio:

http://download.oracle.com/javase/1.4.2/docs/api/java/nio/MappedByteBuffer.h...

    public abstract class MappedByteBuffer extends ByteBuffer

    A direct byte buffer whose content is a memory-mapped region of a file.

Could this be used to share memory between Java applications and numpy? Just hoping that someone here knows Java much better than me.

- Sebastian Haase

On 04/11/2011 09:21 AM, Sebastian Haase wrote:
I have a non-Python question: for Java there seems to exist a module/package/class called nio
http://download.oracle.com/javase/1.4.2/docs/api/java/nio/MappedByteBuffer.h...

    public abstract class MappedByteBuffer extends ByteBuffer

    A direct byte buffer whose content is a memory-mapped region of a file.
Could this be used to share memory between Java applications and numpy?

Hi,

it could, but you'd have to do the parsing of data yourself. So nothing fancy unless you want to reimplement numpy in Java :). But if you use an mmaped file as a backing for a numpy array of one of the types also available in Java (signed byte, short, int, long, float, double) and then map it on the Java side with DoubleBuffer or FloatBuffer or ..., then it seems straightforward enough.
Zbyszek
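For the file-backed variant Zbyszek describes, the Python side is just a plain numpy memmap of a Java-compatible dtype; the file name, element type, byte order and shape are the whole contract between the two sides. A minimal sketch (file name and shape are illustrative):

    import numpy as np

    # Python side: create a file-backed array of doubles. A Java process can
    # map the same file with FileChannel.map(...) and view it through
    # asDoubleBuffer(). Note that Java's ByteBuffer defaults to big-endian;
    # call order(ByteOrder.nativeOrder()) first to match numpy's native order.
    a = np.memmap('shared_data.bin', dtype=np.float64, mode='w+', shape=(1000,))
    a[:] = 0.0
    a.flush()      # push the data through to the mapping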

Hi everyone,

I was looking up the options that are available for shared memory arrays and this thread came up at the right time. The doc says that multiprocessing.Array(...) gives a shared memory array, but from the code it seems to me that it is actually using an mmap. Is that a correct assessment, and if so, is there any advantage in using multiprocessing.Array(...) over simple numpy mmaped arrays?

Regards, srean
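For what it's worth, the usual way to combine the two is to view the multiprocessing.Array buffer as an ndarray without copying; a minimal sketch (size illustrative, lock handling omitted):

    import multiprocessing
    import numpy as np

    shared = multiprocessing.Array('d', 100)               # anonymous shared memory + lock
    a = np.frombuffer(shared.get_obj(), dtype=np.float64)  # zero-copy ndarray view
    a[:] = 1.0                                             # writes go to the shared buffer

Children created with os.fork inherit the same buffer; under spawn the Array must be passed to the Process constructor and the ndarray view rebuilt in the child.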

Apologies for adding to my own post. multiprocessing.Array(...) uses an anonymous mmapped file; I am not sure if that means it is resident in RAM or on the swap device. But my original question remains: what are the pros and cons of using it versus numpy mmapped arrays? If multiprocessing.Array is indeed resident in memory (subject to swapping, of course), that would still be advantageous compared to a file mapped from an on-disk filesystem.

On Mon, Apr 11, 2011 at 12:42 PM, srean <srean.list@gmail.com> wrote:
Hi everyone,
I was looking up the options that are available for shared memory arrays and this thread came up at the right time. The doc says that multiprocessing.Array(...) gives a shared memory array, but from the code it seems to me that it is actually using an mmap. Is that a correct assessment, and if so, is there any advantage in using multiprocessing.Array(...) over simple numpy mmaped arrays?
Regards srean

"Shared memory" is memory mapping from the paging file (i.e. RAM), not a file on disk. They can have a name or be anonymous. I have explained why we need named shared memory before. If you didn't understand it, try to pass an instance of |multiprocessing.Array over | |multiprocessing.Queue. |Sturla Den 11.04.2011 20:11, skrev srean:
Apologies for adding to my own post. multiprocessing.Array(...) uses an anonymous mmapped file; I am not sure if that means it is resident in RAM or on the swap device. But my original question remains: what are the pros and cons of using it versus numpy mmapped arrays? If multiprocessing.Array is indeed resident in memory (subject to swapping, of course), that would still be advantageous compared to a file mapped from an on-disk filesystem.
On Mon, Apr 11, 2011 at 12:42 PM, srean <srean.list@gmail.com> wrote:
Hi everyone,
I was looking up the options that are available for shared memory arrays and this thread came up at the right time. The doc says that multiprocessing.Array(...) gives a shared memory array, but from the code it seems to me that it is actually using an mmap. Is that a correct assessment, and if so, is there any advantage in using multiprocessing.Array(...) over simple numpy mmaped arrays?

Regards, srean
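To see the failure Sturla is referring to, note that Queue.put() pickles its argument in a feeder thread, so pickling directly shows the error synchronously. A minimal sketch:

    import pickle
    import multiprocessing

    arr = multiprocessing.Array('d', 10)   # anonymous shared memory + lock
    try:
        pickle.dumps(arr)                  # what Queue.put() would do
    except RuntimeError as e:
        # "SynchronizedArray objects should only be shared between
        # processes through inheritance"
        print(e)

A named segment removes the problem: the pickle only has to carry the segment name and the array metadata, and the receiver reattaches by name.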

Got you, and thanks a lot for the explanation. I am not using Queues, so I think I am safe for the time being. Given that you have worked a lot on these issues, would you recommend plain mmapped numpy arrays over multiprocessing.Array?

Thanks again -- srean

On Mon, Apr 11, 2011 at 1:36 PM, Sturla Molden <sturla@molden.no> wrote:
"Shared memory" is memory mapping from the paging file (i.e. RAM), not a file on disk. They can have a name or be anonymous. I have explained why we need named shared memory before. If you didn't understand it, try to pass an instance of multiprocessing.Array over multiprocessing.Queue.
Sturla

On 11.04.2011 21:15, srean wrote:
Got you, and thanks a lot for the explanation. I am not using Queues, so I think I am safe for the time being. Given that you have worked a lot on these issues, would you recommend plain mmapped numpy arrays over multiprocessing.Array?

With multiprocessing you must use multiprocessing.Array. With os.fork you can use either.

Sturla
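A sketch of the os.fork case: an anonymous shared mapping created before the fork is inherited by the child, so a plain mmap-backed ndarray works with no named segment at all (Unix only; size illustrative):

    import mmap
    import os
    import numpy as np

    buf = mmap.mmap(-1, 8 * 100)               # anonymous shared mapping
    a = np.frombuffer(buf, dtype=np.float64)   # 100 doubles, zero-copy view

    pid = os.fork()
    if pid == 0:                               # child inherits the mapping
        a[0] = 3.14
        os._exit(0)
    os.waitpid(pid, 0)
    print(a[0])                                # 3.14, written by the child

What fork cannot give you is attaching from an unrelated process, which is where named segments come back in.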

On 11.04.2011 14:58, Zbigniew Jędrzejewski-Szmek wrote:
Hi, it could, but you'd have to do the parsing of data yourself. So nothing fancy unless you want to reimplement numpy in Java :)
Not really. Only the data buffer is stored in shared memory. If you can pass the required fields to Java (shape, strides, segment name, offset to the first element), the memory-mapped ndarray can be used by Java. (Doing it with JNI is trivial; I am not sure what Java's standard library supports.) Interop with C++, Fortran, C#, Matlab, whatever, can be done similarly.
But if you use an mmaped file as a backing for a numpy array of one of the types also available in Java (signed byte, short, int, long, float, double) and then map it on the Java side with DoubleBuffer or FloatBuffer or ..., then it seems straightforward enough.
Which is what we are doing, except that we are not memory mapping a physical file but a RAM segment with a filename.

Sturla
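On Linux, the effect of "a RAM segment with a filename" can also be approximated with no extension module at all, by memory-mapping a file under /dev/shm (a tmpfs that lives in RAM). Sturla's module uses System V IPC instead, but the idea is the same, and the named file is equally mappable from Java or C. A sketch (name and shape illustrative):

    import numpy as np

    # /dev/shm is a RAM-backed tmpfs on Linux: a file here is effectively a
    # named shared memory segment, with no disk I/O involved.
    a = np.memmap('/dev/shm/my_segment', dtype=np.float64, mode='w+', shape=(1000,))
    a[:] = 1.0

    # Any other process attaches by name:
    b = np.memmap('/dev/shm/my_segment', dtype=np.float64, mode='r+', shape=(1000,))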

On 11.04.2011 01:20, Sturla Molden wrote:
Changes:
- 64-bit support.
- Memory leak on Linux/Unix should be gone (monkey patch for os._exit).
- Added a global lock as there are callbacks to Python (the GIL is not sufficient serialization).
I will also add a barrier synchronization primitive to this (as it is very useful for numerical computing, more so than mutexes or semaphores) and a work scheduler, so it will be easy to distribute work between the processes. The barrier is implemented with atomic read/writes on top of shared memory, so it can be sent over multiprocessing.Queue. It is also possible to implement locks, semaphores, events, etc. the same way.

Observe that the synchronization primitives in multiprocessing, e.g. multiprocessing.RLock, cannot be sent over a Queue. Named shared memory takes that restriction away. Thus, objects that are pickled for multiprocessing.Queue can contain locks, events, barriers, etc. A toy sketch of the barrier idea follows below.

Sturla
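To make the barrier idea concrete, here is a toy sense-reversing barrier on top of an anonymous shared mapping. It is not Sturla's implementation (his lives in a named segment, so the barrier itself survives pickling); it avoids atomic read-modify-write by giving every shared byte exactly one writer. Python 3, Unix only; N and the number of phases are illustrative:

    import mmap
    import os
    import time

    N = 4                                  # number of cooperating processes
    buf = mmap.mmap(-1, N + 1)             # N arrival flags + 1 release byte, zero-initialized

    def barrier_wait(rank, sense):
        buf[rank] = sense                  # announce arrival; only process `rank` writes this byte
        if rank == 0:
            while any(buf[i] != sense for i in range(N)):
                time.sleep(0)              # spin until everyone has arrived
            buf[N] = sense                 # only process 0 writes the release byte
        else:
            while buf[N] != sense:
                time.sleep(0)              # spin on the release byte
        return 1 - sense                   # reversed sense for the next round

    # fork N-1 children: ranks 1..N-1 for the children, 0 for the parent
    for rank in range(1, N):
        if os.fork() == 0:
            break
    else:
        rank = 0

    sense = 1
    for phase in range(3):                 # three barrier-separated phases
        # ... per-phase work goes here ...
        sense = barrier_wait(rank, sense)

    if rank == 0:
        for _ in range(N - 1):
            os.wait()                      # reap the children
    else:
        os._exit(0)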