multiprocessing shared arrays and numpy

Hi people,

I was wondering about the status of using the standard library multiprocessing module with numpy. I found a cookbook example, last updated one year ago, which states that:

"This page was obsolete as multiprocessing's internals have changed. More information will come shortly; a link to this page will then be added back to the Cookbook."

http://www.scipy.org/Cookbook/multiprocessing

I also found the code that used to be on this page in the cookbook, but it does not work any more. So my question is: is it possible to use numpy arrays as shared arrays in an application using multiprocessing, and how do you do it?

Best regards,
Jesper

On Wednesday 03 March 2010 15:31:29, Jesper Larsen wrote:
Is it possible to use numpy arrays as shared arrays in an application using multiprocessing and how do you do it?
Yes, it is pretty easy if your problem can be vectorised. Just split your arrays into chunks and assign the computation of each chunk to a different process. I'm attaching code that does this for computing a polynomial over a certain range. Here is the output (for a dual-core processor):

Serial computation...
10000000 0
Time elapsed in serial computation: 3.438
3333333 0
3333334 1
3333333 2
Time elapsed in parallel computation: 2.271 with 3 threads
Speed-up: 1.51x

-- Francesc Alted
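Francesc's attachment is not included in this archive. A minimal sketch of the approach he describes (split the range into chunks and hand each chunk to a worker process) might look like the following; the chunk count, range and polynomial coefficients are illustrative, not taken from his script.

------------------------------------------------------
import time
import numpy as np
from multiprocessing import Pool

def poly_chunk(bounds):
    # Evaluate the polynomial on one chunk of the range [start, stop).
    start, stop = bounds
    x = np.linspace(start, stop, 1000000)
    return ((0.25 * x + 0.75) * x - 1.5) * x - 2

if __name__ == '__main__':
    nchunks = 3
    edges = np.linspace(0.0, 1.0, nchunks + 1)
    chunks = list(zip(edges[:-1], edges[1:]))

    t0 = time.time()
    serial = [poly_chunk(c) for c in chunks]
    print('serial:   %.3f s' % (time.time() - t0))

    t0 = time.time()
    pool = Pool(processes=nchunks)
    parallel = pool.map(poly_chunk, chunks)
    pool.close()
    pool.join()
    print('parallel: %.3f s' % (time.time() - t0))
------------------------------------------------------

The actual speed-up depends on memory bandwidth as well as core count, which is the point discussed later in this thread.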

There is work by Sturla Molden: look for multiprocessing-tutorial.pdf and sharedmem-feb13-2009.zip. The tutorial includes what was dropped from the cookbook page. I am looking into the same issue and am going to test it today.

Nadav

Maybe the attached file can help. Adapted and tested on amd64 Linux.

Nadav
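The attached file itself is not preserved in this archive. For readers following along, a minimal sketch of the underlying idea (a numpy view on a multiprocessing shared buffer, visible to both parent and child without copying) could look like this; the names, dtype and shape are illustrative, and this is not Sturla's or Nadav's API.

------------------------------------------------------
import numpy as np
from multiprocessing import Process, sharedctypes

def fill(raw, shape):
    # Re-wrap the shared buffer in the child; no data is copied.
    a = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    a[:] = 42.0

if __name__ == '__main__':
    shape = (4, 5)
    # RawArray allocates anonymous shared memory (with no lock around it).
    raw = sharedctypes.RawArray('d', int(np.prod(shape)))
    arr = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    arr[:] = 0.0

    p = Process(target=fill, args=(raw, shape))
    p.start()
    p.join()
    print(arr[0])   # the child's writes are visible in the parent
------------------------------------------------------

If concurrent writers need synchronisation, multiprocessing.Array (which wraps the same kind of buffer with a lock) can be used instead of RawArray.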

Extended module that I used for some useful work. Comments:

1. Sturla's module is better designed, but did not work with very large (although sub-GB) arrays.
2. Tested on 64-bit Linux (amd64) + python-2.6.4 + numpy-1.4.0.

Nadav

What kind of calculations are you doing with this module? Can you please send some examples and the speed-ups you are getting?

Thanks,
Francesc
-- Francesc Alted

I cannot give a reliable answer yet, since I have some more improvements to make. The application is an analysis of a stereoscopic-movie raw-data recording (both channels are recorded in the same file). I treat the data as a huge memory-mapped file. The idea was to process each channel (left and right) on a different core. Right now the application is I/O bound since I do classical numpy operations, so each channel (which is handled as one array) is scanned several times. The improvement over a single process is currently 10%, but I hope to achieve 10% more after trivial optimizations.

I used this application as an excuse to dive into multiprocessing. I hope that the code I posted here will help someone.

Nadav

Yeah, a 10% improvement from using multiple cores is an expected figure for memory-bound problems. This is something people must know: if their computations are memory bound (and this is much more common than one may initially think), then they should not expect significant speed-ups in their parallel codes.

Thanks for sharing your experience anyway,
Francesc
-- Francesc Alted

On Fri, Mar 05, 2010 at 09:53:02AM +0100, Francesc Alted wrote:
Yeah, a 10% improvement from using multiple cores is an expected figure for memory-bound problems. This is something people must know: if their computations are memory bound (and this is much more common than one may initially think), then they should not expect significant speed-ups in their parallel codes.
Hey Francesc,

Any chance this can be different for NUMA (non-uniform memory access) architectures? AMD multicores used to be NUMA, back when I was still following these problems.

FWIW, I observe very good speedups on my problems (pretty much linear in the number of CPUs), and I have data-parallel problems on fairly large data (~100 MB apiece, doesn't fit in cache), with no synchronisation at all between the workers. CPUs are Intel Xeons.

Gael

Gael,

On Fri, Mar 05, 2010 at 10:51:12AM +0100, Gael Varoquaux wrote:
Any chance this can be different for NUMA (non uniform memory access) architectures? AMD multicores used to be NUMA, when I was still following these problems.
As far as I can tell, NUMA architectures work best at accelerating independent processes that run independently of each other. In this case, the hardware is in charge of putting closely related data in memory that is 'nearer' to each processor. This scenario *could* happen in truly parallel processes too, but as I said, in general it works best for independent processes (read: multiuser machines).
FWIW, I observe very good speedups on my problems (pretty much linear in the number of CPUs), and I have data-parallel problems on fairly large data (~100 MB apiece, doesn't fit in cache), with no synchronisation at all between the workers. CPUs are Intel Xeons.
Maybe your processes are not as memory-bound as you think. Do you get much better speed-up by using NUMA than a simple multi-core machine with one single path to memory? I don't think so, but maybe I'm wrong here. Francesc

On Fri, Mar 05, 2010 at 08:14:51AM -0500, Francesc Alted wrote:
Maybe your processes are not as memory-bound as you think.
That's the only explanation I can think of. I have two types of bottlenecks. One is BLAS level 3 operations (mainly SVDs) on large matrices; the second is resampling, where we repeat the same operation many times over almost the same chunk of data. In both cases the data is fairly large, so I expected the operations to be memory bound.

However, thinking of it, I believe that when I had timed these operations carefully, it seemed that processes were alternating between a starving period, during which they were I/O-bound, and a productive period, during which they were CPU-bound. After a few cycles, the different periods would fall into a mutually desynchronised alternation, with one process I/O-bound and the others CPU-bound, and it would become fairly efficient. Of course, this is possible because I have no cross-talk between the processes.
Do you get much better speed-up by using NUMA than a simple multi-core machine with one single path to memory? I don't think so, but maybe I'm wrong here.
I don't know. All the boxes around here have Intel CPUs, and I believe they are all SMP.

Gaël

On Friday 05 March 2010 14:46:00, Gael Varoquaux wrote:
That's the only explanation I can think of. I have two types of bottlenecks. One is BLAS level 3 operations (mainly SVDs) on large matrices; the second is resampling, where we repeat the same operation many times over almost the same chunk of data. In both cases the data is fairly large, so I expected the operations to be memory bound.
Not at all. BLAS 3 operations are mainly CPU-bound, because the algorithms (if they are correctly implemented, of course, but any decent BLAS 3 library will do) have many chances to reuse data from caches. BLAS 1 (and lately 2 too) operations are the ones that are memory-bound.

And in your second case, you are repeating the same operation over the same chunk of data. If this chunk is small enough to fit in cache, then the bottleneck is the CPU again (and probably access to L1/L2 cache), not access to memory. But if, as you said, you are seeing periods that are memory-bound (i.e. CPUs are starving), then it may well be that this chunksize does not fit well in cache, and in that case your problem is memory access. Maybe you can get better performance by reducing your chunksize so that it fits in cache (L1 or L2).

So, I do not think that NUMA architectures would perform your current computations any better than your current SMP platform (and you know that NUMA architectures are much more complex and expensive than SMP ones). But experimenting is *always* the best answer to these hairy questions ;-)

-- Francesc Alted
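A small sketch of the cache-blocking idea above: when the same operation is applied repeatedly, doing all passes over one cache-sized block before moving to the next lets the block be reused from L1/L2 cache instead of being streamed from main memory on every pass. The operation and block size here are illustrative; the right block size depends on the cache of the actual machine.

------------------------------------------------------
import numpy as np

def repeated_op_streaming(x, npasses=10):
    # Each pass streams the whole array through main memory.
    for _ in range(npasses):
        x *= 1.0000001
    return x

def repeated_op_blocked(x, npasses=10, blockbytes=256 * 1024):
    # Do all passes on one block before moving on, so the block
    # can stay resident in cache between passes.
    n = max(1, blockbytes // x.itemsize)
    for start in range(0, x.size, n):
        block = x[start:start + n]      # a view, no copy is made
        for _ in range(npasses):
            block *= 1.0000001
    return x

if __name__ == '__main__':
    a = np.random.rand(10000000)
    b = a.copy()
    # Both variants compute the same thing; only the access pattern differs.
    assert np.allclose(repeated_op_streaming(a), repeated_op_blocked(b))
------------------------------------------------------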

Francesc,

Yeah, a 10% improvement from using multiple cores is an expected figure for memory-bound problems. This is something people must know: if their computations are memory bound (and this is much more common than one may initially think), then they should not expect significant speed-ups in their parallel codes.
+1. Thanks for emphasizing this. This is definitely a big issue with multicore.

Cheers,
Brian

I did some optimization, and the results are very instructive, although not surprising. As I wrote before, I processed stereoscopic movie recordings by making each one a memory-mapped file and processing it in several steps. This way I produced an extra GB of transient data. Running as one process took 45 seconds, and as two parallel processes ~40 seconds.

I then rewrote the application to process the recording frame by frame. The code became shorter, and the new scores are: one process, 16 seconds; dual process, 9 seconds.

What I learned:

* Design for multiprocessing from the start, not as an afterthought.
* Shared memory works, but at the expense of code elegance (much like common blocks in Fortran).
* Memory-mapped files can be used much like shared memory. The strange thing is that I got an ignored AttributeError on every frame access to the memory-mapped file from the child process.

Nadav
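The recording layout, dtype and per-frame statistic below are invented for illustration; the sketch only shows the structure Nadav describes (each process memmaps the file itself and walks its own channel frame by frame), not his actual analysis code.

------------------------------------------------------
import numpy as np
from multiprocessing import Process, Queue

FNAME = 'stereo.dat'              # hypothetical raw recording
NFRAMES, H, W = 30, 100, 100      # hypothetical geometry, two channels per frame
DTYPE = np.int32

def process_channel(channel, out):
    # Each worker opens its own read-only memmap and touches one frame
    # at a time, so only a frame per channel is hot in memory.
    rec = np.memmap(FNAME, dtype=DTYPE, mode='r', shape=(NFRAMES, 2, H, W))
    total = 0.0
    for i in range(NFRAMES):
        frame = rec[i, channel]
        total += frame.mean()         # stand-in for the real per-frame analysis
    out.put((channel, total / NFRAMES))

if __name__ == '__main__':
    # Dummy interleaved two-channel recording so the sketch is self-contained.
    np.arange(NFRAMES * 2 * H * W, dtype=DTYPE).tofile(FNAME)

    out = Queue()
    procs = [Process(target=process_channel, args=(c, out)) for c in (0, 1)]
    for p in procs:
        p.start()
    results = dict(out.get() for _ in procs)
    for p in procs:
        p.join()
    print(results)
------------------------------------------------------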

Hi, here are a few notes:

A) cache
B) multiple cores/CPUs multiply other optimisations.

A) Understanding cache is very useful. Cache exists at two levels: 1. disk cache, 2. CPU/core cache.

1. Mmap'd files are useful since you can reuse the disk cache as program memory, so large files don't waste RAM on the disk cache. For example, processing a 1 GB file can use 1 GB of memory with mmap, but 2 GB without. (Also, learn about madvise for extra goodness.) mmap behaviour is very different on Windows/Linux/Mac OS X; the best mmap implementation is on Linux. Note that on some OSes the disk cache has separate reserved areas of memory which processes cannot use, so mmap is the easiest way to access it. mmapping on SSDs is also quite fast. (A sketch of this mmap-plus-numpy pattern follows after this message.)

2. CPU cache is what can give you a speedup when you use extra CPUs/cores. There are a number of different CPU architectures these days, but generally you will get a speedup if your CPUs access different areas of memory. So don't have cpu1 process one part of the data and then cpu2 the same part; otherwise the cache can get invalidated, especially if you have an 8 MB cache per CPU. This is why Xeons and other high-end CPUs will give you numpy speedups more easily. Also consider processing in chunks smaller than the size of your cache (especially for multi-pass operations). There's a lot to caching, but I think the above gives enough useful hints.

B) Multiple processes can multiply the effects of your other optimisations. A 2x speedup via SSE or other SIMD can be multiplied over each CPU/core. So if your code gets 8x faster with multiple processes, then the 2x optimisation is likely a 16x speedup. The following is a common optimisation pattern with Python code: from Python to numpy you can get a 20x speedup; from numpy to C/C++ you can get up to a 5x speedup (or 50x over Python); then an asm optimisation is 2-4x faster again. So up to 200x faster compared to pure Python; multiply that by 8x and you have up to 1600x faster code. Small optimisations also add up: a 0.2x speedup can easily turn into a 1.6x speedup when you have multiple CPUs. So as you can see, multiple cores make it EASIER to optimise programs, since your optimisations are often multiplied.

cu,
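A small sketch of the mmap point above (A.1): map a data file read-only and wrap it with numpy, so the pages live in the OS page cache and several processes mapping the same file share one physical copy. The file name and dtype are made up; madvise on mmap objects only exists on Python 3.8+ on POSIX systems, hence the guards.

------------------------------------------------------
import mmap
import numpy as np

FNAME = 'big_data.f64'            # hypothetical file of float64 values
# Create a dummy data file so the sketch is self-contained.
np.arange(1000000, dtype=np.float64).tofile(FNAME)

with open(FNAME, 'rb') as f:
    # ACCESS_READ maps the file read-only; pages come from the page cache.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mm, 'madvise') and hasattr(mmap, 'MADV_SEQUENTIAL'):
        mm.madvise(mmap.MADV_SEQUENTIAL)   # hint: we will read front to back
    data = np.frombuffer(mm, dtype=np.float64)   # zero-copy, read-only view
    print(data.sum())
    del data                      # drop the view before closing the map
    mm.close()
------------------------------------------------------

numpy.memmap does essentially the same thing in one call; the explicit mmap version is shown only because it is what the notes above refer to.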

On Sun, Mar 07, 2010 at 07:00:03PM +0000, René Dudfield wrote:
1. Mmap'd files are useful since you can reuse disk cache as program memory. So large files don't waste ram on the disk cache.
I second that. mmapping has worked very well for me for large datasets, especially in the context of reducing memory pressure.

Gaël

On Sunday 07 March 2010 20:03:21, Gael Varoquaux wrote:
I second that. mmapping has worked very well for me for large datasets, especially in the context of reducing memory pressure.
As far as I know, memmap files (or rather, the underlying OS) *use* all available RAM for loading data until RAM is exhausted and then start to use swap, so the "memory pressure" is still there. But I may be wrong...

-- Francesc Alted

On Thu, Mar 11, 2010 at 10:04:36AM +0100, Francesc Alted wrote:
As far as I know, memmap files (or rather, the underlying OS) *use* all available RAM for loading data until RAM is exhausted and then start to use swap, so the "memory pressure" is still there. But I may be wrong...
I believe that your above assertion is 'half' right. First, I think that it is not swap that the memmapped file uses, but the original disk space, thus you avoid running out of swap. Second, if you open the same data several times without memmapping, I believe it will be duplicated in memory. On the other hand, when you memmap, it is not duplicated: thus if you are running several processing jobs on the same data, you save memory. I am very much in this case.

Gaël

Here is a strange thing I am getting with multiprocessing and a memory-mapped array. The script below generates the error message 30 times (once for every slice access):

Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(2949995000.0)> ignored

although I get the correct answer eventually.

------------------------------------------------------
import numpy as N
import multiprocessing as MP

def average(cube):
    return [plane.mean() for plane in cube]

N.arange(30*100*100, dtype=N.int32).tofile(open('30x100x100_int32.dat','w'))
data = N.memmap('30x100x100_int32.dat', dtype=N.int32, shape=(30,100,100))

pool = MP.Pool(processes=1)
job = pool.apply_async(average, [data,])
print job.get()
------------------------------------------------------

I use python 2.6.4 and numpy 1.4.0 on 64-bit Linux (amd64).

Nadav
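The cause of the ignored AttributeError is not settled in this thread; my guess is that it is related to the memmap object being pickled into the worker process. A variant that sidesteps the question by passing only the file name and shape, and reopening the memmap read-only inside the worker, is sketched below (Python 3 syntax); it is a workaround, not a diagnosis.

------------------------------------------------------
import numpy as np
import multiprocessing as MP

FNAME, SHAPE, DTYPE = '30x100x100_int32.dat', (30, 100, 100), np.int32

def average(_):
    # Reopen the memmap inside the worker instead of pickling it across.
    cube = np.memmap(FNAME, dtype=DTYPE, mode='r', shape=SHAPE)
    return [float(plane.mean()) for plane in cube]

if __name__ == '__main__':
    np.arange(np.prod(SHAPE), dtype=DTYPE).tofile(FNAME)
    pool = MP.Pool(processes=1)
    job = pool.apply_async(average, [None])
    print(job.get())
------------------------------------------------------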

On Thursday 11 March 2010 10:36:42, Gael Varoquaux wrote:
I believe that your above assertion is 'half' right. First, I think that it is not swap that the memmapped file uses, but the original disk space, thus you avoid running out of swap. Second, if you open the same data several times without memmapping, I believe it will be duplicated in memory. On the other hand, when you memmap, it is not duplicated: thus if you are running several processing jobs on the same data, you save memory. I am very much in this case.
Mmh, this is not my experience. During the past month, in a course I was asking the students to compare the memory consumption of numpy.memmap and tables.Expr (a module for performing out-of-memory computations in PyTables). The idea was precisely to show that, contrarily to tables.Expr, numpy.memmap computations do take a lot of memory when they are being accessed. I'm attaching a slightly modified version of that exercise. In it, one has to compute a polynomial over a certain range. Here is the output of the script for the numpy.memmap case on a machine with 8 GB RAM and 6 GB of swap:

Total size for datasets: 7629.4 MB
Populating x using numpy.memmap with 500000000 points...
Total file sizes: 4000000000 -- (3814.7 MB)
*** Time elapsed populating: 70.982
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using numpy.memmap
Total file sizes: 8000000000 -- (7629.4 MB)
**** Time elapsed computing: 81.727
10.08user 13.37system 2:33.26elapsed 15%CPU (0avgtext+0avgdata 0maxresident)k
7808inputs+15625008outputs (39major+5750196minor)pagefaults 0swaps

While the computation was going on, I spied on the process with the top utility, which told me that the total virtual size consumed by the Python process was 7.9 GB, with a total *resident* memory of 6.7 GB (!). And this should not just be a top malfunction, because I checked that, by the end of the computation, my machine started to swap some processes out (i.e. the working set above was too large to allow the OS to keep everything in memory).

Now, just for the sake of comparison, I tried running the same script but using tables.Expr. Here is the output:

Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 4000631280 -- (3815.3 MB)
*** Time elapsed populating: 78.817
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 8001261168 -- (7630.6 MB)
**** Time elapsed computing: 155.836
13.11user 18.59system 3:58.61elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
7842784inputs+15632208outputs (28major+940347minor)pagefaults 0swaps

and top was telling me that memory consumption was 148 MB of total virtual size and just 44 MB resident (as expected, because the computation was really made using an out-of-core algorithm). Interestingly, when using compression (Blosc level 4, in this case), the time to do the computation with tables.Expr is reduced a lot:

Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 1080130765 -- (1030.1 MB)
*** Time elapsed populating: 30.005
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 2415761895 -- (2303.9 MB)
**** Time elapsed computing: 40.048
37.11user 6.98system 1:12.88elapsed 60%CPU (0avgtext+0avgdata 0maxresident)k
45312inputs+4720568outputs (4major+989323minor)pagefaults 0swaps

while memory consumption is about the same as above: 148 MB / 45 MB.

So, in my experience, numpy.memmap is really using that large chunk of memory (unless my testbed is badly programmed, in which case I'd be grateful if you can point out what's wrong).

-- Francesc Alted

On Thu, Mar 11, 2010 at 02:26:49PM +0100, Francesc Alted wrote:
Mmh, this is not my experience. During the past month, in a course I was asking the students to compare the memory consumption of numpy.memmap and tables.Expr (a module for performing out-of-memory computations in PyTables).
[snip]
So, in my experience, numpy.memmap is really using that large chunk of memory (unless my testbed is badly programmed, in which case I'd be grateful if you can point out what's wrong).
OK, so what you are saying is that my assertion #1 was wrong. Fair enough; as I was writing it, I was thinking that I had no hard facts to back it. How about assertion #2? It is the only 'story' I can think of to explain why I can run parallel computations when I use memmap that blow up if I don't use memmap. Also, could it be that the memmap mode changes things? I use only the 'r' mode, which is read-only.

This is all very interesting, and you have much more insight into these problems than me. Would you be interested in coming to EuroSciPy in Paris to give a 1- or 2-hour tutorial on memory and IO problems and how you address them with PyTables? It would be absolutely thrilling. I must warn you that I am afraid we won't be able to pay for your trip, though, as I want to keep the price of the conference low.

Best,
Gaël

On Thursday 11 March 2010 14:35:49, Gael Varoquaux wrote:
OK, so what you are saying is that my assertion #1 was wrong. Fair enough; as I was writing it, I was thinking that I had no hard facts to back it. How about assertion #2? It is the only 'story' I can think of to explain why I can run parallel computations when I use memmap that blow up if I don't use memmap.
Well, I must tell you that I have no experience with running memmapped arrays in parallel computations, but it sounds like they can actually behave as shared-memory arrays, so yes, you may definitely be right for #2, i.e. memmapped data is not duplicated when accessed in parallel by different processes (in read-only mode, of course), which is certainly a very interesting technique for sharing data between parallel processes. Thanks for pointing this out!
Also, could it be that the memmap mode changes things? I use only the 'r' mode, which is read-only.
I don't think so. When doing the computation, I open the x values in read-only mode, and memory consumption is still there.
This is all very interesting, and you have much more insight into these problems than me. Would you be interested in coming to EuroSciPy in Paris to give a 1- or 2-hour tutorial on memory and IO problems and how you address them with PyTables? It would be absolutely thrilling. I must warn you that I am afraid we won't be able to pay for your trip, though, as I want to keep the price of the conference low.
Yes, no problem. I was already thinking about presenting something at EuroSciPy, and a tutorial about PyTables and memory/IO would be really great for me. We can nail down the details off-list.

-- Francesc Alted
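To make the technique Francesc confirms above concrete, here is a minimal sketch (the file name, shape and per-row reduction are invented) of several workers mapping the same file read-only via a Pool initializer, so each worker opens the memmap once and the OS shares the underlying pages between them:

------------------------------------------------------
import numpy as np
from multiprocessing import Pool

FNAME, SHAPE, DTYPE = 'shared_data.f64', (1000, 1000), np.float64   # hypothetical
_data = None                        # per-worker global, set by the initializer

def init_worker(fname, shape, dtype):
    # Open the memmap once per worker; the pages are backed by the file
    # and shared read-only between workers through the OS page cache.
    global _data
    _data = np.memmap(fname, dtype=dtype, mode='r', shape=shape)

def row_sum(i):
    return float(_data[i].sum())

if __name__ == '__main__':
    # Dummy data file so the sketch is self-contained.
    np.random.rand(*SHAPE).astype(DTYPE).tofile(FNAME)

    pool = Pool(processes=4, initializer=init_worker,
                initargs=(FNAME, SHAPE, DTYPE))
    sums = pool.map(row_sum, range(SHAPE[0]))
    pool.close()
    pool.join()
    print(sum(sums))
------------------------------------------------------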
participants (6)
- Brian Granger
- Francesc Alted
- Gael Varoquaux
- Jesper Larsen
- Nadav Horesh
- René Dudfield