[Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication

Christopher Barker Chris.Barker at noaa.gov
Thu Jun 16 12:44:05 EDT 2011


NOTE: I'm only taking part in this discussion because it's interesting 
and I hope to learn something. I do hope the OP chimes back in to 
clarify his needs, but in the meantime...



Bruce Southey wrote:
> Remember that is what the OP wanted to do, not me.

Actually, I don't think that's what the OP wanted -- I think we have a 
conflict between the need for concrete examples and the desire to find 
a generic solution. Here's what I think the OP wants:


How to best multiprocess a _generic_ operation that needs to be 
performed on a lot of arrays. Something like:


output = []
for a in a_bunch_of_arrays:
   output.append( a_function(a) )


More specifically, a_function() is an inner product, *defined by the user*.

So there is no way to optimize the inner product itself (that will be up 
to the user), nor any way to generally convert the bunch_of_arrays to a 
single array with a single higher-dimensional operation.
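
For concreteness, here's a minimal sketch of the multiprocessing 
version of that loop -- a_function here is just a stand-in for the 
user's inner product:

import numpy as np
from multiprocessing import Pool

def a_function(a):
    # stand-in for the user-supplied inner product
    return (a * a).sum()

if __name__ == '__main__':
    a_bunch_of_arrays = [np.random.rand(100, 100) for _ in range(50)]
    pool = Pool(processes=2)
    output = pool.map(a_function, a_bunch_of_arrays)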

In testing his approach, the OP used both a numpy multiply and a 
simple loop-through-the-elements multiply, and found that with his 
multiprocessing calls the simple loop was a fair bit faster with two 
processors, but the numpy one was slower with two processors. Of 
course, the looping method was much, much slower than the numpy one in 
any case.
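
I'd guess the two versions looked something like this (a sketch only -- 
the OP's real code is in the attached myutil.py):

import numpy as np

def numpy_inner_product(a):
    # vectorized: numpy does the loop in C
    return (a * a).sum()

def my_inner_product(a):
    # naive element-by-element loop: far slower per call, which
    # makes the multiprocessing overhead easier to amortize
    total = 0.0
    for x in a.flat:
        total += x * x
    return total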

So Sturla's comments are probably right on:

Sturla Molden wrote:

> "innerProductList = pool.map(myutil.numpy_inner_product, arrayList)"
> 
> 1.  Here we potentially have a case of false sharing and/or mutex 
> contention, as the work is too fine grained.  pool.map does not do any 
> load balancing. If pool.map is to scale nicely, each work item must take 
> a substantial amount of time. I suspect this is the main issue.
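
If granularity is the issue, handing each worker bigger batches should 
help: pool.map takes a chunksize argument for exactly that. A quick 
sketch, reusing the OP's names (the divisor is a guess -- tune it):

import numpy as np
from multiprocessing import Pool
from myutil import numpy_inner_product  # the OP's attached module

if __name__ == '__main__':
    arrayList = [np.random.rand(100, 100) for _ in range(50)]
    pool = Pool(processes=2)
    # bigger chunks amortize the per-item IPC cost
    innerProductList = pool.map(numpy_inner_product, arrayList,
                                chunksize=max(1, len(arrayList) // 8))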

> 2. There is also the question of when the process pool is spawned. 
> Though I haven't checked, I suspect it happens prior to calling 
> pool.map. But if it does not, this is a factor as well, particularly on 
> Windows (less so on Linux and Apple).

It didn't work well on my Mac, so it's either not an issue, or at 
least not Windows-specific.
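
It's easy to rule out, though -- spawn the pool once, outside the 
timed section, something like:

import time
import numpy as np
from multiprocessing import Pool

def inner_product(a):
    return (a * a).sum()

if __name__ == '__main__':
    arrayList = [np.random.rand(100, 100) for _ in range(50)]
    pool = Pool(processes=2)   # spawned up front, outside the timing
    start = time.time()
    innerProductList = pool.map(inner_product, arrayList)
    print("pool.map alone took %.4f seconds" % (time.time() - start))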

> 3.  "arrayList" is serialised by pickling, which has a significan 
> overhead.  It's not shared memory either, as the OP's code implies, but 
> the main thing is the slowness of cPickle.

I'll bet this is a big issue, and one I'm curious about how to address. 
I have another problem where I need to multi-process, and I'd love to 
know a way to pass data to the other process and back *without* going 
through pickle. Maybe memmapped files?
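
Something like this, maybe (a sketch only -- the filename, dtype, and 
the inner product are all made up): dump the arrays into one memmapped 
file up front, and pass each worker just an index, so no array data 
ever goes through pickle:

import numpy as np
from multiprocessing import Pool

FILENAME, SHAPE = 'arrays.dat', (100, 50, 50)   # 100 arrays of 50x50

def inner_product_from_memmap(i):
    # re-open the memmap in the worker: only the index crosses the
    # process boundary, the array data never gets pickled
    data = np.memmap(FILENAME, dtype='float64', mode='r', shape=SHAPE)
    a = data[i]
    return (a * a).sum()

if __name__ == '__main__':
    mm = np.memmap(FILENAME, dtype='float64', mode='w+', shape=SHAPE)
    mm[:] = np.random.rand(*SHAPE)
    mm.flush()
    pool = Pool(processes=2)
    results = pool.map(inner_product_from_memmap, range(SHAPE[0]))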

> "IPs = N.array(innerProductList)"
> 
> 4.  numpy.array is a very slow function. The benchmark should preferably 
> not include this overhead.

I re-ran, moving that call out of the timing loop, and indeed it 
helped a lot, but the multiprocessing version still takes longer.

I suspect that the overhead of pickling, etc. is overwhelming the 
operation itself. That and the load balancing issue that I don't understand!

To test this, I did a little experiment -- creating a "fake" operation, 
one that simply returns an element from the input array. It should 
take next to no time, so it lets us time the overhead of the pickling, etc.
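
The fake operation is something like this (the real version is in the 
attached shared_mem.py):

def fake_inner_product(a):
    # just return one element -- next to no computation, so the
    # measured time is almost pure pickling/IPC overhead
    return a[0, 0]

The results: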

$ python shared_mem.py

Using 2 processes
No shared memory, numpy array multiplication took 0.124427080154 seconds
Shared memory, numpy array multiplication took 0.586215019226 seconds

No shared memory, fake array multiplication took 0.000391006469727 seconds
Shared memory, fake array multiplication took 0.54935503006 seconds

No shared memory, my array multiplication took 23.5055780411 seconds
Shared memory, my array multiplication took 13.0932741165 seconds

Bingo!

The overhead of the multiprocessing is about 0.55 seconds -- that's 
essentially all the "fake" shared-memory run is measuring -- which 
explains the slowdown for the numpy method: the numpy multiply itself 
takes only about 0.12 seconds, so the overhead swamps it.

Not so mysterious after all.

Bruce Southey wrote:

> But if everything is *single-threaded* and 
> thread-safe, then you just create a function and use Anne's very useful 
> handythread.py (http://www.scipy.org/Cookbook/Multithreading).

This may be worth a try -- though the GIL could well get in the way.
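
For the record, numpy releases the GIL for many array operations, so 
plain threads can win if the user's inner product spends its time 
inside numpy. A minimal handythread-style sketch (my own names, not 
Anne's actual API):

import threading

def threaded_map(func, items, n_threads=2):
    # each thread takes every n_threads-th item; this only pays off
    # if func releases the GIL (as many numpy operations do)
    results = [None] * len(items)
    def worker(offset):
        for i in range(offset, len(items), n_threads):
            results[i] = func(items[i])
    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results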

> By the way, if the arrays are sufficiently small, there is a lot of 
> overhead involved such that there is more time in communication than 
> computation.

Yup -- clearly the case here. I wonder if it's just array size, 
though -- won't cPickle time scale with array size? So it may not be 
size per se, but rather how much computation you need for a given size 
of array.
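
That would be easy to measure -- a quick sketch of how pickle time 
grows with array size (cPickle on Python 2; plain pickle on Python 3):

import pickle
import time
import numpy as np

for n in (100, 1000, 3000):
    a = np.random.rand(n, n)
    start = time.time()
    pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)
    print("pickling a %dx%d array: %.4f s" % (n, n, time.time() - start))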

-Chris

[I've enclosed the OP's slightly altered code]




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: myutil.py
Type: application/x-python
Size: 592 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110616/e646e7bc/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shared_mem.py
Type: application/x-python
Size: 2030 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110616/e646e7bc/attachment-0001.bin>

