[Numpy-discussion] Are masked arrays slower for processing than ndarrays?

Sat May 9 20:06:30 EDT 2009

Eli Bressert wrote:
> Hi,
> 
> I'm using masked arrays to compute large-scale standard deviation,
> multiplication, gaussian, and weighted averages. At first I thought
> using the masked arrays would be a great way to sidestep looping
> (which it is), but it's still slower than expected. Here's a snippet
> of the code that I'm using it for.
[...]
> # Like the spatial_weight section, this takes about 20 seconds
> W = spatial_weight / Rho2
> 
> # Takes less than one second.
> Ave = np.average(av_good,axis=1,weights=W)
> 
> Any ideas on why it would take such a long time for processing?

Part of the slowdown is what looks to me like unnecessary copying in 
_MaskedBinaryOperation.__call__.  It uses getdata, which applies 
numpy.array to its input, forcing a copy.  I think the copy is actually 
unintentional, in at least one sense, and possibly two: first, because 
the default argument of getattr is always evaluated, even if it is not 
needed; and second, because np.array is called where np.asarray or 
equivalent would suffice.
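
Both points can be seen in a small sketch.  The names Holder and 
expensive_default are made up for illustration and are not taken from 
numpy.ma; the first part shows that getattr's default expression is 
evaluated even when the attribute exists, and the second shows np.array 
copying where np.asarray does not.

```python
import numpy as np

calls = []

def expensive_default():
    # Hypothetical stand-in for a costly fallback; records each call.
    calls.append(1)
    return None

class Holder:
    # Hypothetical object that already has the attribute we ask for.
    _data = "present"

# The default is an ordinary argument, so it is evaluated before getattr
# runs, even though Holder has _data and the default is never used.
value = getattr(Holder(), "_data", expensive_default())

a = np.arange(6.0)
copied = np.array(a)     # np.array copies its input by default
viewed = np.asarray(a)   # np.asarray hands back the same ndarray, no copy
```

Here calls ends up with one entry although value comes from the 
attribute, and only copied owns new memory.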

The first file attached below shows the kernprof output for multiplying 
two masked arrays, shape (100000,50), with no masked elements; 
two-thirds of the time is spent copying the data.
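
For a rough comparison without a line profiler, something like the 
following could be used.  It is only a sketch: time_multiply is a 
made-up helper, the random inputs are an assumption, and the numbers 
will depend on the machine and the numpy version, so none are quoted 
here.

```python
import timeit
import numpy as np

def time_multiply(shape=(100000, 50), repeats=3):
    # Hypothetical helper: best-of-N seconds for multiplying two plain
    # ndarrays and two masked arrays of the same shape.
    rng = np.random.default_rng(0)
    x = rng.random(shape)
    y = rng.random(shape)
    mx = np.ma.masked_array(x)   # no masked elements, as in the first profile
    my = np.ma.masked_array(y)
    t_plain = min(timeit.repeat(lambda: x * y, number=1, repeat=repeats))
    t_masked = min(timeit.repeat(lambda: mx * my, number=1, repeat=repeats))
    return t_plain, t_masked
```

Calling time_multiply() returns the pair (plain, masked) in seconds.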

Now, if there are actually masked elements in the arrays, it gets much 
worse: see the second attachment.  The total time has increased by more 
than a factor of 3, and the culprit is numpy.where(), a very slow 
function.  It looks to me like it is doing nothing useful at all; the 
numpy binary operation is still being executed for all elements, 
regardless of mask, contrary to the intention implied by the comment in 
the code.

The third attached file has a patch that fixes the getdata problem and 
eliminates the where().  With this patch applied we get the profile in 
the fourth file, to be compared to the second profile.  Much better.  I 
am pretty sure it could still be sped up quite a bit, though.  It looks 
like the masks are essentially being calculated twice for no good 
reason, but I don't completely understand all the mask considerations, 
so at this point I am not trying to fix that problem.
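
The point about the binary operation running on every element, masked 
or not, can be checked with a small sketch.  It uses np.where, taken 
here as the call in question, on made-up arrays d1, d2 and mask; it is 
not the numpy.ma code and not the patch.

```python
import numpy as np

d1 = np.array([1.0, 2.0, 3.0, 4.0])
d2 = np.array([2.0, 0.0, 2.0, 0.0])
mask = d2 == 0   # True marks the elements we would like to skip

# The third argument is fully computed before np.where ever sees it, so
# the division runs on every element, including the masked ones.
with np.errstate(divide="raise"):
    try:
        np.where(mask, d1, d1 / d2)
        divided_everywhere = False
    except FloatingPointError:
        divided_everywhere = True

# Same values without np.where: compute everything once, then overwrite
# the masked slots in place.
with np.errstate(divide="ignore"):
    via_where = np.where(mask, d1, d1 / d2)
    in_place = d1 / d2
    in_place[mask] = d1[mask]
```

The division by zero at the masked positions is raised even though 
np.where would discard those results, and the in-place version gives 
the same array.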

Eric

> Especially the spatial_weight and W variables? Would there be a faster
> way to do this? Or is there a way that numpy.std can ignore NaNs
> when processing?
> 
> Thanks,
> 
> Eli Bressert

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof1.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof2.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment-0001.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: macore.diff
Type: text/x-patch
Size: 2285 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof3.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment-0002.txt>