[Numpy-discussion] Are masked arrays slower for processing than ndarrays?
Eric Firing
efiring at hawaii.edu
Sat May 9 20:06:30 EDT 2009
Eli Bressert wrote:
> Hi,
>
> I'm using masked arrays to compute large-scale standard deviation,
> multiplication, gaussian, and weighted averages. At first I thought
> using the masked arrays would be a great way to sidestep looping
> (which it is), but it's still slower than expected. Here's a snippet
> of the code that I'm using it for.
[...]
> # Like the spatial_weight section, this takes about 20 seconds
> W = spatial_weight / Rho2
>
> # Takes less than one second.
> Ave = np.average(av_good,axis=1,weights=W)
>
> Any ideas on why it would take such a long time for processing?
A part of the slowdown is what looks to me like unnecessary copying in
_MaskedBinaryOperation.__call__. It is using getdata, which applies
numpy.array to its input, forcing a copy. I think the copy is actually
unintentional, in at least one sense, and possibly two: first, because
the default argument of getattr is always evaluated, even if it is not
needed; and second, because the call to np.array is used where
np.asarray or equivalent would suffice.
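To illustrate both points, here is a minimal sketch. It is not the code in numpy.ma; the class HasData and the helper expensive_default are hypothetical names used only for illustration. It shows that the default passed to getattr is computed even when the attribute exists, and that np.array copies where np.asarray does not.

```python
import numpy as np

calls = []

def expensive_default(a):
    # Hypothetical stand-in for the copy made by np.array(a);
    # it records that it was called.
    calls.append("copied")
    return np.array(a)

class HasData:
    # Hypothetical object that already exposes a _data attribute.
    def __init__(self, a):
        self._data = a

a = np.arange(5.0)
obj = HasData(a)

# The third argument is evaluated before getattr itself runs,
# so the copy is made even though obj._data exists and is returned.
result = getattr(obj, "_data", expensive_default(a))
eager_copy_made = len(calls) == 1
result_is_original = result is a

# np.array copies by default; np.asarray hands back the same object
# when the input is already an ndarray.
copy_shares_memory = np.shares_memory(np.array(a), a)
asarray_is_same_object = np.asarray(a) is a
```

The usual way to avoid the eager default is a try/except AttributeError around the attribute access, so the fallback runs only when it is needed.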
The first file attached below shows the kernprof output for the case of
multiplying two masked arrays, shape (100000,50), with no masked
elements; 2/3 of the time is taken copying the data.
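For anyone who wants to try a comparison like this without a line profiler, a rough timing sketch follows. The function name time_masked_multiply and its parameters are my own assumptions, not anything from the attached profiles; it simply times the same multiplication on plain and masked arrays. The demo call uses a small shape to keep it quick; the shape used for the profiles was (100000, 50).

```python
import time
import numpy as np

def time_masked_multiply(shape, masked_fraction=0.0, repeat=3, seed=0):
    """Return (best_plain, best_masked) wall-clock seconds for x * y."""
    rng = np.random.default_rng(seed)
    x = rng.random(shape)
    y = rng.random(shape)
    # With masked_fraction=0.0 the masks are all False (nothing masked).
    a = np.ma.array(x, mask=rng.random(shape) < masked_fraction)
    b = np.ma.array(y, mask=rng.random(shape) < masked_fraction)

    best_plain = best_masked = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        x * y                      # plain ndarray multiply
        t1 = time.perf_counter()
        a * b                      # masked-array multiply
        t2 = time.perf_counter()
        best_plain = min(best_plain, t1 - t0)
        best_masked = min(best_masked, t2 - t1)
    return best_plain, best_masked

# Small demo shape; pass (100000, 50) to match the case described above.
plain_s, masked_s = time_masked_multiply((1000, 50))
plain_m, masked_m = time_masked_multiply((1000, 50), masked_fraction=0.1)
```

The numbers will depend on the NumPy version and machine, so treat them only as a relative comparison between the plain and masked cases.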
Now, if there are actually masked elements in the arrays, it gets much
worse: see the second attachment. The total time has increased by more
than a factor of 3, and the culprit is numpy.where(), a very slow
function. It looks to me like it is doing nothing useful at all; the
numpy binary operation is still being executed for all elements,
regardless of mask, contrary to the intention implied by the comment in
the code.
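The reason a selection call such as np.where cannot save any work here can be shown with a small sketch. This is an illustration under my own assumptions, not the code from numpy.ma: the helper multiply and the list evaluated are hypothetical, added only to record how many elements the operation touched.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0, 4.0])
b = np.array([2.0, 0.0, 0.0, 8.0])
mask = np.array([False, True, True, False])

evaluated = []

def multiply(x, y):
    # Hypothetical wrapper that records how many elements were multiplied.
    evaluated.append(x.size)
    return x * y

# The product is computed for every element before np.where sees it;
# np.where only chooses between two arrays that already exist.
out = np.where(mask, 0.0, multiply(a, b))

# For contrast: boolean indexing really does restrict the work
# to the unmasked elements.
sparse = np.zeros_like(a)
sparse[~mask] = multiply(a[~mask], b[~mask])
```

After running this, evaluated holds 4 for the np.where version and 2 for the indexed version, while out and sparse contain the same values.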
The third attached file has a patch that fixes the getdata problem and
eliminates the where() call.
With this patch applied we get the profile in the fourth file, to be
compared to the second profile. Much better. I am pretty sure it could
still be sped up quite a bit, though. It looks like the masks are
essentially being calculated twice for no good reason, but I don't
completely understand all the mask considerations, so at this point I am
not trying to fix that problem.
Eric
> Especially the spatial_weight and W variables? Would there be a faster
> way to do this? Or is there a way that numpy.std can ignore
> NaNs when processing?
>
> Thanks,
>
> Eli Bressert
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof1.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof2.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment-0001.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: macore.diff
Type: text/x-patch
Size: 2285 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: prof3.txt
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20090509/95381422/attachment-0002.txt>