[Numpy-discussion] Are masked arrays slower for processing than ndarrays?

Sat May 9 20:17:55 EDT 2009

Eric Firing wrote:

Pierre,

... I pressed "send" too soon.  There are test failures with the patch I 
attached to my last message.  I think the basic ideas are correct, but 
evidently there are wrinkles to be worked out.  Maybe putmask() has to 
be used instead of where() (putmask is much faster) to maintain the 
ability to do *= and similar, and maybe there are other adjustments. 
Somehow, though, it should be possible to get decent speed for simple 
multiplication and division; a 10x penalty relative to ndarray 
operations is just too much.
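
For anyone following along, here is a minimal sketch of the difference between the two calls as I understand it. The array contents and the fill value are made up for illustration, and this is not the code in the patch: np.where returns a new array, while np.putmask writes into an existing one, which is the behaviour an in-place operator such as *= depends on.

```python
import numpy as np

# Illustrative data and mask; not taken from the patch.
data = np.arange(6.0)
mask = np.array([False, True, False, False, True, False])
fill = 0.0

# np.where allocates and returns a brand-new array.
out_where = np.where(mask, fill, data)

# np.putmask modifies its first argument in place and returns None.
target = data.copy()
np.putmask(target, mask, fill)
```

Both give the same values; only the putmask form leaves the result in the original buffer.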

Eric

> Eli Bressert wrote:
>> Hi,
>>
>> I'm using masked arrays to compute large-scale standard deviation,
>> multiplication, gaussian, and weighted averages. At first I thought
>> using the masked arrays would be a great way to sidestep looping
>> (which it is), but it's still slower than expected. Here's a snippet
>> of the code that I'm using it for.
> [...]
>> # Like the spatial_weight section, this takes about 20 seconds
>> W = spatial_weight / Rho2
>>
>> # Takes less than one second.
>> Ave = np.average(av_good,axis=1,weights=W)
>>
>> Any ideas on why it would take such a long time for processing?
> 
> Part of the slowdown is what looks to me like unnecessary copying in 
> _MaskedBinaryOperation.__call__.  It is using getdata, which applies 
> numpy.array to its input, forcing a copy.  I think the copy is actually 
> unintentional, in at least one sense, and possibly two: first, because 
> the default argument of getattr is always evaluated, even if it is not 
> needed; and second, because the call to np.array is used where 
> np.asarray or equivalent would suffice.
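
A small sketch of the two effects described above. The class HasData and the helper expensive_default are hypothetical stand-ins, not the actual numpy.ma code: the default passed to getattr is evaluated before getattr runs, and np.array copies where np.asarray does not.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)

# np.array copies by default; np.asarray returns an existing ndarray as-is.
copied = np.array(a)
viewed = np.asarray(a)

calls = []

def expensive_default(x):
    """Hypothetical helper that records each call and copies its input."""
    calls.append(1)
    return np.array(x)

class HasData:
    """Hypothetical object that already carries a _data attribute."""
    _data = a

# The default expression is evaluated before getattr is called, so the
# copy is made even though _data exists and the default is discarded.
result = getattr(HasData(), "_data", expensive_default(a))
```

Here result is the original array, yet expensive_default still ran once.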
> 
> The first file attached below shows the kernprof output for 
> multiplying two masked arrays, shape (100000,50), with no masked 
> elements; 2/3 of the time is taken copying the data.
> 
> Now, if there are actually masked elements in the arrays, it gets much 
> worse: see the second attachment.  The total time has increased by more 
> than a factor of 3, and the culprit is numpy.where(), a very slow 
> function.  It looks to me like it is doing nothing useful at all; the 
> numpy binary operation is still being executed for all elements, 
> regardless of mask, contrary to the intention implied by the comment in 
> the code.
> 
> The third attached file has a patch that fixes the getdata problem and 
> eliminates the where().
> With this patch applied we get the profile in the 4th file, to be 
> compared to the second profile.  Much better.  I am pretty sure it could 
> still be sped up quite a bit, though.  It looks like the masks are 
> essentially being calculated twice for no good reason, but I don't 
> completely understand all the mask considerations, so at this point I am 
> not trying to fix that problem.
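
To make the point about the operation running on every element concrete, here is a rough sketch of the general shape of a masked binary operation. It is an illustration with made-up values, not the patched code: the ufunc is applied to the full underlying data, and the masks are combined separately.

```python
import numpy as np
import numpy.ma as ma

x = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
y = ma.array([4.0, 5.0, 6.0], mask=[False, False, True])

# The multiplication runs over every element, masked or not.
raw = np.multiply(x.data, y.data)

# The result is masked wherever either input is masked.
combined = ma.mask_or(ma.getmaskarray(x), ma.getmaskarray(y))
result = ma.array(raw, mask=combined)
```

Only the first element of result is unmasked, but all three products were computed.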
> 
> Eric
> 
> 
>> Especially the spatial_weight and W variables? Would there be a faster
>> way to do this? Or is there a way for numpy.std to ignore NaNs when
>> processing?
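
On the NaN question, two possibilities, sketched with made-up values: mask the invalid entries with numpy.ma.masked_invalid and call .std() on the result, or use np.nanstd in NumPy releases that provide it. Which is faster for large arrays is not something this sketch measures.

```python
import numpy as np
import numpy.ma as ma

# Illustrative data with one NaN per row.
values = np.array([[1.0, np.nan, 3.0],
                   [2.0, 4.0, np.nan]])

# Option 1: mask NaN/inf entries, then take the masked standard deviation.
masked = ma.masked_invalid(values)
std_masked = masked.std(axis=1)

# Option 2: np.nanstd skips NaNs directly (only in NumPy versions that have it).
std_nan = np.nanstd(values, axis=1)
```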
>>
>> Thanks,
>>
>> Eli Bressert
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion