[Numpy-discussion] Are masked arrays slower for processing than ndarrays?

Pierre GM pgmdevlist at gmail.com
Sat May 9 20:18:49 EDT 2009


Short answer to the subject: Oh yes.
Basically, MaskedArray in its current implementation is more of a
convenience class than anything else. Most of the functions manipulating
masked arrays create a lot of temporaries. When performance matters, I
would advise you to work directly on the data and the mask.
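
To illustrate what I mean by working on the data and the mask, here is a
minimal sketch of my own (not taken from your example): a statistic over
the unmasked values can be computed on the raw ndarrays, bypassing the
MaskedArray machinery.

import numpy as np
import numpy.ma as ma

x = ma.masked_array(np.arange(10.), mask=(np.arange(10) % 3 == 0))
# .data and ma.getmaskarray give plain ndarrays; boolean indexing then
# restricts the computation to the unmasked entries only
valid = x.data[~ma.getmaskarray(x)]
print(valid.std())   # same result as x.std(), without the MA overhead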

For example, let's examine the division of 2 MaskedArrays a & b (a rough
sketch of these steps in code follows the list):
* we take the 2 ndarrays of data (da and db) and the 2 ndarrays of mask
(ma and mb)
* we create a new array for db using np.where, putting 1 where db==0 and
keeping db otherwise (if we were not doing that, we would get some NaNs
or infs down the road)
* we create a new mask m by combining ma and mb (and flagging the entries
where db==0)
* we create the result array using np.where, taking da where m is True
and da/db otherwise (if we were not doing that, we would be processing
the masked data, and we may not want that)
* then, we attach the mask m to the result array.
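
In code, the steps above look roughly like the following. This is a
simplified sketch of what happens internally, with variable names of my
own; the actual implementation goes through more machinery.

import numpy as np
import numpy.ma as ma

def divide_sketch(a, b):
    # Grab the underlying data and masks as plain ndarrays
    da, db = a.data, b.data
    mask_a, mask_b = ma.getmaskarray(a), ma.getmaskarray(b)
    # Replace zeros in the divisor by 1 so that da/db produces no NaNs/infs
    safe_db = np.where(db == 0, 1, db)
    # Combine the masks, flagging the zero-division entries as well
    m = mask_a | mask_b | (db == 0)
    # Where the result is masked, keep the original data instead of da/db
    data = np.where(m, da, da / safe_db)
    # Attach the combined mask to the result
    return ma.masked_array(data, mask=m)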

I suspect that the np.where calls are sub-optimal, and there might be a
smarter way to achieve the same result while keeping all the
functionality (no NaNs in the result, even masked ones; data kept where
it should be). I agree that this functionality might be a bit overkill
in simpler cases, such as yours. You may then want to use something like

 >>> ma.masked_array(a.data/b.data, mask=(a.mask | b.mask | (b.data==0)))

Using Eric's example, I get 229 ms/loop when dividing 2 ndarrays,
2.83 s/loop when dividing 2 masked arrays, and down to 493 ms/loop when
using the quick-and-dirty function above. So you'll still be slower
using MA than ndarrays, but not as slow...





On May 9, 2009, at 5:22 PM, Eli Bressert wrote:

> Hi,
>
> I'm using masked arrays to compute large-scale standard deviations,
> multiplications, Gaussians, and weighted averages. At first I thought
> using the masked arrays would be a great way to sidestep looping
> (which it is), but it's still slower than expected. Here's a snippet
> of the code I'm using.
>
> # Computing nearest neighbor distances.
> # Output will be about 270,000 rows long for the index
> # and 270,000x50 for the dist array.
> tree = ann.kd_tree(np.column_stack([l,b]))
> index, dist = tree.search(np.column_stack([l,b]),k=nth)
>
> # Clipping bad values by replacing them with acceptable values
> av[np.where(av<=-10)] = -10
> av[np.where(av>=50)] = 50
>
> # Distance clipping and creating mask
> dist_arcsec = np.sqrt(dist)*3600
> mask = dist_arcsec <= d_thresh
>
> # Creating masked array
> av_good = ma.array(av[index],mask=mask)
> dist_good = ma.array(dist_arcsec,mask=mask)
>
> # Reason why I'm using masked arrays. If these were
> # ndarrays with nan's, then the output would be nan.
> Std = np.array(np.std(av_good,axis=1))
> Var = Std*Std
>
> Rho = np.zeros( (len(av), nth) )
> Rho2  = np.zeros( (len(av), nth) )
>
> dist_std = np.std(dist_good,axis=1)
>
> for j in range(nth):
>     Rho[:,j] = dist_std
>     Rho2[:,j] = Var
>
> # This part takes about 20 seconds to compute for a
> # 270,000x50 masked array.
> # Using ndarrays of the same size takes about 2 seconds.
> spatial_weight = 1.0 / (Rho*np.sqrt(2*np.pi)) * \
>     np.exp(-dist_good / (2*Rho**2))
>
> # Like the spatial_weight section, this takes about 20 seconds
> W = spatial_weight / Rho2
>
> # Takes less than one second.
> Ave = np.average(av_good,axis=1,weights=W)
>
> Any ideas on why it takes such a long time to process, especially the
> spatial_weight and W variables? Would there be a faster way to do
> this? Or is there a way to make numpy.std ignore nan's when
> processing?
>
> Thanks,
>
> Eli Bressert



