[Numpy-discussion] ANN: maskedarray

Thu Sep 27 11:45:11 EDT 2007

All,
The latest version of maskedarray has just been released on the scipy SVN 
sandbox. This version fixes the inconsistencies in filling (see below) and 
introduces some minor modifications for optimization purposes (see below as 
well). Many thanks to Eric Firing and Matt Knox for the fruitful discussions 
at the origin of this release! 

In addition, a bench.py file has been introduced, to compare the speed of 
numpy.ma and maskedarray. Once again, thanks to Eric for his first draft. 

Please feel free to try it and send me some feedback.

Modifications:
* Consistent filling
In numpy.ma, the division of array A by array B works in several steps:
- A is filled w/ 0
- B is filled w/ 1
- A/B is computed
- the output mask is updated as the combination of A.mask, B.mask and the 
domain mask (B==0)
The problems with this approach are that (i) it's not useful to fill A and B 
beforehand if the values will be masked anyway; (ii) nothing prevents infs from 
showing up, as the domain is taken into account only at the end.
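
To make the second problem concrete, here is a minimal sketch of the steps 
listed above, written with plain numpy arrays rather than the actual numpy.ma 
code. The names a, b, a_mask and b_mask are made up for the illustration.

```python
import numpy as np

# Hypothetical inputs: data arrays and their masks (True = masked).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 0.0, 4.0, 0.5])
a_mask = np.array([False, False, True, False])
b_mask = np.array([False, False, False, False])

# Steps 1 and 2: fill the masked entries of A with 0 and those of B with 1.
a_filled = np.where(a_mask, 0.0, a)
b_filled = np.where(b_mask, 1.0, b)

# Step 3: compute A/B. The unmasked zero in B is still there, so an inf
# is produced (the warning is silenced only to keep the output clean).
with np.errstate(divide="ignore", invalid="ignore"):
    old_result = a_filled / b_filled

# Step 4: the domain (B==0) is only taken into account here, in the mask.
old_mask = a_mask | b_mask | (b == 0)
```

The inf at index 1 is hidden by the output mask, but it is still present in 
the underlying data.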

In this latest version of maskedarray, the same division is decomposed as:
- a copy of B._data is filled with 1 wherever the domain condition (B==0) holds
- the division of A._data by this copy is computed
- the output mask is updated as the combination of A.mask, B.mask and the 
domain mask (B==0).
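
The decomposition above can be sketched as follows. This is an illustration 
with plain numpy arrays, not the code actually used in maskedarray; the helper 
name domained_divide and its arguments are assumptions made for the example.

```python
import numpy as np

def domained_divide(a_data, a_mask, b_data, b_mask):
    """Hypothetical helper: divide with prefilling on the domain."""
    # The domain of division: entries where the denominator is zero.
    domain = (b_data == 0)
    # Fill a copy of the denominator with 1 where the domain applies.
    b_filled = np.where(domain, 1, b_data)
    # The division can no longer produce infs or nans from zeros in B.
    result = a_data / b_filled
    # Combine both input masks with the domain mask.
    out_mask = a_mask | b_mask | domain
    return result, out_mask

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 0.0, 4.0, 0.5])
a_mask = np.array([False, False, True, False])
b_mask = np.array([False, False, False, False])

new_result, new_mask = domained_divide(a, a_mask, b, b_mask)
```

Here every entry of new_result is finite, and the entry where b is zero is 
masked in new_mask.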

Prefilling on the domain avoids the presence of nans/infs. However, this comes 
at the price of making some functions and methods slower than their numpy.ma 
counterparts, as you'll be able to observe for sqrt and log with the bench.py 
file. An alternative would be to avoid filling at all, at the risk of leaving 
nans and infs.

* masked_invalid / fix_invalid
Two new functions are introduced. 
masked_invalid(x) masks x where x is nan or inf.
fix_invalid(x) returns (a copy of) x, where invalid values (nans & infs) are 
replaced by fill_value. 
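
A short usage sketch of the two functions follows. It relies on the functions 
of the same names found in present-day numpy.ma, on the assumption that they 
behave like the sandbox versions described here; the fill value of -999.0 is 
an arbitrary choice for the example.

```python
import numpy as np
import numpy.ma as ma

x = np.array([1.0, np.nan, 3.0, np.inf])

# Mask the entries of x that are nan or inf; the data are left as they are.
mx = ma.masked_invalid(x)

# Work on a copy of x: invalid entries are masked and the underlying
# data are overwritten with the given fill value.
fx = ma.fix_invalid(x, fill_value=-999.0)
```

In both results the entries at positions 1 and 3 are masked; in fx the data 
underneath them hold -999.0 instead of nan and inf, and x itself is unchanged.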

* No mask shrinking
Following Paul Dubois and Sasha's example, I eventually had to get rid of the 
semi-automatic shrinking of the mask in __getitem__, which appeared to be a 
major bottleneck. In other words, one can end up with a mask full of False 
instead of nomask, which may slow things down a bit. You can force a mask 
back to nomask with the new shrink_mask method.
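
As an illustration, the sketch below uses the shrink_mask method as it exists 
in present-day numpy.ma, assuming it matches the method introduced here. The 
array x is made up for the example.

```python
import numpy.ma as ma

# An explicit mask that happens to contain only False values.
x = ma.array([1, 2, 3], mask=[False, False, False])

# The mask is a full boolean array, not the nomask singleton.
full_before = ma.getmask(x) is not ma.nomask

# Collapse the all-False mask back to nomask.
x.shrink_mask()
shrunk_after = ma.getmask(x) is ma.nomask
```

Testing against nomask with "is", as done here through getmask, is a cheap way 
to check whether an array carries a real mask.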

* _sharedmask
Here again, I followed Paul and Sasha's ideas and reintroduced the _sharedmask 
flag to prevent unwanted propagation of the mask. When creating a new array 
with x=masked_array(data, mask=m), x._mask is initially a reference to m and 
x._sharedmask is True. When x is modified, x._mask is copied to prevent a 
propagation back to m.
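
The copy-on-modification idea can be sketched with a toy class. TinyMasked and 
its mask_item method are hypothetical names invented for this illustration; 
this is not the maskedarray implementation, only the mechanism described above.

```python
import numpy as np

class TinyMasked:
    """Hypothetical toy container showing a shared-mask flag."""

    def __init__(self, data, mask):
        self._data = np.asarray(data)
        self._mask = mask          # a reference to the caller's mask, no copy
        self._sharedmask = True    # the mask is shared with the caller

    def mask_item(self, index):
        # Before the first modification, detach from the caller's mask.
        if self._sharedmask:
            self._mask = self._mask.copy()
            self._sharedmask = False
        self._mask[index] = True

m = np.array([False, False, False])
x = TinyMasked([1, 2, 3], m)
shared_before = x._mask is m   # initially the very same object

x.mask_item(0)                 # triggers the copy, then masks item 0
```

After the call, x._mask has its first entry set while m is left untouched.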