Advice on masked array implementation

We would like some advice on how to proceed with implementing masked array capabilities in a large package of climate-related analysis functions. We are in the initial stages of trying to duplicate functionality from an existing package written in a locally-developed scripting language. The existing functionality depends heavily on masked arrays (actually on variables with attributes, with "fill_value" being one such).

We have polled our user base and, while all responders plan to convert to NumPy, many have not started the conversion or are in transition. It is our experience that converting users to new ways is usually a multi-year undertaking, so it seems that we may need to support both Numeric and NumPy installations for some time to come. Even if people have converted to the NumPy version of our package, they may still be importing packages that have not been converted.

Our initial design attempts to deal with the possibility of users mixing (or using exclusively) Numeric arrays, Numeric masked arrays, NumPy arrays, and NumPy masked arrays. For example, suppose you have a function:

    result = func(arg0, arg1)

where the two arguments and return variable can be any one of the four types of arrays mentioned. Currently we are testing to see if either argument is a NumPy or Numeric masked array. If just Numeric masked arrays, then we return a Numeric masked array. If just NumPy masked arrays, then we return a NumPy masked array. If one is a Numeric masked array and the other is a NumPy masked array, then we return a NumPy masked array. Similar checking is done for using just Numeric and/or NumPy non-masked arrays. Does this seem like a reasonable approach? It is tempting just to go with NumPy, but then we will have a large class of users who cannot access the new functionality.

We have followed the discussion on the development of the new maskedarray module, but have not used it. I went to http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/maskedarray.py as referenced in a posting from Pierre GM, but I got "Internal Error." Has there been any decision as to whether maskedarray will be in NumPy version 1.1? Any estimate as to when 1.1 would be out? If we commit to numpy.core.ma now, how much trouble will it be to convert to the new maskedarray? Is there any user documentation on maskedarray and details on the differences between it and numpy.core.ma?

Thanks,
Fred Clare
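For readers following the dispatch rule described in the message above, here is a minimal sketch of how the return type might be chosen. It is not the package's actual code: the `Kind` tag and `result_kind` function are hypothetical names, each argument is represented by a tag rather than a real Numeric or NumPy array (so the example runs without Numeric installed), and the rule that mixing a masked with a non-masked argument yields a masked result is an assumption that the message only implies.

```python
from typing import NamedTuple


class Kind(NamedTuple):
    """Hypothetical tag describing one argument: its library and whether it is masked."""
    library: str  # "Numeric" or "NumPy"
    masked: bool


def result_kind(*args: Kind) -> Kind:
    """Pick the return type from the argument types.

    Rules taken from the message: all-Numeric stays Numeric, all-NumPy stays
    NumPy, and a Numeric/NumPy mix returns NumPy.
    Assumption (not stated explicitly): if any argument is masked, the
    result is masked.
    """
    library = "NumPy" if any(a.library == "NumPy" for a in args) else "Numeric"
    masked = any(a.masked for a in args)
    return Kind(library, masked)


# A Numeric masked array combined with a NumPy masked array gives a NumPy masked array.
mixed = result_kind(Kind("Numeric", True), Kind("NumPy", True))
```

In a real implementation the tags would presumably come from `isinstance` checks against whichever of the two libraries can be imported.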

On Monday 26 February 2007 14:51:42 Fred Clare wrote:
> We would like some advice on how to proceed with implementing masked array capabilities in a large package of climate-related analysis functions.
Sounds great! I'm working in basically the same field, and I needed MaskedArrays to deal with missing values in environmental series. But we can chat about that off-list.
> If one is a Numeric masked array and the other is a NumPy masked array, then we return a NumPy masked array. Similar checking is done for using just Numeric and/or NumPy non-masked arrays. Does this seem like a reasonable approach?
> We have followed the discussion on the development of the new maskedarray module, but have not used it. I went to
> http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/maskedarray.py
> as referenced in a posting from Pierre GM, but I got "Internal Error."
Yes, I had to take the package off the projects.scipy.org site when I got write access to the svn server, as that particular version was really outdated. You can find the latest version on the scipy svn server, in the sandbox: http://svn.scipy.org/svn/scipy/trunk/Lib/sandbox/maskedarray/ Note that I made some major updates a couple of weeks ago, without advertising them. A description is available at: http://svn.scipy.org/svn/scipy/trunk/Lib/sandbox/maskedarray/CHANGELOG
> Has there been any decision as to whether maskedarray will be in NumPy version 1.1?
The decision is out of my hands. My understanding is that before the new implementation can be taken seriously, more feedback is needed from actual users (and I fully agree with that). Moreover, there are some vague plans about porting it to C. My naive initial attempts with Pyrex having failed dramatically, I will have to learn C, so it probably won't happen in the next few weeks... But porting to C should solve some minor issues that I'm unhappy with for now and that I can't fix in Python without significantly degrading performance.
> Any estimate as to when 1.1 would be out? If we commit to numpy.core.ma now, how much trouble will it be to convert to the new maskedarray?
It shouldn't be much of a problem. Normally, the following should work (even if some warnings are raised):
    >>> import numpy.core.ma as ma
    >>> import maskedarray as MA
    >>> x = ma.array([1,2,3,4,5], mask=[1,0,0,1,0])
    >>> x
    array(data = [999999 2 3 999999 5],
          mask = [ True False False True False],
          fill_value=999999)
    >>> X = MA.array(x)
    >>> X
    masked_array(data = [-- 2 3 -- 5],
          mask = [ True False False True False],
          fill_value=999999)
That is, maskedarray.MaskedArray recognizes numpy.core.ma masked arrays. I tried to keep as much backward compatibility as I could, but without really testing it, so no guarantee.
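For anyone reading this thread with a current NumPy: the sandbox maskedarray package was later merged into NumPy as numpy.ma, so the session above cannot be reproduced as written. The sketch below shows the same idea (re-wrapping an existing masked array keeps its mask) with numpy.ma standing in for both modules, which is an assumption about the reader's environment rather than what was run in 2007.

```python
import numpy.ma as ma

# Build a masked array with the first and fourth elements masked.
x = ma.array([1, 2, 3, 4, 5], mask=[1, 0, 0, 1, 0], fill_value=999999)

# Wrapping an existing masked array in a new one keeps the mask.
X = ma.array(x)

print(X)                 # masked entries are shown as "--"
print(X.mask)            # boolean mask carried over from x
print(X.filled(999999))  # masked entries replaced by the given fill value
```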
> Is there any user documentation on maskedarray and details on the differences between it and numpy.core.ma?
Not at this point, unfortunately. Note that the "new" implementation follows Paul Dubois' initial code very closely. (In fact, a bit too closely for its own good. Reggie Dugard suggested some modifications, which I tried to take into account in the latest version and which seem to solve that.) Therefore, switching from numpy.core.ma to maskedarray should be relatively painless. I'd be more than happy to help you with that.

Basically, the main differences between the two implementations are:
- MaskedArray is a regular subclass of ndarray, so you can use asanyarray without losing your mask.
- Subclassing MaskedArray is far easier with the new implementation than it was with numpy.core.ma.
- The fill_value attribute is now a property.
- The _data attribute is now a view of the MaskedArray, instead of an independent object.
- The underlying _data can be any subclass of ndarray (such as matrix).
- Some of the MaskedArray methods (ravel, transpose...) are implemented through wrappers that must have a __get__ method. That works well with Python 2.4; I'm not sure it would work with 2.3.
- Some methods that were not available in numpy.core.ma are now in maskedarray, either in .core or .extras.
- There's a prototype of MaskedRecords objects, which makes it possible to mask specific fields in a recarray.

All in all, I think that the new implementation gets rid of some of the limitations of numpy.core.ma without hurting performance too badly. The latest test showed that yes, maskedarray is slightly slower than numpy.core.ma (10%), but it provides more functionality: for example, you can prevent a mask from being overwritten, it is very easy to subclass, it interacts nicely with ndarray... I'm using the new implementation systematically for my own projects (which explains why there are regularly some tweaks to the implementation), and Matt Knox and I have been using it for our common TimeSeries project without any difficulty so far.
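A few of the differences listed above can be tried directly. The sketch below again assumes a current NumPy, where the new implementation lives in numpy.ma, and it takes harden_mask to be the feature meant by "prevent a mask from being overwritten", which is my reading rather than something stated in the message.

```python
import numpy as np
import numpy.ma as ma

x = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# MaskedArray is an ndarray subclass, so asanyarray passes it through
# with its mask, while asarray converts it to a plain ndarray.
kept = np.asanyarray(x)
dropped = np.asarray(x)
print(type(kept).__name__, type(dropped).__name__)

# fill_value is a property that can be read and assigned.
x.fill_value = -999.0
print(x.filled())  # masked entry replaced by -999.0

# With a hard mask, assigning to a masked element does not unmask it.
x.harden_mask()
x[1] = 10.0
print(x.mask)
```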
Once again, please do not hesitate to contact me on or off-list if you have any questions/comments/requests.
Participants (2): Fred Clare, Pierre GM