Re: [Numpy-discussion] Indexing a masked array with another masked array leads to unexpected results
On Fri, Nov 4, 2011 at 5:26 AM, Pierre GM wrote:
On Nov 03, 2011, at 23:07 , Joe Kington wrote:
I'm not sure if this is exactly a bug, per se, but it's a very confusing consequence of the current design of masked arrays…
I would just add a "I think" between the "but" and "it's" before I could agree.
Consider the following example:
import numpy as np
x = np.ma.masked_all(10, dtype=np.float32)
print(x)
x[x > 0] = 5
print(x)
The exact results will vary depending on the contents of the empty memory the array was initialized from.
Not a surprise. But isn't it mentioned somewhere in the docs that using a masked array as an index is a very bad idea? And that you should always fill it before you use it as an array? (Actually, using a MaskedArray as an index used to raise an IndexError. But I thought it was a bit too harsh, so I dropped it.)
Not that I can find in the docs (Perhaps I just missed it?). At any rate, it's not mentioned in the numpy.ma section on indexing: http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html#indexing-... The only mention of it is a comment in MaskedArray.__setitem__ where the IndexError is commented out.
ma.masked_all is an empty array with all its elements masked. I.e., you have an uninitialized ndarray as data, and a bool array of the same size, full of True. The operative word here is "uninitialized".
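A minimal sketch of what masked_all returns (the shape and dtype here are arbitrary choices for illustration). It only inspects the mask, because the contents of the underlying data are uninitialized and cannot be predicted:

```python
import numpy as np

x = np.ma.masked_all(5, dtype=np.float32)

# Every element is masked, so there are no valid entries to count.
all_masked = bool(x.mask.all())
n_valid = int(x.count())

# x.data exposes the underlying buffer; its values are whatever
# happened to be in memory, so nothing should be read into them.
print(all_masked, n_valid)
```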
This wreaks havoc when filtering the contents of masked arrays (and leads to hard-to-find bugs!). The mask of the array in question is altered at random (or, rather, based on the masked values as well as the unmasked ones).
Once again, you're working on an *uninitialized* array. What you should really do is to initialize it first, e.g. by 0, or whatever would make sense in your field, and then work from that.
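As a rough sketch of that suggestion, the array can be built from explicitly zeroed data instead of masked_all. The choice of 0 as the initial value is only an assumption for the example; any value that makes sense for the data would do. With zeros underneath, no hidden value satisfies "x > 0", so the assignment should leave the mask alone:

```python
import numpy as np

# Same shape and dtype as the original example, but the data are
# explicitly zero rather than uninitialized; mask=True masks everything.
x = np.ma.masked_array(np.zeros(10, dtype=np.float32), mask=True)

x[x > 0] = 5

# No underlying value is greater than 0, so nothing is selected.
print(x.count())
```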
Sure, I shouldn't have used that as the example. My point was that it's counter-intuitive that something like "x[x > 0] = 0" alters the mask of x based on the values of _masked_ elements. How it's initialized is irrelevant (though, of course, it wouldn't be semi-random if it were initialized in another way).
I can see the reasoning behind the way it works. It makes sense that "x > 0" returns a masked boolean array with potentially several elements masked, as well as the unmasked elements greater than 0.
Well, "x > 0" is also a masked array, with its mask full of True. Not very usable by itself, and especially *not* for indexing.
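A small illustration of that point, using a partially masked array made up for the example: the comparison returns a MaskedArray that carries the mask of x along with it. The values stored under the mask are deliberately not inspected here, since they are not meant to be relied on:

```python
import numpy as np

x = np.ma.array([1.0, -2.0, 3.0], mask=[False, False, True])
cond = x > 0

# The result is itself a MaskedArray and carries x's mask along.
print(type(cond).__name__)
print(cond.mask)
```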
However, wouldn't it make more sense to have MaskedArray.__setitem__ only operate on the unmasked elements of the "indx" passed in (at least in the case where the assigned "value" isn't a masked array)?
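Until something like that exists, one way to get this behaviour in user code is to fill the index before using it. This is only a sketch of a workaround, not what MaskedArray.__setitem__ does; the sample values are made up. Filling the comparison with False turns it into a plain boolean ndarray in which masked positions are never selected:

```python
import numpy as np

x = np.ma.array([1.0, -2.0, 3.0, 4.0], mask=[False, False, True, False])

# Masked entries of the comparison become False, so they are never selected.
idx = (x > 0).filled(False)
x[idx] = 5

print(x)
```

The masked element keeps both its mask and its underlying value; only the unmasked elements greater than 0 are assigned.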
Normally, that should be the case. But you're not working in "normal" conditions, here. A bit like trying to boil water on a stove with a plastic pan.
"x[x > threshold] = something" is a very common idiom for ndarrays. I think most people would find it surprising that this operation doesn't ignore the masked values.
I noticed this because one of my coworkers was complaining that a piece of my code was "messing up" their masked arrays. I'd never tested it with masked arrays, but it took me ages to find, just because I wasn't looking in places where I was just using common idioms. In this particular case, they'd initialized it with "masked_all", so it effectively altered the mask of the array at random.
Regardless of how it was initialized, though, it is surprising that the mask of "x" is changed based on masked values. I just think it would be useful for it to be documented.
Cheers,
-Joe