[Numpy-discussion] Indexing a masked array with another masked array leads to unexpected results

Fri Nov 4 12:28:01 EDT 2011

On Fri, Nov 4, 2011 at 5:26 AM, Pierre GM <pgmdevlist at gmail.com> wrote:

>
> On Nov 03, 2011, at 23:07 , Joe Kington wrote:
>
> > I'm not sure if this is exactly a bug, per se, but it's a very confusing
> > consequence of the current design of masked arrays…
>
> I would just add an "I think" between the "but" and the "it's" before I
> could agree.
>
> > Consider the following example:
> >
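> > import numpy as np
> >
> > # All ten elements are masked; the underlying data is uninitialized.
> > x = np.ma.masked_all(10, dtype=np.float32)
> > print(x)
> > x[x > 0] = 5  # the index "x > 0" is itself a masked array
> > print(x)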
> >
> > The exact results will vary depending on the contents of the empty memory
> > the array was initialized from.
>
> Not a surprise. But isn't it mentioned in the docs somewhere that using a
> masked array as an index is a very bad idea? And that you should always
> fill it before you use it as an array? (Actually, using a MaskedArray as an
> index used to raise an IndexError. But I thought it was a bit too harsh, so
> I dropped it.)
>

Not that I can find in the docs (perhaps I just missed it?). At any rate,
it's not mentioned in the numpy.ma section on indexing:
http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html#indexing-and-slicing

The only mention of it is a comment in MaskedArray.__setitem__ where the
IndexError is commented out.

> ma.masked_all is an empty array with all its elements masked. I.e., you
> have an uninitialized ndarray as data, and a bool array of the same size,
> full of True. The operative word here is "uninitialized".
>
> > This wreaks havoc when filtering the contents of masked arrays (and
> > leads to hard-to-find bugs!). The mask of the array in question is altered
> > at random (or, rather, based on the masked values as well as the unmasked
> > ones).
>
> Once again, you're working on an *uninitialized* array. What you should
> really do is to initialize it first, e.g. by 0, or whatever would make
> sense in your field, and then work from that.
>

Sure, I shouldn't have used that as the example.

My point was that it's counter-intuitive that something like "x[x > 0] = 0"
alters the mask of x based on the values of _masked_ elements.  How it's
initialized is irrelevant (though, of course, it wouldn't be semi-random if
it were initialized in another way).
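
To reproduce this without relying on uninitialized memory, here is a minimal
sketch with explicit data and an explicit mask. The array values are
illustrative, and the comments describe the behaviour under discussion in
this thread, which may differ between NumPy versions:

```python
import numpy as np

# Explicit data and mask, so nothing depends on uninitialized memory.
x = np.ma.array([1.0, -2.0, 3.0, 4.0], mask=[False, False, True, True])

# The comparison returns a masked boolean array. It carries x's mask, but
# its underlying data still holds the comparison result for the masked
# elements (3.0 and 4.0 here).
cond = x > 0
print(cond.mask)
print(cond.data)

mask_before = x.mask.copy()
x[cond] = 0

# Compare the two masks: any element that went from True to False was
# selected through the data hidden behind the mask of cond.
print(mask_before)
print(x.mask)
```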

> > I can see the reasoning behind the way it works. It makes sense that
> > "x > 0" returns a masked boolean array with potentially several elements
> > masked, as well as the unmasked elements greater than 0.
>
> Well, "x > 0" is also a masked array, with its mask full of True. Not very
> usable by itself, and especially *not* for indexing.

> > However, wouldn't it make more sense to have MaskedArray.__setitem__
> > only operate on the unmasked elements of the "indx" passed in (at least in
> > the case where the assigned "value" isn't a masked array)?
>
>
> Normally, that should be the case. But you're not working in "normal"
> conditions, here. A bit like trying to boil water on a stove with a plastic
> pan.
>

"x[x > threshold] = something" is a very common idiom for ndarrays.

I think most people would find it surprising that this operation doesn't
ignore the masked values.

I noticed this because one of my coworkers was complaining that a piece of
my code was "messing up" their masked arrays.  I'd never tested it with
masked arrays, and it took me ages to find the cause, because I wasn't
looking in places where I was only using common idioms.  In this particular
case, they'd initialized the array with "masked_all", so my code effectively
altered its mask at random.  Regardless of how it was initialized, though, it
is surprising that the mask of "x" is changed based on masked values.

I just think it would be useful for it to be documented.
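
A minimal sketch of what such a note could show, assuming the advice earlier
in this thread to fill a masked array before using it as an index. The array
values are illustrative only:

```python
import numpy as np

x = np.ma.array([1.0, -2.0, 3.0, 4.0], mask=[False, False, True, True])

# Fill the masked condition with False first, so that masked elements of x
# can never be selected. The result is a plain boolean ndarray.
cond = (x > 0).filled(False)

x[cond] = 5

# Only the unmasked element greater than 0 is assigned; the mask of x and
# the data behind it are left alone.
print(x)
print(x.mask)
```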

Cheers,

-Joe