[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Thu Jun 23 21:00:07 EDT 2011

On Jun 24, 2011, at 2:42 AM, Mark Wiebe wrote:

> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
> Sorry y'all, I'm just commenting bits by bits:
> 
> "One key problem is a lack of orthogonality with other features, for instance creating a masked array with physical quantities can't be done because both are separate subclasses of ndarray. The only reasonable way to deal with this is to move the mask into the core ndarray."
> 
> Meh. I did try to make it easy to use masked arrays on top of subclasses. There are even some tests in the suite to that effect (test_subclassing). I'm not buying the argument.
> About moving the mask into the core ndarray: I had suggested back in the day to have a mask flag/property built into ndarrays (which would *really* have simplified the game), but this suggestion was dismissed very quickly as adding too much overhead. I had to agree. I'm just a tad surprised the wind has changed on that matter.
>  
> Ok, I'll have to change that section then. :)
> 
> I don't remember seeing mention of this ability in the documentation, but I may not have been reading closely enough for that part. 

Or played with it ;)

>  
> "In the current masked array, calculations are done for the whole array, then masks are patched up afterwards. This means that invalid calculations sitting in masked elements can raise warnings or exceptions even though they shouldn't, so the ufunc error handling mechanism can't be relied on."
> 
> Well, there's a reason for that. Initially, I tried to guess what the mask of the output should be from the mask of the inputs, the objective being to avoid getting NaNs in the C array. That was easy in most cases, but it turned out it wasn't always possible (the `power` one caused me a lot of issues, if I recall correctly). So, for performance reasons (to avoid a lot of expensive tests), I fell back on the old concept of "compute them all, they'll be sorted afterwards".
> Of course, that's a rather clumsy approach. But it works not too badly in pure Python. No doubt a proper C implementation would work faster.
> Oh, about using NaNs for invalid data? Well, that can't work with integers.
> 
> In my proposal, NaNs stay as unmasked NaN values, instead of turning into masked values. This is necessary for uniform treatment of all dtypes, but a subclass could override this behavior with an extra mask modification after arithmetic operations. 

No problem with that...
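
To make the two points above concrete, here is a small sketch against the existing numpy.ma (not the proposed core-ndarray masks). It assumes a reasonably recent NumPy; the variable names are only for illustration. It shows that an integer array has no slot for a NaN, and that the current ma functions compute everything first and mask the invalid results afterwards.

```python
import numpy as np
import numpy.ma as ma

# Integer arrays cannot hold NaN: the assignment is rejected.
ints = np.array([1, 2, 3])
try:
    ints[1] = np.nan
except ValueError:
    nan_rejected = True
else:
    nan_rejected = False

# A mask works for any dtype, integers included.
masked_ints = ma.array([1, 2, 3], mask=[False, True, False])

# "Compute them all, sort them out afterwards": ma.sqrt evaluates the
# whole array, then masks the entries that fell outside the domain.
roots = ma.sqrt(ma.array([-1.0, 4.0]))
roots_mask = ma.getmaskarray(roots)
```

Here the first entry of roots ends up masked rather than being left as a NaN, which is the current ma behaviour and not what the proposal describes.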

> `mask` property:
> Nothing to add to it. It's basically what we have now (except for the opposite convention).
> 
> Working with masked values:
> I recall some strong points back in the day for not using None to represent missing values...
> Adding a maskedstr argument to array2string? Mmh... I prefer a global flag like we have now.
> 
> I'm not really a fan of all the global state that NumPy keeps, I guess I'm trying to stamp that out bit by bit as well where I can... 

Pretty convenient to define a default once and for all, though.
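
For reference, the global flag in the current numpy.ma is ma.masked_print_option. A minimal sketch of setting it once and restoring it afterwards (the "N/A" string is just an example value):

```python
import numpy.ma as ma

a = ma.array([1, 2, 3], mask=[False, True, False])

# Default rendering of a masked entry.
default_text = str(a)

# Change the module-wide display string, then put the old one back.
previous = ma.masked_print_option.display()
ma.masked_print_option.set_display("N/A")
try:
    custom_text = str(a)
finally:
    ma.masked_print_option.set_display(previous)
```

The setting applies to every masked array printed while it is in effect, which is both the convenience and the global-state concern raised above.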

> Design questions:
> Adding `masked` or whatever we call it to a number/array should result in masked / a fully masked array, period. That way, we can have an idea that something was wrong with the initial dataset.
> 
> I'm not sure I understand what you mean. In the design, adding a mask means setting "a.mask = True", "a.mask = False", or "a.mask = <boolean array>" in general.

I mean that:
0 + ma.masked = ma.masked
ma.array([1,2,3], mask=False) + ma.masked = ma.array([1,2,3], mask=[True,True,True])

By extension, any operation involving a masked value should result in a masked value.
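
The two lines above are shorthand, not valid Python. A runnable version with the current numpy.ma, as I understand its behaviour, would look roughly like this:

```python
import numpy.ma as ma

# Scalar case: the result is the ma.masked singleton itself.
scalar_result = 0 + ma.masked

# Array case: every element of the result is masked.
array_result = ma.array([1, 2, 3], mask=False) + ma.masked
array_mask = ma.getmaskarray(array_result)
```

Only the mask is checked here; the underlying data of the fully masked result should not be relied on.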

> hardmask: I never used the feature myself. I wonder if anyone did. Still, it's a nice idea...
> 
> Ok, I'll leave that out of the initial design unless someone comes up with some strong use cases.

Oh, it doesn't eat bread (as we say in French), so you can leave it where it is...
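
For anyone who has not met the feature, a short sketch of what hardmask does in the current numpy.ma: with a soft mask (the default), assigning to a masked element unmasks it; with a hard mask the element stays masked.

```python
import numpy.ma as ma

soft = ma.array([1, 2, 3], mask=[False, True, False])
hard = ma.array([1, 2, 3], mask=[False, True, False], hard_mask=True)

# Same assignment on both arrays.
soft[1] = 99
hard[1] = 99

soft_mask = ma.getmaskarray(soft)  # element 1 is now unmasked
hard_mask = ma.getmaskarray(hard)  # element 1 is still masked
```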