[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 11:40:21 EDT 2011

On Thu, Jun 23, 2011 at 7:56 PM, Benjamin Root <ben.root at ou.edu> wrote:

> On Thu, Jun 23, 2011 at 7:28 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>
>> Sorry y'all, I'm just commenting bits by bits:
>>
>> "One key problem is a lack of orthogonality with other features, for
>> instance creating a masked array with physical quantities can't be done
>> because both are separate subclasses of ndarray. The only reasonable way to
>> deal with this is to move the mask into the core ndarray."
>>
>> Meh. I did try to make it easy to use masked arrays on top of subclasses.
>> There's even some tests in the suite to that effect (test_subclassing). I'm
>> not buying the argument.
>> About moving mask in the core ndarray: I had suggested back in the days to
>> have a mask flag/property built-in ndarrays (which would *really* have
>> simplified the game), but this suggestion was dismissed very quickly as
>> adding too much overload. I had to agree. I'm just a tad surprised the wind
>> has changed on that matter.
>>
>>
>> "In the current masked array, calculations are done for the whole array,
>> then masks are patched up afterwards. This means that invalid calculations
>> sitting in masked elements can raise warnings or exceptions even though they
>> shouldn't, so the ufunc error handling mechanism can't be relied on."
>>
>> Well, there's a reason for that. Initially, I tried to guess what the mask
>> of the output should be from the mask of the inputs, the objective being to
>> avoid getting NaNs in the C array. That was easy in most cases, but it
>> turned out it wasn't always possible (the `power` one caused me a lot of
>> issues, if I recall correctly). So, for performance issues (to avoid a lot
>> of expensive tests), I fell back on the old concept of "compute them all,
>> they'll be sorted afterwards".
>> Of course, that's rather clumsy an approach. But it works not too badly
>> when in pure Python. No doubt that a proper C implementation would work
>> faster.
>> Oh, about using NaNs for invalid data ? Well, can't work with integers.
>>
>> `mask` property:
>> Nothing to add to it. It's basically what we have now (except for the
>> opposite convention).
>>
>> Working with masked values:
>> I recall some strong points back in the days for not using None to
>> represent missing values...
>> Adding a maskedstr argument to array2string ? Mmh... I prefer a global
>> flag like we have now.
>>
>> Design questions:
>> Adding `masked` or whatever we call it to a number/array should result in
>> masked/a fully masked array, period. That way, we can have an idea that
>> something was wrong with the initial dataset.
>> hardmask: I never used the feature myself. I wonder if anyone did. Still,
>> it's a nice idea...
>>
>
> As a heavy masked_array user, I regret not being able to participate more
> in this discussion as I am madly cranking out matplotlib code.  I would like
> to say that I have always seen masked arrays as being the "next step up"
> from using arrays with NaNs.  The hardmask/softmask/sharedmask concepts
> are powerful, and I don't think they have yet been exploited to their
> fullest potential.
>

Do you have some examples where hardmask or sharedmask are being used? I
like the idea of using a hardmask array as the return value for boolean
indexing, but some more use cases would be nice.
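
For reference, a minimal sketch of what a hard mask does in the existing
numpy.ma, assuming its `hard_mask` keyword: assigning to a masked element
clears a soft mask but leaves a hard mask, and the underlying data, untouched.

```python
import numpy as np

# Soft mask (the default): assigning to a masked element unmasks it.
soft = np.ma.array([1, 2, 3], mask=[False, True, False])
soft[1] = 20
soft_mask_after = np.ma.getmaskarray(soft).tolist()

# Hard mask: the assignment to the masked element is ignored.
hard = np.ma.array([1, 2, 3], mask=[False, True, False], hard_mask=True)
hard[1] = 20
hard_mask_after = np.ma.getmaskarray(hard).tolist()
hard_data_after = hard.data.tolist()
```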

> Masks are (relatively) easy when dealing with element-by-element operations
> that produce an array of the same shape (or at least the same number of
> elements in the case of reshape and transpose).  What gets difficult is for
> reductions such as sum or max, etc.  Then you get into the weirder cases
> such as unwrap and gradients that I brought up recently.  I am not sure how
> to address this, but I am not a fan of the idea of adding yet another
> parameter to the ufuncs to determine what to do for filling in a mask.
>

It looks like in R there is a parameter called na.rm=T/F, which basically
means "remove NAs before doing the computation". This approach seems good to
me for reduction operations.
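
A rough sketch of the two reduction behaviours, using a hypothetical helper
`masked_sum` with a `skipna` flag (both names are assumptions for
illustration, not proposed API). `None` stands in for an NA result.

```python
import numpy as np

def masked_sum(values, valid, skipna=False):
    """Hypothetical helper: sum `values`, where `valid` marks usable elements.

    skipna=False: any missing element makes the whole result missing (None).
    skipna=True:  missing elements are dropped first, like na.rm in R.
    """
    values = np.asarray(values)
    valid = np.asarray(valid, dtype=bool)
    if not skipna and not valid.all():
        return None  # stand-in for an NA result
    return values[valid].sum()

data = np.array([1, 2, 3, 4])
valid = np.array([True, False, True, True])

strict = masked_sum(data, valid)                 # missing value propagates
relaxed = masked_sum(data, valid, skipna=True)   # 1 + 3 + 4
```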

> Also, just to make things messier, there is an incomplete feature that was
> made for record arrays with regards to masking.  The idea was to allow for
> element-by-element masking, but also allow for row-by-row (or was it
> column-by-column?) masking.  I thought it was a neat feature, and it is too
> bad that it was not finished.
>

I put this in my design, I think this would be useful too. I would call it
field by field, though many people like thinking of the struct dtype fields
as columns.
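
As an illustration of field-by-field masking, here is a small sketch using
the structured-dtype masks that numpy.ma accepts today. It only shows the
idea; it is not the design I am proposing.

```python
import numpy as np

# Two records with fields 'a' and 'b'; only field 'b' of the second
# record is masked.
rec = np.ma.masked_array(
    [(1, 2), (3, 4)],
    mask=[(False, False), (False, True)],
    dtype=[("a", int), ("b", int)],
)

# Each field carries its own mask.
mask_a = np.ma.getmaskarray(rec["a"]).tolist()
mask_b = np.ma.getmaskarray(rec["b"]).tolist()
```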

> Anyway, my opinion is that a mask should be True for a value that needs to
> be hidden.  Do not change this convention.  People coming into python
> already have to change code, a simple bit flip for them should be fine.
> Breaking existing python code is worse.
>

I'm now thinking the mask needs to be pushed away into the background, to
where it becomes an unimportant implementation detail of the system. It
deserves a long cumbersome name like "validitymask", and then the system can
use something close to R's approach with an NA-like singleton for most
operations.
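
To make the two conventions concrete, a tiny sketch: the numpy.ma mask is
True where a value is hidden, and a validity mask would simply be its
element-wise inverse. The variable name `validitymask` is only the
placeholder name from above.

```python
import numpy as np

x = np.ma.array([10, 20, 30], mask=[False, True, False])

# numpy.ma convention: True means "hidden".
hidden = np.ma.getmaskarray(x)

# Validity convention: True means "usable".
validitymask = ~hidden
```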

> I also don't see it as entirely necessary for *all* of masked arrays to be
> brought into numpy core.  Only the most important parts/hooks need  to be.
> We could then still have a masked array class that provides the finishing
> touches such as the sharing of masks and special masked related functions.
>

That's reasonable, yeah.

> Lastly, I am not entirely familiar with R, so I am also very curious about
> what this magical "NA" value is, and how it compares to how NaNs work.
> Although, Pierre brought up the very good point that NaNs wouldn't work
> anyway with integer arrays (and object arrays, etc.).
>

It's similar to NaN, but has the interpretation Nathaniel pointed out, where
there is a valid value but it's unknown. A consequence of that is that
logical_and(<boolean NA>, False) is False, something which behaves
differently from NaNs.
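
A minimal sketch of that logic in plain Python, with a hypothetical `NA`
singleton and helper functions `na_and`/`na_or` (all assumed names, not
existing NumPy API). The result is known whenever the unknown operand cannot
change it.

```python
class _NAType:
    """Hypothetical singleton: a value that exists but is unknown."""

    def __repr__(self):
        return "NA"


NA = _NAType()


def na_and(a, b):
    """Three-valued AND: a False operand decides the result on its own."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def na_or(a, b):
    """Three-valued OR: a True operand decides the result on its own."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


known = na_and(NA, False)    # False, whatever the unknown value is
unknown = na_and(NA, True)   # NA, the result depends on the unknown value
```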

-Mark

>
> Back to toiling on matplotlib,
> Ben Root
>