
Hi Stephan,

In slightly changed order:

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:

1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.
Here's how this works with existing array libraries:

- With base NumPy using NaN as a sentinel value for NA, you can get (2) with np.sum and (1) with np.nansum.
- With pandas and xarray, the default behavior is (1), and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?
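To make the NaN-sentinel comparison concrete, a quick sketch with plain NumPy (the values mirror the [1, 2, NA] example above):

```python
import numpy as np

# NaN as a sentinel for NA in a plain float array.
arr = np.array([1.0, 2.0, np.nan])

# np.sum propagates NaN: the result is "tainted" by the missing value.
tainted = np.sum(arr)      # nan

# np.nansum skips NaN: the sum ignoring missing values.
skipped = np.nansum(arr)   # 3.0

print(tainted, skipped)
```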
I think this is all correct.
Your proposal would be something like np.sum(array, where=np.ones_like(array, dtype=bool))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)
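For what it's worth, the `where` keyword on reductions does broadcast a scalar `True`, so both spellings work when the mask is held separately; a minimal sketch (the `999.0` placeholder value is mine, just to show the masked slot leaking into the full sum):

```python
import numpy as np

data = np.array([1.0, 2.0, 999.0])     # 999.0 stands in for a masked slot
mask = np.array([False, False, True])  # True means "masked"

# Sum ignoring masked values: exclude them via where=~mask.
skipna_sum = np.sum(data, where=~mask)  # 3.0

# Sum over everything: where=True broadcasts to include all elements,
# so whatever sits in the masked slot taints the result.
full_sum = np.sum(data, where=True)     # 1002.0

print(skipna_sum, full_sum)
```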
I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask propagates could depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (the default) and `skipna=False` ("contagious") reductions, which would differ in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

I have been playing with using a new `Mask(np.ndarray)` subclass for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or`, and reductions do `logical_and.reduce`). This way the `Masked` class itself can generally apply a given operation to the data and the mask separately and then combine the two results (reductions are the exception, in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical_or.reduce`.
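A toy version of the propagation rules described above, written as plain functions rather than an actual `Mask` subclass (the function names are mine, purely for illustration, not a proposed API):

```python
import numpy as np

def propagate_unary(mask):
    # Single-operand ufuncs: the mask is simply copied.
    return mask.copy()

def propagate_binary(mask1, mask2):
    # Binary operations: a result element is masked if either input is.
    return np.logical_or(mask1, mask2)

def reduce_skipna(mask, axis=None):
    # skipna-style reduction: result is masked only if *all* inputs are.
    return np.logical_and.reduce(mask, axis=axis)

def reduce_contagious(mask, axis=None):
    # "Contagious" reduction: result is masked if *any* input is.
    return np.logical_or.reduce(mask, axis=axis)

m = np.array([False, False, True])
print(reduce_skipna(m), reduce_contagious(m))   # False True
```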
I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.
Implicitly in the above, I agree with having the mask not necessarily be a plain ndarray, but something that can determine part of the action. Makes sense to generalize that to duck arrays for the reasons you give. Indeed, if we let the mask do the mask propagation as well, it might help make the implementation substantially easier (e.g., `logical_and.reduce` and `logical_or.reduce` can be super-fast on a bitmask!).
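As a rough illustration of why a bit-packed mask can make these reductions cheap, here is the idea with plain NumPy (not an actual bitmask class; note the trick below only works cleanly for mask lengths that are multiples of 8, since `packbits` zero-pads the last byte):

```python
import numpy as np

mask = np.array([False, True, False, False, True, False, False, False])

# Pack the boolean mask into bits, one uint8 per 8 elements.
packed = np.packbits(mask)

# "any element masked?" == logical_or.reduce; on the packed
# representation it is just a test for any nonzero byte.
any_masked = packed.any()

# "all elements masked?" == logical_and.reduce; on full bytes it is
# a comparison against 0xFF.
all_masked = (packed == 0xFF).all()

print(any_masked, all_masked)   # True False
```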
Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.
I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the `where` argument in a reduction). I also like that this way the mask can determine, even before the data, what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a `ufunc.at` call with a masked index argument).

It may be good to collect a few more test cases. E.g., I'd like to mask some of the astropy classes that are only very partial duck arrays: they cover only the shape aspect, yet do have some operators, and it would be nice not to feel forced to use `__array_ufunc__` for them.

All the best,

Marten