Re: [Numpy-discussion] new MaskedArray class

June 23, 2019

      On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk <
m.h.vankerkwijk@gmail.com> wrote:
...
- If reductions/aggregations default to skipping missing elements, how would
it be possible to express "NA propagating" versions, which are also useful,
if slightly less common?
I have been playing with using a new `Mask(np.ndarray)` class for the
mask, which does the actual mask propagation (i.e., all single-operand
ufuncs just copy the mask, binary operations do `logical_or` and reductions
do `logical_and.reduce`). This way the `Masked` class itself can generally
apply a given operation on the data and the masks separately and then
combine the two results (reductions are the exception in that `where` has
to be set). Your particular example here could be solved with a different
`Mask` class, for which reductions do `logical_or.reduce`.
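
To make the propagation rules above concrete, here is a minimal sketch using plain boolean arrays rather than the proposed `Mask` subclass. The names `mask_a` and `mask_b` are hypothetical, and the convention that True marks a masked element is an assumption borrowed from `np.ma`.

```python
import numpy as np

# Masks for two hypothetical operands; True marks a masked (missing) element.
mask_a = np.array([False, False, True])
mask_b = np.array([False, True, True])

# Single-operand ufunc: the result mask is a copy of the input mask.
unary_mask = mask_a.copy()

# Binary operation: an element is masked if either input is masked.
binary_mask = np.logical_or(mask_a, mask_b)

# Reduction that skips masked elements: the result is masked only when
# every contributing element is masked.
skip_mask = np.logical_and.reduce(mask_a)

# "NA propagating" reduction: the result is masked when any element is masked.
propagate_mask = np.logical_or.reduce(mask_a)
```

With one masked element out of three, the skipping reduction yields an unmasked result while the propagating one yields a masked result, which is the distinction between the two `Mask` variants.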
I think it would be much better to use duck-typing for the mask as well, if
possible, rather than a NumPy array subclass. This would facilitate using
alternative mask implementations, e.g., distributed masks, sparse masks,
bit-array masks, etc.

Are there use-cases for propagating masks separately from data? If not, it
might make sense to only define mask operations along with data, which
could be much simpler.
...
We may want to add a standard "skipna" argument on NumPy aggregations,
solely for the benefit of duck arrays (and dtypes with missing values). But
that could also be a source of confusion, especially if skipna=True refers
only to "true NA" values, not including NaN, which is used as an alias for NA
in pandas and elsewhere.
It does seem `where` should suffice, no? If one wants to be super-fancy,
we could allow it to be a callable, which, if a ufunc, gets used inside the
loop (`where=np.isfinite` would be particularly useful).
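
For comparison, a rough sketch of what `where` can do today, assuming a NumPy version in which `np.sum` accepts `where` (1.17 or later). The callable form `where=np.isfinite` is only the idea floated above, so the predicate is evaluated up front here.

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan])

# `where` takes a boolean array, so the predicate is applied to the data first.
# Passing the callable itself (where=np.isfinite) is the hypothetical extension
# discussed above and is not assumed to work.
total = np.sum(data, where=np.isfinite(data))
```

Only the finite elements enter the sum, so `total` is 3.0.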
Let me try to make the API issue more concrete. Suppose we have a
MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1)
with np.nansum and (2) with np.sum.
- With pandas and xarray, the default behavior is (1) and to get (2) you
need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1).
Maybe there isn't as strong a need for (2) as I thought?
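
A small sketch of the NumPy cases, for the example values [1, 2, NA]. The variable names are illustrative, and the masked slot holds an arbitrary placeholder value of 0.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan])  # NaN as the NA sentinel
masked = np.ma.masked_array([1, 2, 0], mask=[False, False, True])

skipping = np.nansum(values)   # (1): NaN entries are ignored
propagating = np.sum(values)   # (2): NaN taints the result
masked_total = masked.sum()    # (1): masked entries are skipped
```

Here `skipping` and `masked_total` are both 3, while `propagating` is NaN.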

Your proposal would be something like np.sum(array,
where=np.ones_like(array))? This seems rather verbose for a common
operation. Perhaps np.sum(array, where=True) would work, making use of
broadcasting? (I haven't actually checked whether this is well-defined yet.)
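
On a plain ndarray, at least, the scalar form can be checked directly. This sketch assumes `np.sum` accepts `where` (NumPy 1.17 or later) and says nothing about how a new MaskedArray class would interpret it; the mask is given an explicit boolean dtype.

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan])

# Explicit all-True selection with the same shape as the data.
explicit = np.sum(data, where=np.ones_like(data, dtype=bool))

# Scalar True broadcasts against the data and selects every element.
broadcast = np.sum(data, where=True)
```

Both calls include the NaN element, so both results are NaN.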

Stephan Hoyer