
Hi Stephan,

In slightly changed order:

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:

1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.
Here's how this works with existing array libraries:

- With base NumPy using NaN as a sentinel value for NA, you can get (2) with np.sum and (1) with np.nansum.
- With pandas and xarray, the default behavior is (1), and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?
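To make the NaN-sentinel comparison concrete, a quick sketch with plain NumPy (the values mirror the [1, 2, NA] example above):

```python
import numpy as np

# NaN as a sentinel for NA in a plain float array.
arr = np.array([1.0, 2.0, np.nan])

# np.sum propagates NaN: the result is "tainted" by the missing value.
tainted = np.sum(arr)      # nan

# np.nansum skips NaN: the sum ignoring missing values.
skipped = np.nansum(arr)   # 3.0

print(tainted, skipped)
```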
I think this is all correct.
Your proposal would be something like np.sum(array, where=np.ones_like(array, dtype=bool))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)
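For what it's worth, the `where` keyword on reductions does broadcast a scalar `True`, so both spellings work when the mask is held separately; a minimal sketch (the `999.0` placeholder value is mine, just to show the masked slot leaking into the full sum):

```python
import numpy as np

data = np.array([1.0, 2.0, 999.0])     # 999.0 stands in for a masked slot
mask = np.array([False, False, True])  # True means "masked"

# Sum ignoring masked values: exclude them via where=~mask.
skipna_sum = np.sum(data, where=~mask)  # 3.0

# Sum over everything: where=True broadcasts to include all elements,
# so whatever sits in the masked slot taints the result.
full_sum = np.sum(data, where=True)     # 1002.0

print(skipna_sum, full_sum)
```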
I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask propagates could depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (the default) and `skipna=False` ("contagious") reductions, which would differ in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

I have been playing with using a new `Mask(np.ndarray)` subclass for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or`, and reductions do `logical_and.reduce`). This way the `Masked` class itself can generally apply a given operation to the data and the mask separately and then combine the two results (reductions are the exception, in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical_or.reduce`.
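A toy version of the propagation rules described above, written as plain functions rather than an actual `Mask` subclass (the function names are mine, purely for illustration, not a proposed API):

```python
import numpy as np

def propagate_unary(mask):
    # Single-operand ufuncs: the mask is simply copied.
    return mask.copy()

def propagate_binary(mask1, mask2):
    # Binary operations: a result element is masked if either input is.
    return np.logical_or(mask1, mask2)

def reduce_skipna(mask, axis=None):
    # skipna-style reduction: result is masked only if *all* inputs are.
    return np.logical_and.reduce(mask, axis=axis)

def reduce_contagious(mask, axis=None):
    # "Contagious" reduction: result is masked if *any* input is.
    return np.logical_or.reduce(mask, axis=axis)

m = np.array([False, False, True])
print(reduce_skipna(m), reduce_contagious(m))   # False True
```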
I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.
Implicitly in the above, I agree with having the mask not necessarily be a plain ndarray, but something that can determine part of the action. Makes sense to generalize that to duck arrays for the reasons you give. Indeed, if we let the mask do the mask propagation as well, it might help make the implementation substantially easier (e.g., `logical_and.reduce` and `logical_or.reduce` can be super-fast on a bitmask!).
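As a rough illustration of why a bit-packed mask can make these reductions cheap, here is the idea with plain NumPy (not an actual bitmask class; note the trick below only works cleanly for mask lengths that are multiples of 8, since `packbits` zero-pads the last byte):

```python
import numpy as np

mask = np.array([False, True, False, False, True, False, False, False])

# Pack the boolean mask into bits, one uint8 per 8 elements.
packed = np.packbits(mask)

# "any element masked?" == logical_or.reduce; on the packed
# representation it is just a test for any nonzero byte.
any_masked = packed.any()

# "all elements masked?" == logical_and.reduce; on full bytes it is
# a comparison against 0xFF.
all_masked = (packed == 0xFF).all()

print(any_masked, all_masked)   # True False
```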
Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.
I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the `where` argument in a reduction). I also like that this way the mask can determine, even before the data, what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a `ufunc.at` call with a masked index argument).

It may be good to collect a few more test cases. E.g., I'd like to mask some of the astropy classes that are only very partial duck arrays: they cover only the shape aspect, yet do have some operators, and it would be nice not to feel forced to use `__array_ufunc__` for them.

All the best,

Marten