
On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
- If reductions/aggregations default to skipping missing elements, how is
it possible to express "NA propagating" versions, which are also useful, if slightly less common?
I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical_and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical_or.reduce`.
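To make the propagation rules above concrete, here is a minimal sketch using plain boolean arrays rather than the `Mask` subclass itself, which is not shown in this thread. It assumes that True marks a masked element (as in `np.ma`) and a NumPy version whose reductions accept `where` (1.17 or later); the variable names are purely illustrative.

```python
import numpy as np

# Assumption: True marks a masked (missing) element, as in np.ma.
data_a = np.array([1.0, 2.0, 3.0])
mask_a = np.array([False, False, True])
data_b = np.array([10.0, 20.0, 30.0])
mask_b = np.array([False, True, False])

# Single-operand ufunc: transform the data, copy the mask unchanged.
neg_data = np.negative(data_a)
neg_mask = mask_a.copy()

# Binary operation: the result is masked wherever either operand is masked.
sum_data = np.add(data_a, data_b)
sum_mask = np.logical_or(mask_a, mask_b)

# Skipping reduction: the data reduction needs where= to leave out masked
# elements; the result is masked only if every element was masked.
total = np.add.reduce(data_a, where=~mask_a)
total_mask = np.logical_and.reduce(mask_a)

# NA-propagating alternative: the result is masked if any element was masked.
propagating_mask = np.logical_or.reduce(mask_a)
```

With these inputs the skipping reduction gives an unmasked total, while the `logical_or.reduce` variant marks the result as masked, which is the "NA propagating" behaviour asked about.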
I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc. Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.
We may want to add a standard "skipna" argument on NumPy aggregations,
solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only to "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.
It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable, which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful).
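For reference, a small sketch of what the `where` approach looks like today, assuming NumPy 1.17 or later. NumPy does not accept a callable for `where`; the `sum_where` helper below is hypothetical and only emulates the idea by evaluating the callable up front, not inside the ufunc loop.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, np.inf])

# What works now: evaluate the predicate first, then pass the boolean array.
finite_sum = np.sum(x, where=np.isfinite(x))


def sum_where(a, where=True):
    """Hypothetical helper: accept a boolean array or a callable for `where`."""
    if callable(where):
        # Emulation only: the callable is applied to the whole array here,
        # whereas the suggestion is to apply it inside the reduction loop.
        where = where(a)
    return np.sum(a, where=where)


callable_sum = sum_where(x, where=np.isfinite)
```

Both `finite_sum` and `callable_sum` add up only the finite entries; the difference with a real implementation would be avoiding the temporary boolean array.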
Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.nansum and (2) with np.sum.
- With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)
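A short sketch of cases (1) and (2) with existing tools, assuming NumPy 1.17 or later for `where`. The last lines only show that `where=True` broadcasts on a plain ndarray; whether a masked container should read it as "ignore the mask" is a design question this snippet does not settle.

```python
import numpy as np

# NaN as the sentinel for NA.
values = np.array([1.0, 2.0, np.nan])
skipped = np.nansum(values)    # case (1): NaN entries are ignored
propagated = np.sum(values)    # case (2): NaN taints the result

# Current MaskedArray: the default reduction skips masked elements.
masked = np.ma.masked_array([1, 2, 0], mask=[False, False, True])
ma_total = masked.sum()        # case (1)

# On a plain ndarray, where=True broadcasts to every element,
# so it is equivalent to leaving `where` out.
plain = np.array([1, 2, 3])
same_as_default = np.sum(plain, where=True) == np.sum(plain)
```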