
I think a sensible alternative mental model for the MaskedArray class is
that all it does is forward any operations to the data it holds and separately propagate a mask: ORing masks together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to stay as agnostic as possible about the underlying data, since, e.g., setting masked values to `0` for `np.add.reduce` may or may not be the right thing to do - what if they are strings?).
+1, this sounds like the right model to me.
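To make that model concrete, here is a minimal sketch with made-up data (only `np.logical_or` and `np.add.reduce(..., where=...)` are existing NumPy API):

```python
import numpy as np

# True in `mask` means "this element is masked".
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])
other = np.array([10.0, 20.0, 30.0, 40.0])
other_mask = np.array([False, False, True, False])

# Binary operation: operate on the data, OR the masks.
result = data + other
result_mask = np.logical_or(mask, other_mask)

# Reduction: skip masked elements via `where`, staying agnostic about
# what (if anything) sits under the mask.
total = np.add.reduce(data, where=~mask)
print(total)  # 8.0
```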
One small worry is about name clashes - ideally one wants the masked array to be somewhat of a drop-in for whatever class it is masking (independently of whether it is an actual subclass of it). In this respect, `.data` is a pretty terrible name (and, yes, the cause of lots of problems for astropy's MaskedColumn - not my fault, that one!).

In my own trials, on the assumption that names that include "mask" are fair game, I've been considering a method `.unmask(fill_value=None)` which would replace both `.filled(fill_value)` and `.data` by having the default be not to fill anything (I don't see why a masked array should carry a fill value along; one might use specific strings such as 'minmax' for auto-generated cases). If wanted, one could then add `unmasked = property(unmask)`.

Aside: my sense is, at first at least, to feel as unbound as possible from the current MaskedArray - one could then use whatever emerges to try to create something that comes close to reproducing it, but only for ease of transition.
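A toy sketch of that spelling (only the names `unmask`, `fill_value`, and `unmasked` come from the paragraph above; the class body is my own illustration):

```python
import numpy as np

class Masked:
    """Illustrative stand-in for the masked class under discussion."""

    def __init__(self, data, mask):
        self._data = np.asarray(data)
        self._mask = np.asarray(mask, dtype=bool)

    def unmask(self, fill_value=None):
        # Default: hand back the underlying data without filling anything.
        if fill_value is None:
            return self._data
        filled = self._data.copy()
        filled[self._mask] = fill_value
        return filled

    unmasked = property(unmask)
```

Here `m.unmask()` would stand in for today's `.data`, and `m.unmask(0)` for `.filled(0)`.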
That said, I would still not guarantee values under the mask as part of NumPy's API. The result of computations under the mask should be considered an undefined implementation detail, sort of like integer overflow or dict iteration order pre-Python 3.7. The values may even be entirely arbitrary, e.g., in cases where the result is preallocated with `np.empty()`.
I think that is reasonable. The use cases Ben and I described are both ones where the array is being used as input for a set of computations which differ only in their mask. (Admittedly, in both our cases one could just reinitialize a masked array with the new mask; but I think we share the mental model that if I don't operate on the masked array, the data doesn't change, so I should just be able to change the mask.)
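For concreteness, a minimal illustration with today's `np.ma.MaskedArray`, which already lets one swap the mask while leaving the data untouched:

```python
import numpy as np

data = np.arange(5.0)
m = np.ma.MaskedArray(data, mask=[False] * 5)

# Same data, a different mask per computation; the data never changes.
m.mask = [True, False, False, False, True]
print(m.sum())   # 6.0  (elements 1, 2, 3)
m.mask = [False, True, True, False, False]
print(m.sum())   # 7.0  (elements 0, 3, 4)
```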
I'm less confident about the right way to handle missing elements in reductions. For example:

- Should median() also skip missing elements, even though there is no identity element?

I think so. If for mean(), std(), etc., the number of unmasked elements comes into play, I don't see why it wouldn't for median().

- If reductions/aggregations default to skipping missing elements, how would it be possible to express "NA propagating" versions, which are also useful, if slightly less common?
I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or`, and reductions do `logical_and.reduce`). This way the `Masked` class itself can generally apply a given operation to the data and the mask separately and then combine the two results (reductions are the exception, in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical_or.reduce`.

A larger issue is accumulations. Personally, I think those are basically meaningless for masked arrays, as to me the result at the position of any masked item should logically be masked; but, if so, I do not see how the ones "beyond" it could not be masked as well. Since the right answer seems at least unclear here, my sense would be to refuse the temptation to guess (i.e., the user should just explicitly fill with `ufunc.identity` if that is the right thing to do in their case).

I should add that I'm slightly torn about a similar, somewhat related issue: what should `np.minimum(a, b)` do when either a or b is masked? Currently, one just treats this as a bin-op, so the result is masked, but one could argue that this ufunc is a bit like a two-element reduction, and thus that the unmasked item should "win" by default. Possibly the answer should differ between `np.minimum` and `np.fmin` (since the two differ in how they propagate `NaN` as well - note that you don't include `fmin` and `fmax` in your coverage).
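To make the two reduction behaviours concrete, a minimal sketch of just the mask propagation (the surrounding `Mask` class machinery is omitted):

```python
import numpy as np

# True means "masked"; first row partially masked, second fully masked.
mask = np.array([[False, True, False],
                 [True,  True, True]])

# Skipping semantics (the default above): the reduced element is masked
# only if *every* contributing element was masked.
skip = np.logical_and.reduce(mask, axis=1)   # array([False,  True])

# NA-propagating semantics (the alternative Mask class): masked as soon
# as *any* contributing element is masked.
prop = np.logical_or.reduce(mask, axis=1)    # array([ True,  True])
```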
We may want to add a standard "skipna" argument on NumPy aggregations, solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only to "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.
It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful).
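For reference, a minimal example of what the existing boolean `where` already buys; the callable form is only the speculative extension above, so here the predicate is evaluated up front:

```python
import numpy as np

a = np.array([1.0, np.nan, 2.0, np.inf])

# Today the predicate is evaluated ahead of time and passed as a boolean
# array; the hypothetical callable form would run it inside the ufunc loop.
print(np.add.reduce(a, where=np.isfinite(a)))  # 3.0
```

All the best,

Marten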