[Numpy-discussion] new MaskedArray class

Sun Jun 23 09:07:29 EDT 2019

> I think a sensible alternative mental model for the MaskedArray class is
>> that all it does is forward any operations to the data it holds and
>> separately propagate a mask, ORing elements together for binary operations,
>> etc., and explicitly skipping masked elements in reductions (ideally using
>> `where` to be as agnostic as possible about the underlying data, for which,
>> e.g., setting masked values to `0` for `np.reduce.add` may or may not be
>> the right thing to do - what if they are string?).
>>
>
> +1, this sounds like the right model to me.
>

One small worry is about name clashes - ideally one wants the masked array
to be somewhat of a drop-in for whatever class it is masking (independently
of whether it is an actual subclass of it). In this respect, `.data` is a
pretty terrible name (and, yes, the cause of lots of problems for astropy's
MaskedColumn - not my fault, that one!). In my own trials, thinking that
names that include "mask" are fair game, I've been considering a function
`.unmask(fill_value=None)` which would replace both `.filled(fill_value)`
and `.data` by having the default be not to fill anything (I don't see why
a masked array should carry a fill value along; one might use specific
strings such as 'minmax' for auto-generated cases). If wanted, one could
then add `unmasked = property(unmask)`.

Aside: my sense is to, at first at least, feel as unbound as possible from
the current MaskedArray - one could then use whatever it is to try to
create something that comes close to reproducing it, but only for ease of
transition.

> That said, I would still not guarantee values under the mask as part of
> NumPy's API. The result of computations under the mask should be considered
> an undefined implementation detail, sort of like integer overflow or dict
> iteration order pre-Python 3.7. The values may even be entirely arbitrary,
> e.g., in cases where the result is preallocated with empty().
>

I think that is reasonable. The use cases Ben and I described both are ones
where the array is being used as input for a set of computations which
differ only in their mask. (Admittedly, in both our cases one could just
reinitialize a masked array with the new mask; but I think we share the
mental model of that if I don't operate on the masked array, the data
doesn't change, so I should just be able to change the mask.)

> I'm less confident about the right way to handle missing elements in
> reductions. For example:
> - Should median() also skip missing elements, even though there is no
> identity element?
>

I think so. If for mean(), std(), etc., the number of unmasked elements
comes into play, I don't see why it wouldn't for median().

- If reductions/aggregations default to skipping missing elements, how is
> it be possible to express "NA propagating" versions, which are also useful,
> if slightly less common?
>

I have been playing with using a new `Mask(np.ndarray)` class for the mask,
which does the actual mask propagation (i.e., all single-operand ufuncs
just copy the mask, binary operations do `logical_or` and reductions do
`logical.and.reduce`). This way the `Masked` class itself can generally
apply a given operation on the data and the masks separately and then
combine the two results (reductions are the exception in that `where` has
to be set). Your particular example here could be solved with a different
`Mask` class, for which reductions do `logical.or.reduce`.

A larger issue is the accumulations. Personally, I think those are
basically meaningless for masked arrays, as to me logically the result on
the position of any masked item should be masked. But, if so, I do not see
how the ones "beyond" it could not be masked as well. Since here the right
answers seems at least unclear, my sense would be to refuse the temptation
to guess (i.e., the user should just explicitly fill with ufunc.identity if
this is the right thing to do in their case).

I should add that I'm slightly torn about a similar, somewhat related
issue: what should `np.minimum(a, b)` do for the case where either a or b
is masked? Currently, one just treats this as a bin-op, so the result is
masked, but one could argue that this ufunc is a bit like a 2-element
reduction, and thus that the unmasked item should "win by default".
Possibly, the answer should be different between `np.minimum` and `np.fmin`
(since the two differ in how they propagate `NaN` as well - note that you
don't include `fmin` and `fmax` in your coverage).

We may want to add a standard "skipna" argument on NumPy aggregations,
> solely for the benefit of duck arrays (and dtypes with missing values). But
> that could also be a source of confusion, especially if skipna=True refers
> only "true NA" values, not including NaN, which is used as an alias for NA
> in pandas and elsewhere.
>

It does seem `where` should suffice, no? If one wants to be super-fancy, we
could allow it to be a callable, which, if a ufunc, gets used inside the
loop (`where=np.isfinite` would be particularly useful).

All the best,

Marten
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190623/8debfb7d/attachment.html>