Re: [Numpy-discussion] new MaskedArray class

June 23, 2019

      I think we’d need to consider separately the operation on the mask and on
the data. In my proposal, the data would always do np.sum(array,
where=~mask), while how the mask would propagate might depend on the mask
itself,

I quite like this idea, and I think Stephan’s strawman design is actually
plausible, where MaskedArray.mask is either an InvalidMask or an IgnoreMask
instance to pick between the different propagation types. Both classes
could simply have an underlying ._array attribute pointing to a duck-array
of some kind that backs their boolean data.

The second version requires that you *also* know how Mask classes work, and
how they implement +

I remain unconvinced that Mask classes should behave differently on
different ufuncs. I don’t think np.minimum(ignore_na, b) is any different
to np.add(ignore_na, b) - either both should produce b, or both should
produce ignore_na. I would lean towards producing ignore_na, and
propagation behavior differing between “ignore” and “invalid” only for
reduce / accumulate operations, where the concept of skipping an
application is well-defined.
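
A minimal sketch of how that could look, assuming hypothetical InvalidMask
and IgnoreMask classes that each wrap a boolean ._array. The method names
combine and reduce are illustrative only and are not part of any proposal in
this thread. Element-wise operations treat both mask kinds the same way;
only the reduction differs:

```python
import numpy as np


class _BaseMask:
    """Hypothetical base class; `_array` holds the boolean mask data."""

    def __init__(self, array):
        self._array = np.asarray(array, dtype=bool)

    def combine(self, other):
        # Element-wise ufuncs (np.add, np.minimum, ...) treat both mask
        # kinds identically: the result is masked wherever either input is.
        # Mixing InvalidMask with IgnoreMask is left unresolved here.
        return type(self)(self._array | other._array)


class InvalidMask(_BaseMask):
    def reduce(self, axis=None):
        # "Invalid" is contagious: one masked element masks the result.
        return type(self)(np.logical_or.reduce(self._array, axis=axis))


class IgnoreMask(_BaseMask):
    def reduce(self, axis=None):
        # "Ignore" skips masked elements: the result is masked only if
        # every element along the axis was masked.
        return type(self)(np.logical_and.reduce(self._array, axis=axis))
```

With the mask [False, True, False], InvalidMask.reduce() gives a masked
result while IgnoreMask.reduce() gives an unmasked one.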

Some possible follow-up questions that having two distinct masked types
raises:

   - what if I want my data to support both invalid and skip fields at the
   same time? sum([invalid, skip, 1]) == invalid
   - is there a use case for more than these two types of mask?
   invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting
   things to track through a calculation, possibly a dictionary of named masks.
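
As a rough illustration of the "dictionary of named masks" idea, assuming
plain boolean arrays keyed by reason (the variable names and sample values
below are made up for the example):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Hypothetical named masks; True marks an element as masked for that reason.
named_masks = {
    "invalid_due_to_reason_A": np.array([True, False, False, False]),
    "invalid_due_to_reason_B": np.array([False, False, True, False]),
}

# An element is masked overall if any of the named masks flags it.
combined = np.logical_or.reduce(list(named_masks.values()))

# For each element, the list of reasons that flagged it (empty if none).
reasons = [
    [name for name, mask in named_masks.items() if mask[i]]
    for i in range(data.size)
]

# Sum of the elements that no mask flags.
total = np.sum(data, where=~combined)
```

This only tracks the reasons per element; how they should propagate through
a calculation is exactly the open question.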

Eric

On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <shoyer@gmail.com> wrote:
...
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <
m.h.vankerkwijk@gmail.com> wrote:
...
Your proposal would be something like np.sum(array,
where=np.ones_like(array))? This seems rather verbose for a common
operation. Perhaps np.sum(array, where=True) would work, making use of
broadcasting? (I haven't actually checked whether this is well-defined yet.)
I think we'd need to consider separately the operation on the mask and
on the data. In my proposal, the data would always do `np.sum(array,
where=~mask)`, while how the mask would propagate might depend on the mask
itself, i.e., we'd have different mask types for `skipna=True` (default)
and `False` ("contagious") reductions, which differed in doing
`logical_and.reduce` or `logical_or.reduce` on the mask.
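
A small sketch of the data and mask operations described here, on plain
NumPy arrays (the sample values are made up, and the `where` argument to
np.sum needs NumPy 1.17 or later):

```python
import numpy as np

array = np.array([1.0, 2.0, 3.0])
mask = np.array([False, True, False])  # True marks a masked element

# Data: masked elements are left out of the sum.
total = np.sum(array, where=~mask)

# A scalar `where=True` broadcasts against the array, so nothing is skipped.
unmasked_total = np.sum(array, where=True)

# Mask of the result under the two propagation rules.
skipna_mask = np.logical_and.reduce(mask)      # masked only if all inputs are
contagious_mask = np.logical_or.reduce(mask)   # masked if any input is
```

Here the data sum is the same under both rules; only the mask attached to
the result differs.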
OK, I think I finally understand what you're getting at. So suppose this
is how we implement it internally. Would we really insist on a user
creating a new MaskedArray with a new mask object, e.g., with a GreedyMask?
We could add sugar for this, but certainly array.greedy_masked().sum() is
significantly less clear than array.sum(skipna=False).
I'm also a little concerned about a proliferation of MaskedArray/Mask
types. New types are significantly harder to understand than new functions
(or new arguments on existing functions). I don't know if we have enough
distinct use cases for this many types.
Are there use-cases for propagating masks separately from data? If not, it
might make sense to only define mask operations along with data, which
could be much simpler.
I had only thought about separating out the concern of mask propagation
from the "MaskedArray" class to the mask proper, but it might indeed make
things easier if the mask also did any required preparation for passing
things on to the data (such as adjusting the "where" argument in a
reduction). I also like that this way the mask can determine even before
the data what functionality is available (i.e., it could be the place from
which to return `NotImplemented` for a ufunc.at call with a masked index
argument).
You're going to have to come up with something more compelling than
"separation of concerns" to convince me that this extra Mask abstraction is
worthwhile. On its own, I think a separate Mask class would only obfuscate
MaskedArray functions.
For example, compare these two implementations of add:
def add1(x, y):
    return MaskedArray(x.data + y.data, x.mask | y.mask)
def add2(x, y):
    return MaskedArray(x.data + y.data, x.mask + y.mask)
The second version requires that you *also* know how Mask classes work,
and how they implement +. So now you need to look in at least twice as many
places to understand add() for MaskedArray objects.
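
To make the comparison concrete, here is a runnable sketch under stated
assumptions: MaskedArray below is a bare stand-in (not numpy.ma.MaskedArray),
and Mask is a hypothetical class whose + is defined as logical OR. Neither
reflects an actual implementation from this thread.

```python
import numpy as np


class Mask:
    """Hypothetical Mask type; `+` is defined here as logical OR."""

    def __init__(self, array):
        self._array = np.asarray(array, dtype=bool)

    def __add__(self, other):
        return Mask(self._array | other._array)


class MaskedArray:
    """Bare stand-in that just stores data and a mask."""

    def __init__(self, data, mask):
        self.data = np.asarray(data)
        self.mask = mask


def add1(x, y):
    # The mask is a plain boolean array; the propagation rule is visible here.
    return MaskedArray(x.data + y.data, x.mask | y.mask)


def add2(x, y):
    # The mask is a Mask object; the rule lives in Mask.__add__.
    return MaskedArray(x.data + y.data, x.mask + y.mask)
```

Reading add2 alone does not tell you how masks combine; you have to look up
Mask.__add__ as well.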
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

Eric Wieser