Re: [Numpy-discussion] new MaskedArray class

June 24, 2019

      On 6/23/19 6:58 PM, Eric Wieser wrote:
...
I think we’d need to consider separately the operation on the mask
    and on the data. In my proposal, the data would always do
    |np.sum(array, where=~mask)|, while how the mask would propagate
    might depend on the mask itself,
I quite like this idea, and I think Stephan’s strawman design is
actually plausible, where |MaskedArray.mask| is either an |InvalidMask|
or a |IgnoreMask| instance to pick between the different propagation
types. Both classes could simply have an underlying |._array| attribute
pointing to a duck-array of some kind that backs their boolean data.
The second version requires that you /also/ know how Mask classes
    work, and how they implement +
I remain unconvinced that Mask classes should behave differently on
different ufuncs. I don’t think |np.minimum(ignore_na, b)| is any
different to |np.add(ignore_na, b)| - either both should produce |b|, or
both should produce |ignore_na|. I would lean towards produxing
|ignore_na|, and propagation behavior differing between “ignore” and
“invalid” only for |reduce| / |accumulate| operations, where the concept
of skipping an application is well-defined.
Some possible follow-up questions that having two distinct masked types
raise:
* what if I want my data to support both invalid and skip fields at
    the same time? sum([invalid, skip, 1]) == invalid
  * is there a use case for more that these two types of mask?
    |invalid_due_to_reason_A|, |invalid_due_to_reason_B| would be
    interesting things to track through a calculation, possibly a
    dictionary of named masks.
Eric
General comments on the last few emails:

For now I intentionally decided not to worry about NA masks in my
implementation. I want to get a first working implementation finished
for IGNORE style.

I agree it would be nice to have some support for NA style later, either
by a new MaskedArray subclass, a ducktype'd .mask attribute, or by some
other modification. In the latter category, consider that currently the
mask is stored as a boolean (1 byte) mask. One idea I have not put much
thought into is that internally we could make the mask a uint8, so
unmasked would be "0" IGNORE mask would be 1, and NA mask would be 2.
That allows mixing of mask types. Not sure how messy it would be to
implement.

For the idea of replacing the mask by a ducktype for NA style, my
instinct is that would be tricky. Many of the MaskedArray
__array_function__ method implementations, like sort and dot and many
others, do "complicated" computations using the mask that I don't think
you could easily get to work right by simply substituting a mask
ducktype. I think those would need to be reimplemented for NA style, in
other words you would need to make a MaskedArray subclass anyway.

Cheers,
Allan
...
On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <shoyer@gmail.com
<mailto:shoyer@gmail.com>> wrote:
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk
    <m.h.vankerkwijk@gmail.com <mailto:m.h.vankerkwijk@gmail.com>> wrote:
Your proposal would be something like np.sum(array,
            where=np.ones_like(array))? This seems rather verbose for a
            common operation. Perhaps np.sum(array, where=True) would
            work, making use of broadcasting? (I haven't actually
            checked whether this is well-defined yet.)
I think we'd need to consider separately the operation on the
        mask and on the data. In my proposal, the data would always do
        `np.sum(array, where=~mask)`, while how the mask would propagate
        might depend on the mask itself, i.e., we'd have different mask
        types for `skipna=True` (default) and `False` ("contagious")
        reductions, which differed in doing `logical_and.reduce` or
        `logical_or.reduce` on the mask.
OK, I think I finally understand what you're getting at. So suppose
    this this how we implement it internally. Would we really insist on
    a user creating a new MaskedArray with a new mask object, e.g., with
    a GreedyMask? We could add sugar for this, but certainly
    array.greedy_masked().sum() is significantly less clear than
    array.sum(skipna=False).
I'm also a little concerned about a proliferation of
    MaskedArray/Mask types. New types are significantly harder to
    understand than new functions (or new arguments on existing
    functions). I don't know if we have enough distinct use cases for
    this many types.
Are there use-cases for propagating masks separately from
            data? If not, it might make sense to only define mask
            operations along with data, which could be much simpler.
I had only thought about separating out the concern of mask
        propagation from the "MaskedArray" class to the mask proper, but
        it might indeed make things easier if the mask also did any
        required preparation for passing things on to the data (such as
        adjusting the "where" argument in a reduction). I also like that
        this way the mask can determine even before the data what
        functionality is available (i.e., it could be the place from
        which to return `NotImplemented` for a ufunc.at
        <http://ufunc.at> call with a masked index argument).
You're going to have to come up with something more compelling than
    "separation of concerns" to convince me that this extra Mask
    abstraction is worthwhile. On its own, I think a separate Mask class
    would only obfuscate MaskedArray functions.
For example, compare these two implementations of add:
def  add1(x, y):
        return MaskedArray(x.data + y.data,  x.mask | y.mask)
def  add2(x, y):
        return MaskedArray(x.data + y.data,  x.mask + y.mask)
The second version requires that you *also* know how Mask classes
    work, and how they implement +. So now you need to look in at least
    twice as many places to understand add() for MaskedArray objects.
    _______________________________________________
    NumPy-Discussion mailing list
    NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>
    https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] new MaskedArray class

Allan Haldane