
On 6/23/19 6:58 PM, Eric Wieser wrote:
I think we’d need to consider separately the operation on the mask and on the data. In my proposal, the data would always do |np.sum(array, where=~mask)|, while how the mask would propagate might depend on the mask itself,
I quite like this idea, and I think Stephan’s strawman design is actually plausible, where |MaskedArray.mask| is either an |InvalidMask| or a |IgnoreMask| instance to pick between the different propagation types. Both classes could simply have an underlying |._array| attribute pointing to a duck-array of some kind that backs their boolean data.
The second version requires that you /also/ know how Mask classes work, and how they implement +
I remain unconvinced that Mask classes should behave differently on different ufuncs. I don’t think |np.minimum(ignore_na, b)| is any different to |np.add(ignore_na, b)| - either both should produce |b|, or both should produce |ignore_na|. I would lean towards produxing |ignore_na|, and propagation behavior differing between “ignore” and “invalid” only for |reduce| / |accumulate| operations, where the concept of skipping an application is well-defined.
Some possible follow-up questions that having two distinct masked types raise:
* what if I want my data to support both invalid and skip fields at the same time? sum([invalid, skip, 1]) == invalid * is there a use case for more that these two types of mask? |invalid_due_to_reason_A|, |invalid_due_to_reason_B| would be interesting things to track through a calculation, possibly a dictionary of named masks.
Eric
General comments on the last few emails: For now I intentionally decided not to worry about NA masks in my implementation. I want to get a first working implementation finished for IGNORE style. I agree it would be nice to have some support for NA style later, either by a new MaskedArray subclass, a ducktype'd .mask attribute, or by some other modification. In the latter category, consider that currently the mask is stored as a boolean (1 byte) mask. One idea I have not put much thought into is that internally we could make the mask a uint8, so unmasked would be "0" IGNORE mask would be 1, and NA mask would be 2. That allows mixing of mask types. Not sure how messy it would be to implement. For the idea of replacing the mask by a ducktype for NA style, my instinct is that would be tricky. Many of the MaskedArray __array_function__ method implementations, like sort and dot and many others, do "complicated" computations using the mask that I don't think you could easily get to work right by simply substituting a mask ducktype. I think those would need to be reimplemented for NA style, in other words you would need to make a MaskedArray subclass anyway. Cheers, Allan
On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <shoyer@gmail.com <mailto:shoyer@gmail.com>> wrote:
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com <mailto:m.h.vankerkwijk@gmail.com>> wrote:
Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)
I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.
OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False).
I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types.
Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.
I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at <http://ufunc.at> call with a masked index argument).
You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions.
For example, compare these two implementations of add:
def add1(x, y): return MaskedArray(x.data + y.data, x.mask | y.mask)
def add2(x, y): return MaskedArray(x.data + y.data, x.mask + y.mask)
The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion