
On Mon, Jun 24, 2019 at 5:36 PM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer@gmail.com> wrote:
On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:
I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.
The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.
This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks.
I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than of the data itself. There is a spectrum of ways to handle missing data, and the right way to propagate missing values is often highly context dependent. The right place to set this is in the functions where operations are defined, not on classes that may be defined far away from where the computation happens. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count."
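For concreteness, this is roughly how the min_count option behaves on a pandas reduction like sum:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan])
    s.sum()             # 1.0 -- IGNORE semantics: NaNs are skipped
    s.sum(min_count=2)  # nan -- fewer than 2 valid values, so the result is missing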
Anything that specific is probably indeed outside the purview of a MaskedArray class.
I agree that it doesn't make much sense to have a "min_count" attribute on a MaskedArray class, but certainly it makes sense for operations on MaskedArray objects, e.g., to write something like masked_array.mean(min_count=10). This is what users do in pandas today.
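As a rough sketch of what that could look like on top of the current np.ma machinery (the helper name and signature below are just illustrative, not an existing API):

    import numpy as np

    def mean_with_min_count(a, axis=None, min_count=1):
        # Illustrative helper: average over unmasked values, but mask the
        # result wherever fewer than `min_count` values contributed.
        a = np.ma.asanyarray(a)
        n = a.count(axis=axis)          # number of unmasked elements per slice
        result = a.mean(axis=axis)      # IGNORE semantics: skip masked values
        return np.ma.masked_where(n < min_count, result)

    x = np.ma.masked_invalid([[1.0, np.nan, 3.0],
                              [np.nan, np.nan, 6.0]])
    print(mean_with_min_count(x, axis=1, min_count=2))  # [2.0 --]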
But your general point is well taken: we really need to be clear about what the mask means, not in terms of operations but conceptually.
Personally, I guess like Benjamin I have mostly thought of it as "data here is bad" (because corrupted, etc.) or "data here is irrelevant" (because of sea instead of land in a map). And I would like to proceed nevertheless with calculating things on the remainder. For an expectation value (or, less obviously, a minimum or maximum), this is mostly OK: just ignore the masked elements. But even for something as simple as a sum, what is correct is not obvious: if I ignore the count, I'm effectively assuming the expectation is symmetric around zero (this is why `vector.dot(vector)` fails); a better estimate would be `np.add.reduce(data, where=~mask) * N(total) / N(unmasked)`.
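To make the sum example concrete (with made-up data), here is the difference between the plain "ignore" sum and the rescaled estimate:

    import numpy as np

    data = np.array([1.0, 2.0, 3.0, 4.0])
    mask = np.array([False, True, False, True])   # True = element is masked

    # Plain "ignore" sum: effectively treats the masked entries as 0.
    ignore_sum = np.add.reduce(data, where=~mask)           # 4.0

    # Rescaled estimate: assume masked entries look like the unmasked ones.
    n_total, n_unmasked = mask.size, np.count_nonzero(~mask)
    scaled_sum = ignore_sum * n_total / n_unmasked          # 8.0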
I think it's fine and logical to define default semantics for operations on MaskedArray objects. Much of the time, replacing masked values with 0 is the right thing to do for sum. Certainly IGNORE semantics are more useful overall than NA semantics. But even if a MaskedArray conceptually always represents "bad" or "irrelevant" data, the way to handle those missing values will differ based on the use case, and not everything will fall cleanly into either the IGNORE or the NA bucket. I think it makes sense to provide users with functions/methods that expose these options, rather than requiring that they convert their data into a different type of MaskedArray. "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." —Alan Perlis https://stackoverflow.com/questions/6016271/why-is-it-better-to-have-100-fun...
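In that spirit, a per-operation policy on a single MaskedArray type could look something like the sketch below (the `missing=` keyword and the function itself are purely illustrative):

    import numpy as np

    def masked_sum(a, axis=None, missing="ignore"):
        # Illustrative only: one MaskedArray type, with the propagation
        # behaviour chosen per operation rather than baked into the class.
        a = np.ma.asanyarray(a)
        if missing == "ignore":        # skip masked elements
            return a.sum(axis=axis)
        if missing == "propagate":     # any masked element masks the result
            result = a.sum(axis=axis)
            any_masked = np.ma.getmaskarray(a).any(axis=axis)
            return np.ma.masked_where(any_masked, result)
        raise ValueError(f"unknown missing-value policy: {missing!r}")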