
On Mon, Jun 24, 2019 at 5:36 PM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer@gmail.com> wrote:
On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:
I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.
The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.
This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks.
I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than of the data itself. There is a spectrum of ways to handle missing data, and the right way to propagate missing values is often highly context dependent. The right place to set this is in the functions where operations are defined, not on classes that may be defined far away from where the computation happens. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count."
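For concreteness, this is roughly how the min_count option behaves on a pandas reduction like sum:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan])
    s.sum()             # 1.0 -- IGNORE semantics: NaNs are skipped
    s.sum(min_count=2)  # nan -- fewer than 2 valid values, so the result is missing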
Anything that specific is probably indeed outside the purview of a MaskedArray class.
I agree that it doesn't make much sense to have a "min_count" attribute on a MaskedArray class, but certainly it makes sense for operations on MaskedArray objects, e.g., to write something like masked_array.mean(min_count=10). This is what users do in pandas today.
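As a rough sketch of what that could look like on top of the current np.ma machinery (the helper name and signature below are just illustrative, not an existing API):

    import numpy as np

    def mean_with_min_count(a, axis=None, min_count=1):
        # Illustrative helper: average over unmasked values, but mask the
        # result wherever fewer than `min_count` values contributed.
        a = np.ma.asanyarray(a)
        n = a.count(axis=axis)          # number of unmasked elements per slice
        result = a.mean(axis=axis)      # IGNORE semantics: skip masked values
        return np.ma.masked_where(n < min_count, result)

    x = np.ma.masked_invalid([[1.0, np.nan, 3.0],
                              [np.nan, np.nan, 6.0]])
    print(mean_with_min_count(x, axis=1, min_count=2))  # [2.0 --]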
But your general point is well taken: we really need to be clear about what the mask means, not in terms of operations but conceptually.
Personally, I guess like Benjamin I have mostly thought of it as "data here is bad" (because corrupted, etc.) or "data here is irrelevant" (because of sea instead of land in a map). And I would like to proceed nevertheless with calculating things on the remainder. For an expectation value (or, less obviously, a minimum or maximum), this is mostly OK: just ignore the masked elements. But even for something as simple as a sum, what is correct is not obvious: if I ignore the count, I'm effectively assuming the expectation is symmetric around zero (this is why `vector.dot(vector)` fails); a better estimate would be `np.add.reduce(data, where=~mask) * N(total) / N(unmasked)`.
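To make the sum example concrete (with made-up data), here is the difference between the plain "ignore" sum and the rescaled estimate:

    import numpy as np

    data = np.array([1.0, 2.0, 3.0, 4.0])
    mask = np.array([False, True, False, True])   # True = element is masked

    # Plain "ignore" sum: effectively treats the masked entries as 0.
    ignore_sum = np.add.reduce(data, where=~mask)           # 4.0

    # Rescaled estimate: assume masked entries look like the unmasked ones.
    n_total, n_unmasked = mask.size, np.count_nonzero(~mask)
    scaled_sum = ignore_sum * n_total / n_unmasked          # 8.0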
I think it's fine and logical to define default semantics for operations on MaskedArray objects. Much of the time, replacing masked values with 0 is the right thing to do for sum. Certainly IGNORE semantics are more useful overall than NA semantics. But even if a MaskedArray conceptually always represents "bad" or "irrelevant" data, the way to handle those missing values will differ based on the use case, and not everything will fall cleanly into either the IGNORE or the NA bucket. I think it makes sense to provide users with functions/methods that expose these options, rather than requiring that they convert their data into a different type of MaskedArray. "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." —Alan Perlis https://stackoverflow.com/questions/6016271/why-is-it-better-to-have-100-fun...
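In that spirit, a per-operation policy on a single MaskedArray type could look something like the sketch below (the `missing=` keyword and the function itself are purely illustrative):

    import numpy as np

    def masked_sum(a, axis=None, missing="ignore"):
        # Illustrative only: one MaskedArray type, with the propagation
        # behaviour chosen per operation rather than baked into the class.
        a = np.ma.asanyarray(a)
        if missing == "ignore":        # skip masked elements
            return a.sum(axis=axis)
        if missing == "propagate":     # any masked element masks the result
            result = a.sum(axis=axis)
            any_masked = np.ma.getmaskarray(a).any(axis=axis)
            return np.ma.masked_where(any_masked, result)
        raise ValueError(f"unknown missing-value policy: {missing!r}")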