
On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer@gmail.com> wrote:
> On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:
>> I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.
>> The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.
> This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks.
> I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than the data itself. There is a spectrum of ways to handle missing data, and the right way to propagate missing values is often highly context dependent. The right place to set this is in functions where operations are defined, not on classes that may be defined far away from where the computation happens. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count."
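
For illustration, here is a minimal sketch of how a "min_count"-style argument can sit between "IGNORE" and "NA" semantics. The function `masked_mean` is hypothetical and exists in neither NumPy nor pandas; only the parameter name is borrowed from the pandas parameter mentioned above, and the data are invented.

```python
import numpy as np

def masked_mean(data, mask, min_count=1):
    """Mean of the unmasked elements, or NaN if fewer than min_count remain.

    Hypothetical helper for illustration only; not a NumPy or pandas API.
    """
    valid = ~mask
    count = np.count_nonzero(valid)
    if count < min_count:
        # Too few usable data points: propagate "missing" to the result.
        return np.nan
    # Sum only the unmasked elements, then divide by how many there were.
    return np.add.reduce(data, where=valid) / count

# Invented example data; True in the mask marks an element to leave out.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, False, True, True])

print(masked_mean(data, mask, min_count=1))          # 1.5 ("IGNORE"-like)
print(masked_mean(data, mask, min_count=3))          # nan (too few points)
print(masked_mean(data, mask, min_count=data.size))  # nan ("NA"-like)
```

In this sketch, `min_count=1` behaves like "IGNORE" as long as at least one element is unmasked, while `min_count=data.size` behaves like "NA", since any masked element then makes the result missing.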
Anything as specific as that is probably indeed outside the purview of a MaskedArray class. But your general point is well taken: we really need to ask clearly what the mask means, not in terms of operations but conceptually.

Personally, I guess like Benjamin I have mostly thought of it as "data here is bad" (because corrupted, etc.) or "data here is irrelevant" (because of sea instead of land in a map). And I would like to proceed nevertheless with calculating things on the remainder. For an expectation value (or, less obviously, a minimum or maximum), this is mostly OK: just ignore the masked elements. But even for something as simple as a sum, what is correct is not obvious: if I ignore the count, I'm effectively assuming the expectation is symmetric around zero (this is why `vector.dot(vector)` fails); a better estimate would be `np.add.reduce(data, where=~mask) * N(total) / N(unmasked)`.

Of course, the logical conclusion would be that this is not possible to do without guidance from the user, or, thinking more, that really a masked array is not at all what I want for this problem; really I am just using (1-mask) as a weight, and the sum of the weights matters, so I should have a WeightArray class where that is returned along with the sum of the data (or, a bit less extreme, a `CountArray` class, or, more extreme, a value and its uncertainty - heck, sounds a lot like my Variable class from 4 years ago, https://github.com/astropy/astropy/pull/3715, which even takes care of covariance [following the Uncertainty package]).

OK, it seems I've definitely worked myself into a corner tonight where I'm not sure any more what a masked array is good for in the first place... I'll stop for the day!

All the best,

Marten
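
As an illustration of the rescaled sum discussed in this message, a minimal sketch follows. `N(total)` and `N(unmasked)` are shorthand in the message rather than NumPy functions; the sketch assumes they mean the total number of elements and the number of unmasked elements. The data are invented.

```python
import numpy as np

# Invented example data; True in the mask marks an element to leave out.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, False, False, True])

# "IGNORE"-style sum: masked elements simply contribute nothing,
# which implicitly treats each of them as zero.
ignored_sum = np.add.reduce(data, where=~mask)

# Rescaled estimate: treat the unmasked elements as a sample of the
# full array and scale the partial sum by N(total) / N(unmasked).
n_total = mask.size
n_unmasked = np.count_nonzero(~mask)
rescaled_sum = ignored_sum * n_total / n_unmasked

print(ignored_sum)   # 6.0
print(rescaled_sum)  # 8.0
```

The rescaled value equals the mean of the unmasked elements times the total count, so it is only as good as the assumption that the masked elements resemble the unmasked ones.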