
On 6/24/19 3:09 PM, Marten van Kerkwijk wrote:
Hi Allan,
Thanks for bringing up noclobber explicitly (and Stephan for asking for clarification; I was similarly confused).
It does clarify the difference in mental picture. In mine, the operation would indeed be guaranteed to be done on the underlying data, without a copy and without `.filled(...)`. I should clarify further that I use `where` only to skip reading elements (i.e., in reductions), not to skip writing them (as you mention, an unwritten element will often be nonsense, e.g., have the wrong units, which to me is worse than infinity or something similar; I've not worried at all about runtime warnings). Note that my main reason here is not that I'm against filling with numbers for numerical arrays, but rather that I want to make minimal assumptions about the underlying data itself. This may well be a mistake (but I want to find out where it breaks).
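To make the "skip reading, not writing" idea concrete, here is a minimal plain-NumPy sketch (the variable names are mine, not from any actual MaskedArray implementation): a reduction with `where=` never reads the masked slots at all, so no fill value is needed.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])  # True marks a masked element

# Sum only the unmasked elements; the masked slot is never read.
total = np.add.reduce(data, where=~mask, initial=0.0)
# total == 1.0 + 3.0 + 4.0 == 8.0
```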
Anyway, it would seem in many ways all the better that our models are quite different. I definitely see the advantages of your choice to decide one can do with masked data elements whatever is logical ahead of an operation!
Thanks also for bringing up a useful example with `np.dot(m, m)` - clearly, I didn't yet get beyond overriding ufuncs!
In my mental model, where I'd apply `np.dot` to the data and the mask separately, the result would be wrong, so the mask has to be set (which it would be). For your specific example, that might not be the best solution, but when using `np.dot(matrix_shaped, matrix_shaped)`, I think it does give the correct masking: any masked element in a matrix had better propagate to all parts that it influences, even if a reduction of sorts is happening. So, perhaps a price to pay for a function that tries to do multiple things.
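A small sketch of this propagating model for a matrix product (all names here are mine): element (i, j) of `data @ data` uses row i and column j of the inputs, so it must be masked whenever either contains a masked element.

```python
import numpy as np

data = np.array([[1., 1., 1.],
                 [1., 9., 1.],   # 9. is whatever nonsense sits under the mask
                 [1., 1., 1.]])
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True

# Output (i, j) is masked if row i or column j of the mask has any True:
result_mask = mask.any(axis=1)[:, np.newaxis] | mask.any(axis=0)
# here the whole middle row and middle column end up masked (5 elements)
```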
The alternative solution in my model would be to replace `np.dot` with a masked-specific implementation of what `np.dot` is supposed to stand for (in your simple example, `np.add.reduce(np.multiply(m, m))`; more generally, add relevant `outer` and `axes`). This would be similar to what I think all implementations do for `.mean()`: we cannot calculate that from the data using any fill value or skipping, so instead we use a more easily handled `.sum()` and divide by a suitable number of elements. But in both examples the disadvantage is that we take away the option to use the underlying class's `.dot()` or `.mean()` implementations.
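For the simple 1-D case, the decomposition into basic ufunc calls can be sketched in plain NumPy (no masked class involved; the mask is handled explicitly here):

```python
import numpy as np

data = np.array([1., 2., 3., 4.])
mask = np.array([False, True, False, False])

# np.dot(m, m) as ufunc building blocks: elementwise multiply, then a
# reduction that skips the masked products.
prod = np.multiply(data, data)
dot = np.add.reduce(prod, where=~mask)
# dot == 1 + 9 + 16 == 26

# The same pattern underlies .mean(): a mask-aware sum over the unmasked count.
mean = np.add.reduce(data, where=~mask) / np.count_nonzero(~mask)
# mean == (1 + 3 + 4) / 3
```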
Just to note, my current implementation uses the IGNORE style of mask, so does not propagate the mask in np.dot:

>>> a = MaskedArray([[1, 1, 1], [1, X, 1], [1, 1, 1]])
>>> np.dot(a, a)
MaskedArray([[3, 2, 3],
             [2, 2, 2],
             [3, 2, 3]])

I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior. The behavior you described further above, where the output row/col would be masked, corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.

best,
Allan
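(The IGNORE-style result can be reproduced in plain NumPy: each masked element simply contributes nothing to the sums, i.e. it acts like 0 under the dot. `filled` below is a stand-in for a hypothetical `.filled(0)`, not the actual MaskedArray API.)

```python
import numpy as np

data = np.ones((3, 3), dtype=int)
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True                    # the X in the transcript

filled = np.where(mask, 0, data)     # masked element contributes 0
result = np.dot(filled, filled)
# result == [[3, 2, 3],
#            [2, 2, 2],
#            [3, 2, 3]]
```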
(Aside: considerations such as these underlie my longed-for exposure of standard implementations of functions in terms of basic ufunc calls.)
Another example of a function for which I think my model is not particularly insightful (and for which it is difficult to know what to do generally) is `np.fft.fft`. Since an fft is equivalent to a fit of sines and cosines to the data points, the answer for masked data is in principle quite well-defined. But much less easy to implement!
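As a rough illustration of the "fft as a sine/cosine fit" idea (the approach and names below are mine, and the basis is deliberately truncated so the fit stays well-posed with samples missing): least-squares-fit a real Fourier basis to only the unmasked samples.

```python
import numpy as np

n = 16
t = np.arange(n)
x = np.cos(2 * np.pi * 3 * t / n)    # a pure frequency-3 cosine
mask = np.zeros(n, dtype=bool)
mask[5] = True                        # one masked sample

# Design matrix: constant plus cos/sin pairs for frequencies 1..4.
cols = [np.ones(n)]
for k in range(1, 5):
    cols.append(np.cos(2 * np.pi * k * t / n))
    cols.append(np.sin(2 * np.pi * k * t / n))
A = np.column_stack(cols)

# Fit using only the unmasked rows; the masked sample is never read.
coef, *_ = np.linalg.lstsq(A[~mask], x[~mask], rcond=None)
# the frequency-3 cosine coefficient comes out as 1, despite the mask
```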
All the best,
Marten
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion