Re: [Numpy-discussion] new MaskedArray class

June 24, 2019

      On 6/24/19 12:16 PM, Stephan Hoyer wrote:
...
On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane <allanhaldane@gmail.com> wrote:
 1. Making a "no-clobber" guarantee on the underlying data
Hi Allan -- could you kindly clarify what you mean by "no-clobber"?
Is this referring to allowing masked arrays to mutate masked data values
in-place, even on apparently non-in-place operators? If so, that
definitely seems like a bad idea to me. I would much rather do an
unnecessary copy than have surprisingly non-thread-safe operations.
Yes. In my current implementation, the operation:

     >>> a = np.arange(6)
     >>> m = MaskedArray(a, mask=a < 3)
     >>> res = np.dot(m, m)

will clobber elements of a. It appears that to avoid clobbering we will
need to have np.dot make a copy. I also discuss how my implementation
clobbers views in the docs:

https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md...

I expect I could be convinced to make a no-clobber guarantee, if others
agree it is better to accept the performance loss by making a copy
internally.
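
To make the trade-off concrete, here is a minimal sketch using plain NumPy
arrays. It is not the MaskedArray implementation discussed above; the two
helpers, dot_clobbering and dot_no_clobber, are hypothetical names that only
contrast filling masked slots in place with filling them in a temporary copy.

```python
import numpy as np


def dot_clobbering(data, mask):
    # Hypothetical helper: overwrite the masked slots in place, then reuse
    # the caller's buffer. No extra allocation, but `data` is modified.
    data[mask] = 0
    return np.dot(data, data)


def dot_no_clobber(data, mask):
    # Hypothetical helper: np.where allocates a new array, so the caller's
    # `data` is left untouched at the cost of one temporary copy.
    filled = np.where(mask, 0, data)
    return np.dot(filled, filled)


a = np.arange(6)
mask = a < 3
b = a.copy()

safe = dot_no_clobber(a, mask)   # `a` keeps its original values
fast = dot_clobbering(b, mask)   # the masked elements of `b` are overwritten
```

Both calls return the same result; the only difference is whether the input
buffer survives unchanged.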

I still have a hard time thinking of cases where clobbering is really
that confusing, or where it cannot easily be avoided by the user making
an explicit copy. I like giving the user control over whether a copy is
made or not, since I expect in the vast majority of cases a copy is
unnecessary.

I think it will be rare usage for people to hold on to the data array
("a" in the example above). Most of the time you create the MaskedArray
on data created on the spot which you never touch directly again. We are
all already used to numpy's "view" behavior (e.g., for the np.array
function), where if you don't explicitly make a copy of your original
array you can expect further operations to modify it. Admittedly for
MaskedArray it's a bit different since apparently read-only operations
like np.dot can clobber, but again, it doesn't seem hard to know about
or burdensome to avoid by explicit copy, and it can give big performance
advantages.
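
As a reminder of the view semantics referred to above, here is a short sketch
with plain ndarrays. Basic slicing is used as the example of an operation
that returns a view; the variable names are illustrative only.

```python
import numpy as np

original = np.arange(6)

# Basic slicing returns a view that shares memory with `original`.
view = original[2:]
view[0] = 99                      # also changes original[2]

# An explicit copy owns its own buffer, so writes do not propagate.
independent = original.copy()
independent[0] = -1               # original[0] is unaffected

shares = np.shares_memory(view, original)
```

The explicit .copy() is the step a user would add to protect an array they
intend to keep using.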
...
    If we agree that masked values will contain nonsense, it seems like a
    bad idea for those values to be easily exposed.
    Further, in all the comments so far I have not seen an example of a need
    for unmasking that is not more easily, efficiently and safely achieved
    by simply creating a new MaskedArray with a different mask.
My understanding is that essentially every low-level MaskedArray
function is implemented by looking at the data and mask separately. If
so, we should definitely expose this API directly to users (as part of
the public API for MaskedArray), so they can write their own efficient
algorithms.
As a concrete example, suppose I wanted to implement a low-level
"grouped mean" operation for masked arrays like that found in pandas.
This isn't a builtin NumPy function, so I would need to write this
myself. This would be relatively straightforward to do in Numba or
Cython with raw NumPy arrays (e.g., see my example here for a NaN-skipping
version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py),
but to do it efficiently you definitely don't want to make an
unnecessary copy.
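
A rough sketch of what such a grouped mean could look like when the data and
mask are available as two separate plain arrays. The function name
grouped_mean, its signature, and the convention that True in the mask means
"masked" are assumptions for illustration; this is not the numbagg code
linked above. The loop is written in plain Python for clarity, and it is the
kind of loop one would typically compile with Numba or Cython.

```python
import numpy as np


def grouped_mean(data, mask, labels, num_groups):
    # Accumulate per-group sums and counts, reading `data` and `mask`
    # directly so that no filled copy of the data is created.
    sums = np.zeros(num_groups)
    counts = np.zeros(num_groups, dtype=np.int64)
    for i in range(data.shape[0]):
        if not mask[i]:               # assumption: True means "masked"
            sums[labels[i]] += data[i]
            counts[labels[i]] += 1

    # Groups with no unmasked elements get NaN instead of dividing by zero.
    out = np.full(num_groups, np.nan)
    nonempty = counts > 0
    out[nonempty] = sums[nonempty] / counts[nonempty]
    return out


data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mask = np.array([False, True, False, False, False, True])
labels = np.array([0, 0, 1, 1, 2, 2])

means = grouped_mean(data, mask, labels, num_groups=4)
```

The input arrays are only read, never written, so the masked values in data
are neither copied nor clobbered.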
The usual reason for hiding implementation details is when we want to
reserve the right to change them. But if we're sure about the data model
(which I think we are for MaskedArray) then I think there's a lot of
value in exposing it directly to users, even if it's lower level than is
appropriate to use in most cases.
Fair enough, I think it is all right to allow people access to ._data
and make some guarantees about it if they are implementing subclasses or
defining new ducktypes.

There should be a section in the documentation describing what
guarantees we make about ._data (or ._array if we change the name) and
how/when to use it.

Best,
Allan
...