
On 6/24/19 12:16 PM, Stephan Hoyer wrote:
On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane <allanhaldane@gmail.com> wrote:
1. Making a "no-clobber" guarantee on the underlying data
Hi Allan -- could you kindly clarify what you mean by "no-clobber"?
Is this referring to allowing masked arrays to mutate masked data values in-place, even on apparently non-in-place operators? If so, that definitely seems like a bad idea to me. I would much rather do an unnecessary copy than have surprisingly non-thread-safe operations.
Yes. In my current implementation, the operation:

    >>> a = np.arange(6)
    >>> m = MaskedArray(a, mask=a < 3)
    >>> res = np.dot(m, m)

will clobber elements of a. It appears that to avoid clobbering we will need to have np.dot make a copy. I also discuss how my implementation clobbers views in the docs: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md...

I expect I could be convinced to make a no-clobber guarantee, if others agree it is better to accept the performance loss by making a copy internally. I just still have a hard time thinking of cases where clobbering is really that confusing, or not easily avoidable by the user making an explicit copy. I like giving the user control over whether a copy is made or not, since I expect in the vast majority of cases a copy is unnecessary.

I think it will be rare usage for people to hold on to the data array ("a" in the example above). Most of the time you create the MaskedArray on data created on the spot which you never touch directly again. We are all already used to numpy's "view" behavior (e.g., for the np.array function), where if you don't explicitly make a copy of your original array you can expect further operations to modify it. Admittedly for MaskedArray it's a bit different since apparently read-only operations like np.dot can clobber, but again, it doesn't seem hard to know about or burdensome to avoid by explicit copy, and can give big performance advantages.
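To make the trade-off concrete, here is a minimal sketch in plain NumPy. It is not the ndarray_ducktypes implementation; the helpers dot_clobbering and dot_no_clobber are hypothetical names, and the sketch assumes that clobbering comes from zero-filling masked slots in place before the reduction.

```python
import numpy as np

def dot_clobbering(data, mask):
    # Hypothetical helper: zero the masked slots IN PLACE so they
    # contribute nothing to the dot product. The caller's array changes.
    data[mask] = 0
    return np.dot(data, data)

def dot_no_clobber(data, mask):
    # Hypothetical helper: np.where allocates a new filled array,
    # so the caller's array is left untouched at the cost of a copy.
    filled = np.where(mask, 0, data)
    return np.dot(filled, filled)

a = np.arange(6)
mask = a < 3

safe = dot_no_clobber(a, mask)
a_after_safe = a.copy()      # snapshot: still 0..5 at this point

fast = dot_clobbering(a, mask)
# a has now been modified: the masked positions hold 0.
```

Both calls return the same value; they differ only in whether the array the user still holds is modified afterwards.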
If we agree that masked values will contain nonsense, it seems like a bad idea for those values to be easily exposed.
Further, in all the comments so far I have not seen an example of a need for unmasking that is not more easily, efficiently and safely achieved by simply creating a new MaskedArray with a different mask.
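As an illustration of that pattern, here is a small sketch using the existing numpy.ma.MaskedArray as a stand-in for the proposed class (an assumption made only so the example runs; the new implementation may differ in what it stores under the mask). Rather than unmasking in place, a second masked array is built over the same data with a different mask.

```python
import numpy as np

data = np.arange(6)

# First view of the data: hide everything below 3.
m = np.ma.MaskedArray(data, mask=data < 3)

# Instead of unmasking elements of m, create a new masked array
# over the same data with a smaller mask. m itself is not modified.
m2 = np.ma.MaskedArray(data, mask=data < 1)

total_m = m.sum()     # sums 3, 4, 5
total_m2 = m2.sum()   # sums 1, 2, 3, 4, 5
```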
My understanding is that essentially every low-level MaskedArray function is implemented by looking at the data and mask separately. If so, we should definitely expose this API directly to users (as part of the public API for MaskedArray), so they can write their own efficient algorithms.

As a concrete example, suppose I wanted to implement a low-level "grouped mean" operation for masked arrays like that found in pandas. This isn't a builtin NumPy function, so I would need to write this myself. This would be relatively straightforward to do in Numba or Cython with raw NumPy arrays (e.g., see my example here for a NaN skipping version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py), but to do it efficiently you definitely don't want to make an unnecessary copy.
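A rough sketch of what such a grouped mean could look like when the data and mask are available as separate raw arrays. This is not the numbagg code; the function name grouped_mean and its signature are hypothetical, and it is written as a plain Python loop of the kind one would hand to Numba or Cython. It only reads from data and mask and never copies them.

```python
import numpy as np

def grouped_mean(data, mask, labels, num_groups):
    """Per-group mean of the unmasked values (hypothetical helper).

    data, mask and labels are 1-D arrays of equal length; mask is True
    where a value is masked; labels holds group ids in [0, num_groups).
    Groups with no unmasked values get NaN.
    """
    sums = np.zeros(num_groups, dtype=np.float64)
    counts = np.zeros(num_groups, dtype=np.int64)
    for i in range(data.shape[0]):
        if mask[i]:
            continue  # skip masked entries without reading their value
        sums[labels[i]] += data[i]
        counts[labels[i]] += 1
    out = np.full(num_groups, np.nan)
    nonempty = counts > 0
    out[nonempty] = sums[nonempty] / counts[nonempty]
    return out

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mask = np.array([False, True, False, False, True, True])
labels = np.array([0, 0, 1, 1, 2, 0])

result = grouped_mean(data, mask, labels, 3)
# group 0 -> 1.0, group 1 -> 3.5, group 2 -> NaN (all values masked)
```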
The usual reason for hiding implementation details is when we want to reserve the right to change them. But if we're sure about the data model (which I think we are for MaskedArray) then I think there's a lot of value in exposing it directly to users, even if it's lower level than is appropriate to use in most cases.
Fair enough, I think it is all right to allow people access to ._data and make some guarantees about it if they are implementing subclasses or defining new ducktypes. There should be a section in the documentation describing what guarantees we make about ._data (or ._array if we change the name) and how/when to use it.

Best,
Allan
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion