
On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane <allanhaldane@gmail.com> wrote:
1. Making a "no-clobber" guarantee on the underlying data
Hi Allan -- could you kindly clarify what you mean by "no-clobber"? Is this referring to allowing masked arrays to mutate masked data values in-place, even on apparently non-in-place operators? If so, that definitely seems like a bad idea to me. I would much rather do an unnecessary copy than have surprisingly non-thread-safe operations.
If we agree that masked values will contain nonsense, it seems like a bad idea for those values to be easily exposed.
Further, in all the comments so far I have not seen an example of a need for unmasking that is not more easily, efficiently and safely achieved by simply creating a new MaskedArray with a different mask.
My understanding is that essentially every low-level MaskedArray function is implemented by looking at the data and mask separately. If so, we should definitely expose this API directly to users (as part of the public API for MaskedArray), so they can write their own efficient algorithms.
As a concrete example, suppose I wanted to implement a low-level "grouped mean" operation for masked arrays like that found in pandas. This isn't a built-in NumPy function, so I would need to write it myself. This would be relatively straightforward to do in Numba or Cython with raw NumPy arrays (e.g., see my example here for a NaN-skipping version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py), but to do it efficiently you definitely don't want to make an unnecessary copy.
The usual reason for hiding implementation details is that we want to reserve the right to change them. But if we're sure about the data model (which I think we are for MaskedArray), then I think there's a lot of value in exposing it directly to users, even if it's lower level than is appropriate to use in most cases.
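To make the grouped-mean case concrete, here is a rough sketch in plain Python, assuming the data and mask are available as two separate NumPy arrays. The function name `grouped_mean`, its signature, and the convention that negative labels are skipped are illustrative assumptions on my part, not an existing NumPy API. A real version would compile the loop with Numba or Cython rather than run it in the interpreter.

```python
import numpy as np


def grouped_mean(data, mask, labels, num_groups):
    """Mean of the unmasked values in each group (hypothetical helper).

    data, mask and labels are 1-D arrays of equal length. Elements with
    mask[i] == True or labels[i] < 0 are skipped. Only reads its inputs.
    """
    sums = np.zeros(num_groups, dtype=np.float64)
    counts = np.zeros(num_groups, dtype=np.int64)
    for i in range(data.shape[0]):
        label = labels[i]
        if label < 0 or mask[i]:
            continue  # masked or unlabeled: never look at data[i]
        sums[label] += data[i]
        counts[label] += 1
    # Groups with no valid elements come back masked.
    out_mask = counts == 0
    means = np.zeros(num_groups, dtype=np.float64)
    np.divide(sums, counts, out=means, where=~out_mask)
    return np.ma.MaskedArray(means, mask=out_mask)


values = np.ma.MaskedArray(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    mask=[False, True, False, False, True],
)
labels = np.array([0, 0, 1, 1, 2])

# With today's numpy.ma, .data is a view of the underlying buffer (no copy),
# and getmaskarray returns the mask as a full boolean array.
result = grouped_mean(values.data, np.ma.getmaskarray(values), labels, 3)
```

The point is only that the loop reads the two arrays directly and never needs an intermediate filled or unmasked copy, which is what direct access to the data and mask would allow.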