Mailman 3 new MaskedArray class - NumPy-Discussion

new MaskedArray class

older
NumPy Community Meeting Thursday,...

Allan Haldane

June 17, 2019

6:43 p.m.

Hi all, Chuck suggested we think about a MaskedArray replacement for 1.18. A few months ago I did some work on a MaskedArray replacement using `__array_function__`, which I got mostly working. It seems like a good time to bring it up for discussion now. See it at: https://github.com/ahaldane/ndarray_ducktypes It should be very usable, it has docs you can read, and it passes a pytest-suite with 1024 tests adapted from numpy's MaskedArray tests. What is missing? It needs even more tests for new functionality, and a couple numpy-API functions are missing, in particular `np.median`, `np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could also find many things to improve too. Besides fixing many little annoyances from MaskedArray, and simplifying the logic by always storing the mask in full, it also has new features. For instance it allows the use of a "X" variable to mark masked locations during array construction, and I solve the issue of how to mask individual fields of a structured array differently. At this point I would by happy to get some feedback on the design and what seems good or bad. If it seems like a good start, I'd be happy to move it into a numpy repo of some sort for further collaboration & discussion, and maybe into 1.18. At the least I hope it can serve as a design study of what we could do. Let me also drop here two more interesting detailed issues: First, the issue of what to do about .real and .imag of complex arrays, and similarly about field-assignment of structured arrays. The problem is that we have a single mask bool per element of a complex array, but what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the mask of the original array change? Should we make .real and .imag readonly? Second, a more general issue of how to ducktype scalars when using `__array_function__` which I think many ducktype implementors will have to face. For MaskedArray, I created an associated "MaskedScalar" type. However, MaskedScalar has to behave differently from normal numpy scalars in a number of ways: It is not part of the numpy scalar hierarchy, it fails checks `isinstance(var, np.floating)`, and np.isscalar returns false. Numpy scalar types cannot be subclassed. We have discussed before the need to have distinction between 0d-arrays and scalars, so we shouldn't just use a 0d (in fact, this makes printing very difficult). This leads me to think that in future dtype-overhaul plans, we should consider creating a subclassable `np.scalar` base type to wrap all numpy scalar variables, and code like `isinstance(var, np.floating)` should be replaced by `isinstance(var.dtype.type, np.floating)` or similar. That is, the numeric dtype of the scalar is no longer encoded in `type(var)` but in `var.dtype`: The fact that the variable is a numpy scalar is decoupled from its numeric dtype. This is useful because there are many "associated" properties of scalars in common with arrays which have nothing to do with the dtype, which ducktype implementors want to touch. I imagine this will come up a lot: In that repo I also have an "ArrayCollection" ducktype which required a "CollectionScalar" scalar, and similarly I imagine people implementing units want the units attached to the scalar, independently of the dtype. Cheers, Allan

Show replies by date

Marten van Kerkwijk

June 2019

10:06 a.m.

Hi Allen, Thanks for the message and link! In astropy, we've been struggling with masking a lot, and one of the main conclusions I have reached is that ideally one has a more abstract `Masked` class that can take any type of data (including `ndarray`, of course), and behaves like that data as much as possible, to the extent that if, e.g., I create a `Masked(Quantity(..., unit), mask=...)`, the instance will have a `.unit` attribute and perhaps even `isinstance(..., Quantity)` will hold. And similarly for `Masked(Time(...), mask=...)`, `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a kind of Mixin-class that just tracks a mask attribute. This may be too much to ask from the initializer, but, if so, it still seems most useful if it is made as easy as possible to do, say, `class MaskedQuantity(Masked, Quantity): <very few overrides>`. Even if this impossible, I think it is conceptually useful to think about what the masking class should do. My sense is that, e.g., it should not attempt to decide when an operation succeeds or not, but just "or together" input masks for regular, multiple-input functions, and let the underlying arrays skip elements for reductions by using `where` (hey, I did implement that for a reason... ;-). In particular, it suggests one should not have things like domains and all that (I never understood why `MaskedArray` did that). If one wants more, the class should provide a method that updates the mask (a sensible default might be `mask |= ~np.isfinite(result)` - here, the class being masked should logically support ufuncs and functions, so it can decide what "isfinite" means). In any case, I would think that a basic truth should be that everything has a mask with a shape consistent with the data, so 1. Each complex numbers has just one mask, and setting `a.imag` with a masked array should definitely propagate the mask. 2. For a masked array with structured dtype, I'd similarly say that the default is for a mask to have the same shape as the array. But that something like your collection makes sense for the case where one wants to mask items in a structure. All the best, Marten p.s. I started trying to implement the above "Mixin" class; will try to clean that up a bit so that at least it uses `where` and push it up. On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...

Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using `__array_function__`, which I got mostly working. It seems like a good time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a pytest-suite with 1024 tests adapted from numpy's MaskedArray tests. What is missing? It needs even more tests for new functionality, and a couple numpy-API functions are missing, in particular `np.median`, `np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying the logic by always storing the mask in full, it also has new features. For instance it allows the use of a "X" variable to mark masked locations during array construction, and I solve the issue of how to mask individual fields of a structured array differently.

At this point I would by happy to get some feedback on the design and what seems good or bad. If it seems like a good start, I'd be happy to move it into a numpy repo of some sort for further collaboration & discussion, and maybe into 1.18. At the least I hope it can serve as a design study of what we could do.

Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays, and similarly about field-assignment of structured arrays. The problem is that we have a single mask bool per element of a complex array, but what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using `__array_function__` which I think many ducktype implementors will have to face. For MaskedArray, I created an associated "MaskedScalar" type. However, MaskedScalar has to behave differently from normal numpy scalars in a number of ways: It is not part of the numpy scalar hierarchy, it fails checks `isinstance(var, np.floating)`, and np.isscalar returns false. Numpy scalar types cannot be subclassed. We have discussed before the need to have distinction between 0d-arrays and scalars, so we shouldn't just use a 0d (in fact, this makes printing very difficult). This leads me to think that in future dtype-overhaul plans, we should consider creating a subclassable `np.scalar` base type to wrap all numpy scalar variables, and code like `isinstance(var, np.floating)` should be replaced by `isinstance(var.dtype.type, np.floating)` or similar. That is, the numeric dtype of the scalar is no longer encoded in `type(var)` but in `var.dtype`: The fact that the variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars in common with arrays which have nothing to do with the dtype, which ducktype implementors want to touch. I imagine this will come up a lot: In that repo I also have an "ArrayCollection" ducktype which required a "CollectionScalar" scalar, and similarly I imagine people implementing units want the units attached to the scalar, independently of the dtype.

Cheers, Allan

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Allan Haldane

12:54 p.m.

On 6/18/19 10:06 AM, Marten van Kerkwijk wrote:

...

Hi Allen,

Thanks for the message and link! In astropy, we've been struggling with masking a lot, and one of the main conclusions I have reached is that ideally one has a more abstract `Masked` class that can take any type of data (including `ndarray`, of course), and behaves like that data as much as possible, to the extent that if, e.g., I create a `Masked(Quantity(..., unit), mask=...)`, the instance will have a `.unit` attribute and perhaps even `isinstance(..., Quantity)` will hold. And similarly for `Masked(Time(...), mask=...)`, `Masked(SkyCoord(...), mask=...)`, etc. In a way, `Masked` would be a kind of Mixin-class that just tracks a mask attribute.

This may be too much to ask from the initializer, but, if so, it still seems most useful if it is made as easy as possible to do, say, `class MaskedQuantity(Masked, Quantity): <very few overrides>`.

Currently MaskedArray does not accept ducktypes as underlying arrays, but I think it shouldn't be too hard to modify it to do so. Good idea! I already partly navigated this mixin-issue in the "MaskedArrayCollection" class, which essentially does ArrayCollection(MaskedArray(array)), and only takes about 30 lines of boilerplate. That's the backwards encapsulation order from what you want though.

...

Even if this impossible, I think it is conceptually useful to think about what the masking class should do. My sense is that, e.g., it should not attempt to decide when an operation succeeds or not, but just "or together" input masks for regular, multiple-input functions, and let the underlying arrays skip elements for reductions by using `where` (hey, I did implement that for a reason... ;-). In particular, it suggests one should not have things like domains and all that (I never understood why `MaskedArray` did that). If one wants more, the class should provide a method that updates the mask (a sensible default might be `mask |= ~np.isfinite(result)` - here, the class being masked should logically support ufuncs and functions, so it can decide what "isfinite" means).

I agree it would be nice to remove domains. It would make life easier, and I could remove a lot of twiddly code! I kept it in for now to minimize the behavior changes from the old MaskedArray.

...

In any case, I would think that a basic truth should be that everything has a mask with a shape consistent with the data, so 1. Each complex numbers has just one mask, and setting `a.imag` with a masked array should definitely propagate the mask. 2. For a masked array with structured dtype, I'd similarly say that the default is for a mask to have the same shape as the array. But that something like your collection makes sense for the case where one wants to mask items in a structure.

Agreed that we should have a single bool per complex or structured element, and the mask shape is the same as the array shape. That's how I implemented it. But there is still a problem with complex.imag assignment: >>> a = MaskedArray([1j, 2, X]) >>> i = a.imag >>> i[:] = MaskedArray([1, X, 1]) If we make the last line copy the mask to the original array, what should the real part of a[2] be? Conversely, if we don't copy the mask, what should the imag part of a[1] be? It seems like we might "want" the masks to be OR'd instead, but then should i[2] be masked after we just set it to 1?

...

All the best,

Marten

p.s. I started trying to implement the above "Mixin" class; will try to clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released. To avoid duplication of effort, I've attached a diff of what I tried. I actually get a slight slowdown of about 10% by using where... If you make progress with the mixin, a push is welcome. I imagine a problem is going to be that np.isscalar doesn't work to detect duck scalars. Cheers, Allan

...

On Mon, Jun 17, 2019 at 6:43 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

Hi all,

Chuck suggested we think about a MaskedArray replacement for 1.18.

A few months ago I did some work on a MaskedArray replacement using `__array_function__`, which I got mostly working. It seems like a good time to bring it up for discussion now. See it at:

https://github.com/ahaldane/ndarray_ducktypes

It should be very usable, it has docs you can read, and it passes a pytest-suite with 1024 tests adapted from numpy's MaskedArray tests. What is missing? It needs even more tests for new functionality, and a couple numpy-API functions are missing, in particular `np.median`, `np.percentile`, `np.cov`, and `np.corrcoef`. I'm sure other devs could also find many things to improve too.

Besides fixing many little annoyances from MaskedArray, and simplifying the logic by always storing the mask in full, it also has new features. For instance it allows the use of a "X" variable to mark masked locations during array construction, and I solve the issue of how to mask individual fields of a structured array differently.

At this point I would by happy to get some feedback on the design and what seems good or bad. If it seems like a good start, I'd be happy to move it into a numpy repo of some sort for further collaboration & discussion, and maybe into 1.18. At the least I hope it can serve as a design study of what we could do.

Let me also drop here two more interesting detailed issues:

First, the issue of what to do about .real and .imag of complex arrays, and similarly about field-assignment of structured arrays. The problem is that we have a single mask bool per element of a complex array, but what if someone does `arr.imag = MaskedArray([1,X,1])`? How should the mask of the original array change? Should we make .real and .imag readonly?

Second, a more general issue of how to ducktype scalars when using `__array_function__` which I think many ducktype implementors will have to face. For MaskedArray, I created an associated "MaskedScalar" type. However, MaskedScalar has to behave differently from normal numpy scalars in a number of ways: It is not part of the numpy scalar hierarchy, it fails checks `isinstance(var, np.floating)`, and np.isscalar returns false. Numpy scalar types cannot be subclassed. We have discussed before the need to have distinction between 0d-arrays and scalars, so we shouldn't just use a 0d (in fact, this makes printing very difficult). This leads me to think that in future dtype-overhaul plans, we should consider creating a subclassable `np.scalar` base type to wrap all numpy scalar variables, and code like `isinstance(var, np.floating)` should be replaced by `isinstance(var.dtype.type, np.floating)` or similar. That is, the numeric dtype of the scalar is no longer encoded in `type(var)` but in `var.dtype`: The fact that the variable is a numpy scalar is decoupled from its numeric dtype.

This is useful because there are many "associated" properties of scalars in common with arrays which have nothing to do with the dtype, which ducktype implementors want to touch. I imagine this will come up a lot: In that repo I also have an "ArrayCollection" ducktype which required a "CollectionScalar" scalar, and similarly I imagine people implementing units want the units attached to the scalar, independently of the dtype.

Cheers, Allan

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

2:04 p.m.

On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com> wrote: <snip>

...

...
This may be too much to ask from the initializer, but, if so, it still seems most useful if it is made as easy as possible to do, say, `class MaskedQuantity(Masked, Quantity): <very few overrides>`.

Currently MaskedArray does not accept ducktypes as underlying arrays, but I think it shouldn't be too hard to modify it to do so. Good idea!

Looking back at my trial, I see that I also never got to duck arrays - only ndarray subclasses - though I tried to make the code as agnostic as possible. (Trial at https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... ) I already partly navigated this mixin-issue in the

...

"MaskedArrayCollection" class, which essentially does ArrayCollection(MaskedArray(array)), and only takes about 30 lines of boilerplate. That's the backwards encapsulation order from what you want though.

Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, mask=[True, False, False])` does indeed not have a `.unit` attribute (and cannot represent itself...); I'm not at all sure that my method of just creating a mixed class is anything but a recipe for disaster, though!

...

...
Even if this impossible, I think it is conceptually useful to think about what the masking class should do. My sense is that, e.g., it should not attempt to decide when an operation succeeds or not, but just "or together" input masks for regular, multiple-input functions, and let the underlying arrays skip elements for reductions by using `where` (hey, I did implement that for a reason... ;-). In particular, it suggests one should not have things like domains and all that (I never understood why `MaskedArray` did that). If one wants more, the class should provide a method that updates the mask (a sensible default might be `mask |= ~np.isfinite(result)` - here, the class being masked should logically support ufuncs and functions, so it can decide what "isfinite" means).

I agree it would be nice to remove domains. It would make life easier, and I could remove a lot of twiddly code! I kept it in for now to minimize the behavior changes from the old MaskedArray.

That makes sense. Could be separated out to a backwards-compatibility class later.

...

...
In any case, I would think that a basic truth should be that everything has a mask with a shape consistent with the data, so 1. Each complex numbers has just one mask, and setting `a.imag` with a masked array should definitely propagate the mask. 2. For a masked array with structured dtype, I'd similarly say that the default is for a mask to have the same shape as the array. But that something like your collection makes sense for the case where one wants to mask items in a structure.

Agreed that we should have a single bool per complex or structured element, and the mask shape is the same as the array shape. That's how I implemented it. But there is still a problem with complex.imag assignment:

>>> a = MaskedArray([1j, 2, X]) >>> i = a.imag >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what should the real part of a[2] be? Conversely, if we don't copy the mask, what should the imag part of a[1] be? It seems like we might "want" the masks to be OR'd instead, but then should i[2] be masked after we just set it to 1?

Ah, I see the issue now... Easiest to implement and closest in analogy to a regular view would be to just let it unmask a[2] (with whatever is in real; user beware!).

Perhaps better would be to special-case such that `imag` returns a read-only view of the mask. Making `imag` itself read-only would prevent possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is no reason this should update the mask. Still, neither is really satisfactory...

...

...
p.s. I started trying to implement the above "Mixin" class; will try to clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released. To avoid duplication of effort, I've attached a diff of what I tried. I actually get a slight slowdown of about 10% by using where...

Your implementation is indeed quite similar to what I got in __array_ufunc__ (though one should "&" the where with ~mask). I think the main benefit is not to presume that whatever is underneath understands 0 or 1, i.e., avoid filling.

...

If you make progress with the mixin, a push is welcome. I imagine a problem is going to be that np.isscalar doesn't work to detect duck scalars.

I fear that in my attempts I've simply decided that only array scalars exist...

-- Marten

Allan Haldane

5:44 p.m.

On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:

...

On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote: <snip>

> This may be too much to ask from the initializer, but, if so, it still > seems most useful if it is made as easy as possible to do, say, `class > MaskedQuantity(Masked, Quantity): <very few overrides>`.

Currently MaskedArray does not accept ducktypes as underlying arrays, but I think it shouldn't be too hard to modify it to do so. Good idea!

Looking back at my trial, I see that I also never got to duck arrays - only ndarray subclasses - though I tried to make the code as agnostic as possible.

(Trial at https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...)

I already partly navigated this mixin-issue in the "MaskedArrayCollection" class, which essentially does ArrayCollection(MaskedArray(array)), and only takes about 30 lines of boilerplate. That's the backwards encapsulation order from what you want though.

Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, mask=[True, False, False])` does indeed not have a `.unit` attribute (and cannot represent itself...); I'm not at all sure that my method of just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my MaskedArray more easily encapsulates both ducktypes and ndarray subclasses (pushed to repo). Here's an example I got working with masked units using unyt: [1]: from MaskedArray import X, MaskedArray, MaskedScalar [2]: from unyt import m, km [3]: import numpy as np [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) [5]: uarr MaskedArray([1., X , 3.]) [6]: uarr + 1*m MaskedArray([1.001, X , 3.001]) [7]: uarr.filled() unyt_array([1., 0., 3.], 'km') [8]: np.concatenate([uarr, 2*uarr]).filled() unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') The catch is the ducktype/subclass has to rigorously follow numpy's indexing rules, including distinguishing 0d arrays from scalars. For now only I used unyt in the example above since it happens to be less strict about dimensionless operations than astropy.units which trips up my repr code. (see below for example with astropy.units). Note in the last line I lost the dimensions, but that is because unyt does not handle np.concatenate. To get that to work we need a true ducktype for units. The example above doesn't expose the ".units" attribute outside the MaskedArray, and it doesn't print the units in the repr. But you can access them using "filled". While I could make MaskedArray forward unknown attribute accesses to the encapsulated array, that seems a bit dangerous/bug-prone at first glance, so probably I want to require the user to make a MaskedArray subclass to do so. I've just started playing with that (probably buggy), and Ive attached subclass examples for astropy.unit and unyt, with some example output below. Cheers, Allan Example using the attached astropy unit subclass: >>> from astropy.units import m, km, s >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) >>> uarr MaskedQ([1., X , 1.], units=km) >>> uarr.units km >>> uarr + (1*m) MaskedQ([1.001, X , 1.001], units=km) >>> uarr/(1*s) MaskedQ([1., X , 1.], units=km / s) >>> (uarr*(1*m))[1:] MaskedQ([X , 1.], units=km m) >>> np.add.outer(uarr, uarr) MaskedQ([[2., X , 2.], [X , X , X ], [2., X , 2.]], units=km) >>> print(uarr) [1. X 1.] km m Cheers, Allan

...

> Even if this impossible, I think it is conceptually useful to think > about what the masking class should do. My sense is that, e.g., it > should not attempt to decide when an operation succeeds or not, but just > "or together" input masks for regular, multiple-input functions, and let > the underlying arrays skip elements for reductions by using `where` > (hey, I did implement that for a reason... ;-). In particular, it > suggests one should not have things like domains and all that (I never > understood why `MaskedArray` did that). If one wants more, the class > should provide a method that updates the mask (a sensible default might > be `mask |= ~np.isfinite(result)` - here, the class being masked should > logically support ufuncs and functions, so it can decide what "isfinite" > means).

I agree it would be nice to remove domains. It would make life easier, and I could remove a lot of twiddly code! I kept it in for now to minimize the behavior changes from the old MaskedArray.

That makes sense. Could be separated out to a backwards-compatibility class later.

> In any case, I would think that a basic truth should be that everything > has a mask with a shape consistent with the data, so > 1. Each complex numbers has just one mask, and setting `a.imag` with a > masked array should definitely propagate the mask. > 2. For a masked array with structured dtype, I'd similarly say that the > default is for a mask to have the same shape as the array. But that > something like your collection makes sense for the case where one wants > to mask items in a structure.

Agreed that we should have a single bool per complex or structured element, and the mask shape is the same as the array shape. That's how I implemented it. But there is still a problem with complex.imag assignment:

>>> a = MaskedArray([1j, 2, X]) >>> i = a.imag >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what should the real part of a[2] be? Conversely, if we don't copy the mask, what should the imag part of a[1] be? It seems like we might "want" the masks to be OR'd instead, but then should i[2] be masked after we just set it to 1?

Ah, I see the issue now... Easiest to implement and closest in analogy to a regular view would be to just let it unmask a[2] (with whatever is in real; user beware!).

Perhaps better would be to special-case such that `imag` returns a read-only view of the mask. Making `imag` itself read-only would prevent possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is no reason this should update the mask.

Still, neither is really satisfactory...

> p.s. I started trying to implement the above "Mixin" class; will try to > clean that up a bit so that at least it uses `where` and push it up.

I played with "where", but didn't include it since 1.17 is not released. To avoid duplication of effort, I've attached a diff of what I tried. I actually get a slight slowdown of about 10% by using where...

Your implementation is indeed quite similar to what I got in __array_ufunc__ (though one should "&" the where with ~mask).

I think the main benefit is not to presume that whatever is underneath understands 0 or 1, i.e., avoid filling.

If you make progress with the mixin, a push is welcome. I imagine a problem is going to be that np.isscalar doesn't work to detect duck scalars.

I fear that in my attempts I've simply decided that only array scalars exist...

-- Marten

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

10:19 p.m.

Hi Allan, This is very impressive! I could get the tests that I wrote for my class pass with yours using Quantity with what I would consider very minimal changes. I only could not find a good way to unmask data (I like the idea of setting the mask on some elements via `ma[item] = X`); is this on purpose? Anyway, it would seem easily at the point where I should comment on your repository rather than in the mailing list! All the best, Marten On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...

On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:

...
On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote: <snip>

> This may be too much to ask from the initializer, but, if so, it

still

...
> seems most useful if it is made as easy as possible to do, say,

`class

...
> MaskedQuantity(Masked, Quantity): <very few overrides>`.

Currently MaskedArray does not accept ducktypes as underlying arrays, but I think it shouldn't be too hard to modify it to do so. Good

idea!

...
Looking back at my trial, I see that I also never got to duck arrays - only ndarray subclasses - though I tried to make the code as agnostic as possible.

(Trial at

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
I already partly navigated this mixin-issue in the "MaskedArrayCollection" class, which essentially does ArrayCollection(MaskedArray(array)), and only takes about 30 lines of boilerplate. That's the backwards encapsulation order from what you

want

...
though.

Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, mask=[True, False, False])` does indeed not have a `.unit` attribute (and cannot represent itself...); I'm not at all sure that my method of just creating a mixed class is anything but a recipe for disaster,

though!

Based on your suggestion I worked on this a little today, and now my MaskedArray more easily encapsulates both ducktypes and ndarray subclasses (pushed to repo). Here's an example I got working with masked units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.]) [6]: uarr + 1*m

MaskedArray([1.001, X , 3.001]) [7]: uarr.filled()

unyt_array([1., 0., 3.], 'km') [8]: np.concatenate([uarr, 2*uarr]).filled() unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's indexing rules, including distinguishing 0d arrays from scalars. For now only I used unyt in the example above since it happens to be less strict about dimensionless operations than astropy.units which trips up my repr code. (see below for example with astropy.units). Note in the last line I lost the dimensions, but that is because unyt does not handle np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the MaskedArray, and it doesn't print the units in the repr. But you can access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the encapsulated array, that seems a bit dangerous/bug-prone at first glance, so probably I want to require the user to make a MaskedArray subclass to do so. I've just started playing with that (probably buggy), and Ive attached subclass examples for astropy.unit and unyt, with some example output below.

Cheers, Allan

Example using the attached astropy unit subclass:

>>> from astropy.units import m, km, s >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) >>> uarr MaskedQ([1., X , 1.], units=km) >>> uarr.units km >>> uarr + (1*m) MaskedQ([1.001, X , 1.001], units=km) >>> uarr/(1*s) MaskedQ([1., X , 1.], units=km / s) >>> (uarr*(1*m))[1:] MaskedQ([X , 1.], units=km m) >>> np.add.outer(uarr, uarr) MaskedQ([[2., X , 2.], [X , X , X ], [2., X , 2.]], units=km) >>> print(uarr) [1. X 1.] km m

Cheers, Allan

...
> Even if this impossible, I think it is conceptually useful to think > about what the masking class should do. My sense is that, e.g., it > should not attempt to decide when an operation succeeds or not, but just > "or together" input masks for regular, multiple-input functions, and let > the underlying arrays skip elements for reductions by using `where` > (hey, I did implement that for a reason... ;-). In particular, it > suggests one should not have things like domains and all that (I

never

...
> understood why `MaskedArray` did that). If one wants more, the

class

...
> should provide a method that updates the mask (a sensible default might > be `mask |= ~np.isfinite(result)` - here, the class being masked should > logically support ufuncs and functions, so it can decide what "isfinite" > means).

I agree it would be nice to remove domains. It would make life

easier,

...
and I could remove a lot of twiddly code! I kept it in for now to minimize the behavior changes from the old MaskedArray.

That makes sense. Could be separated out to a backwards-compatibility class later.

> In any case, I would think that a basic truth should be that everything > has a mask with a shape consistent with the data, so > 1. Each complex numbers has just one mask, and setting `a.imag`

with a

...
> masked array should definitely propagate the mask. > 2. For a masked array with structured dtype, I'd similarly say that the > default is for a mask to have the same shape as the array. But that > something like your collection makes sense for the case where one wants > to mask items in a structure.

Agreed that we should have a single bool per complex or structured element, and the mask shape is the same as the array shape. That's

how I

...
implemented it. But there is still a problem with complex.imag assignment:

>>> a = MaskedArray([1j, 2, X]) >>> i = a.imag >>> i[:] = MaskedArray([1, X, 1])

If we make the last line copy the mask to the original array, what should the real part of a[2] be? Conversely, if we don't copy the

mask,

...
what should the imag part of a[1] be? It seems like we might "want"

the

...
masks to be OR'd instead, but then should i[2] be masked after we

just

...
set it to 1?

Ah, I see the issue now... Easiest to implement and closest in analogy to a regular view would be to just let it unmask a[2] (with whatever is in real; user beware!).

Perhaps better would be to special-case such that `imag` returns a read-only view of the mask. Making `imag` itself read-only would prevent possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is no reason this should update the mask.

Still, neither is really satisfactory...

> p.s. I started trying to implement the above "Mixin" class; will try to > clean that up a bit so that at least it uses `where` and push it

up.

...
I played with "where", but didn't include it since 1.17 is not

released.

...
To avoid duplication of effort, I've attached a diff of what I

tried. I

...
actually get a slight slowdown of about 10% by using where...

Your implementation is indeed quite similar to what I got in __array_ufunc__ (though one should "&" the where with ~mask).

I think the main benefit is not to presume that whatever is underneath understands 0 or 1, i.e., avoid filling.

If you make progress with the mixin, a push is welcome. I imagine a problem is going to be that np.isscalar doesn't work to detect duck scalars.

I fear that in my attempts I've simply decided that only array scalars exist...

-- Marten

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Allan Haldane

12:44 p.m.

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:

...

Hi Allan,

This is very impressive! I could get the tests that I wrote for my class pass with yours using Quantity with what I would consider very minimal changes. I only could not find a good way to unmask data (I like the idea of setting the mask on some elements via `ma[item] = X`); is this on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

...

Anyway, it would seem easily at the point where I should comment on your repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit. Best, Allan

...

All the best,

Marten

On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > <snip> > > > This may be too much to ask from the initializer, but, if so, it still > > seems most useful if it is made as easy as possible to do, say, `class > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > Currently MaskedArray does not accept ducktypes as underlying arrays, > but I think it shouldn't be too hard to modify it to do so. Good idea! > > > Looking back at my trial, I see that I also never got to duck arrays - > only ndarray subclasses - though I tried to make the code as agnostic as > possible. > > (Trial at > https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...) > > I already partly navigated this mixin-issue in the > "MaskedArrayCollection" class, which essentially does > ArrayCollection(MaskedArray(array)), and only takes about 30 lines of > boilerplate. That's the backwards encapsulation order from what you want > though. > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > mask=[True, False, False])` does indeed not have a `.unit` attribute > (and cannot represent itself...); I'm not at all sure that my method of > just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my MaskedArray more easily encapsulates both ducktypes and ndarray subclasses (pushed to repo). Here's an example I got working with masked units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.]) [6]: uarr + 1*m

MaskedArray([1.001, X , 3.001]) [7]: uarr.filled()

unyt_array([1., 0., 3.], 'km') [8]: np.concatenate([uarr, 2*uarr]).filled() unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's indexing rules, including distinguishing 0d arrays from scalars. For now only I used unyt in the example above since it happens to be less strict about dimensionless operations than astropy.units which trips up my repr code. (see below for example with astropy.units). Note in the last line I lost the dimensions, but that is because unyt does not handle np.concatenate. To get that to work we need a true ducktype for units.

The example above doesn't expose the ".units" attribute outside the MaskedArray, and it doesn't print the units in the repr. But you can access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to the encapsulated array, that seems a bit dangerous/bug-prone at first glance, so probably I want to require the user to make a MaskedArray subclass to do so. I've just started playing with that (probably buggy), and Ive attached subclass examples for astropy.unit and unyt, with some example output below.

Cheers, Allan

Example using the attached astropy unit subclass:

>>> from astropy.units import m, km, s >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) >>> uarr MaskedQ([1., X , 1.], units=km) >>> uarr.units km >>> uarr + (1*m) MaskedQ([1.001, X , 1.001], units=km) >>> uarr/(1*s) MaskedQ([1., X , 1.], units=km / s) >>> (uarr*(1*m))[1:] MaskedQ([X , 1.], units=km m) >>> np.add.outer(uarr, uarr) MaskedQ([[2., X , 2.], [X , X , X ], [2., X , 2.]], units=km) >>> print(uarr) [1. X 1.] km m

Cheers, Allan

> > Even if this impossible, I think it is conceptually useful to think > > about what the masking class should do. My sense is that, e.g., it > > should not attempt to decide when an operation succeeds or not, > but just > > "or together" input masks for regular, multiple-input functions, > and let > > the underlying arrays skip elements for reductions by using `where` > > (hey, I did implement that for a reason... ;-). In particular, it > > suggests one should not have things like domains and all that (I never > > understood why `MaskedArray` did that). If one wants more, the class > > should provide a method that updates the mask (a sensible default > might > > be `mask |= ~np.isfinite(result)` - here, the class being masked > should > > logically support ufuncs and functions, so it can decide what > "isfinite" > > means). > > I agree it would be nice to remove domains. It would make life easier, > and I could remove a lot of twiddly code! I kept it in for now to > minimize the behavior changes from the old MaskedArray. > > > That makes sense. Could be separated out to a backwards-compatibility > class later. > > > > In any case, I would think that a basic truth should be that > everything > > has a mask with a shape consistent with the data, so > > 1. Each complex numbers has just one mask, and setting `a.imag` with a > > masked array should definitely propagate the mask. > > 2. For a masked array with structured dtype, I'd similarly say > that the > > default is for a mask to have the same shape as the array. But that > > something like your collection makes sense for the case where one > wants > > to mask items in a structure. > > Agreed that we should have a single bool per complex or structured > element, and the mask shape is the same as the array shape. That's how I > implemented it. But there is still a problem with complex.imag > assignment: > > >>> a = MaskedArray([1j, 2, X]) > >>> i = a.imag > >>> i[:] = MaskedArray([1, X, 1]) > > If we make the last line copy the mask to the original array, what > should the real part of a[2] be? Conversely, if we don't copy the mask, > what should the imag part of a[1] be? It seems like we might "want" the > masks to be OR'd instead, but then should i[2] be masked after we just > set it to 1? > > Ah, I see the issue now... Easiest to implement and closest in analogy > to a regular view would be to just let it unmask a[2] (with whatever is > in real; user beware!). > > Perhaps better would be to special-case such that `imag` returns a > read-only view of the mask. Making `imag` itself read-only would prevent > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is > no reason this should update the mask. > > Still, neither is really satisfactory... > > > > > p.s. I started trying to implement the above "Mixin" class; will > try to > > clean that up a bit so that at least it uses `where` and push it up. > > I played with "where", but didn't include it since 1.17 is not released. > To avoid duplication of effort, I've attached a diff of what I tried. I > actually get a slight slowdown of about 10% by using where... > > > Your implementation is indeed quite similar to what I got in > __array_ufunc__ (though one should "&" the where with ~mask). > > I think the main benefit is not to presume that whatever is underneath > understands 0 or 1, i.e., avoid filling. > > > If you make progress with the mixin, a push is welcome. I imagine a > problem is going to be that np.isscalar doesn't work to detect duck > scalars. > > I fear that in my attempts I've simply decided that only array scalars > exist... > > -- Marten > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Benjamin Root

2:37 p.m.

Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks? Cheers! Ben Root On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:

...
Hi Allan,

This is very impressive! I could get the tests that I wrote for my class pass with yours using Quantity with what I would consider very minimal changes. I only could not find a good way to unmask data (I like the idea of setting the mask on some elements via `ma[item] = X`); is this on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

...
Anyway, it would seem easily at the point where I should comment on your repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

...
All the best,

Marten

On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > <snip> > > > This may be too much to ask from the initializer, but, if so, it still > > seems most useful if it is made as easy as possible to do, say, `class > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > Currently MaskedArray does not accept ducktypes as underlying arrays, > but I think it shouldn't be too hard to modify it to do so. Good idea! > > > Looking back at my trial, I see that I also never got to duck

arrays -

...
> only ndarray subclasses - though I tried to make the code as agnostic as > possible. > > (Trial at >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > I already partly navigated this mixin-issue in the > "MaskedArrayCollection" class, which essentially does > ArrayCollection(MaskedArray(array)), and only takes about 30 lines of > boilerplate. That's the backwards encapsulation order from what you want > though. > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > mask=[True, False, False])` does indeed not have a `.unit`

attribute

...
> (and cannot represent itself...); I'm not at all sure that my method of > just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my MaskedArray more easily encapsulates both ducktypes and ndarray subclasses (pushed to repo). Here's an example I got working with

masked

...
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.]) [6]: uarr + 1*m

MaskedArray([1.001, X , 3.001]) [7]: uarr.filled()

unyt_array([1., 0., 3.], 'km') [8]: np.concatenate([uarr, 2*uarr]).filled() unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's indexing rules, including distinguishing 0d arrays from scalars. For

now

...
only I used unyt in the example above since it happens to be less

strict

...
about dimensionless operations than astropy.units which trips up my repr code. (see below for example with astropy.units). Note in the

last

...
line I lost the dimensions, but that is because unyt does not handle np.concatenate. To get that to work we need a true ducktype for

units.

...
The example above doesn't expose the ".units" attribute outside the MaskedArray, and it doesn't print the units in the repr. But you can access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to

the

...
encapsulated array, that seems a bit dangerous/bug-prone at first glance, so probably I want to require the user to make a MaskedArray subclass to do so. I've just started playing with that (probably

buggy),

...
and Ive attached subclass examples for astropy.unit and unyt, with

some

...
example output below.

Cheers, Allan

Example using the attached astropy unit subclass:

>>> from astropy.units import m, km, s >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) >>> uarr MaskedQ([1., X , 1.], units=km) >>> uarr.units km >>> uarr + (1*m) MaskedQ([1.001, X , 1.001], units=km) >>> uarr/(1*s) MaskedQ([1., X , 1.], units=km / s) >>> (uarr*(1*m))[1:] MaskedQ([X , 1.], units=km m) >>> np.add.outer(uarr, uarr) MaskedQ([[2., X , 2.], [X , X , X ], [2., X , 2.]], units=km) >>> print(uarr) [1. X 1.] km m

Cheers, Allan

> > Even if this impossible, I think it is conceptually useful to think > > about what the masking class should do. My sense is that, e.g., it > > should not attempt to decide when an operation succeeds or

not,

...
> but just > > "or together" input masks for regular, multiple-input

functions,

...
> and let > > the underlying arrays skip elements for reductions by using `where` > > (hey, I did implement that for a reason... ;-). In particular, it > > suggests one should not have things like domains and all that (I never > > understood why `MaskedArray` did that). If one wants more, the class > > should provide a method that updates the mask (a sensible default > might > > be `mask |= ~np.isfinite(result)` - here, the class being

masked

...
> should > > logically support ufuncs and functions, so it can decide what > "isfinite" > > means). > > I agree it would be nice to remove domains. It would make life easier, > and I could remove a lot of twiddly code! I kept it in for now

to

...
> minimize the behavior changes from the old MaskedArray. > > > That makes sense. Could be separated out to a

backwards-compatibility

...
> class later. > > > > In any case, I would think that a basic truth should be that > everything > > has a mask with a shape consistent with the data, so > > 1. Each complex numbers has just one mask, and setting `a.imag` with a > > masked array should definitely propagate the mask. > > 2. For a masked array with structured dtype, I'd similarly

say

...
> that the > > default is for a mask to have the same shape as the array. But that > > something like your collection makes sense for the case where one > wants > > to mask items in a structure. > > Agreed that we should have a single bool per complex or

structured

...
> element, and the mask shape is the same as the array shape. That's how I > implemented it. But there is still a problem with complex.imag > assignment: > > >>> a = MaskedArray([1j, 2, X]) > >>> i = a.imag > >>> i[:] = MaskedArray([1, X, 1]) > > If we make the last line copy the mask to the original array,

what

...
> should the real part of a[2] be? Conversely, if we don't copy the mask, > what should the imag part of a[1] be? It seems like we might "want" the > masks to be OR'd instead, but then should i[2] be masked after we just > set it to 1? > > Ah, I see the issue now... Easiest to implement and closest in

analogy

...
> to a regular view would be to just let it unmask a[2] (with whatever is > in real; user beware!). > > Perhaps better would be to special-case such that `imag` returns a > read-only view of the mask. Making `imag` itself read-only would prevent > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is > no reason this should update the mask. > > Still, neither is really satisfactory... > > > > > p.s. I started trying to implement the above "Mixin" class;

will

...
> try to > > clean that up a bit so that at least it uses `where` and push it up. > > I played with "where", but didn't include it since 1.17 is not released. > To avoid duplication of effort, I've attached a diff of what I tried. I > actually get a slight slowdown of about 10% by using where... > > > Your implementation is indeed quite similar to what I got in > __array_ufunc__ (though one should "&" the where with ~mask). > > I think the main benefit is not to presume that whatever is

underneath

...
> understands 0 or 1, i.e., avoid filling. > > > If you make progress with the mixin, a push is welcome. I imagine a > problem is going to be that np.isscalar doesn't work to detect duck > scalars. > > I fear that in my attempts I've simply decided that only array

scalars

...
> exist... > > -- Marten > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Allan Haldane

10:54 a.m.

On 6/21/19 2:37 PM, Benjamin Root wrote:

...

Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers! Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to. I'd like to try to support reasonable use-cases like yours though. A few thoughts: First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support. Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do >>> result = MaskedArray(data.copy(), trial_mask).sum() instead of >>> marr.mask = trial_mask >>> result = marr.sum() since they have similar performance? Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs. In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind. Best, Allan

...

On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > Hi Allan, > > This is very impressive! I could get the tests that I wrote for my class > pass with yours using Quantity with what I would consider very minimal > changes. I only could not find a good way to unmask data (I like the > idea of setting the mask on some elements via `ma[item] = X`); is this > on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

> Anyway, it would seem easily at the point where I should comment on your > repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

> > All the best, > > Marten > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > <snip> > > > > > This may be too much to ask from the initializer, but, if > so, it still > > > seems most useful if it is made as easy as possible to do, > say, `class > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > Currently MaskedArray does not accept ducktypes as underlying > arrays, > > but I think it shouldn't be too hard to modify it to do so. > Good idea! > > > > > > Looking back at my trial, I see that I also never got to duck arrays - > > only ndarray subclasses - though I tried to make the code as > agnostic as > > possible. > > > > (Trial at > > > https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...) > > > > I already partly navigated this mixin-issue in the > > "MaskedArrayCollection" class, which essentially does > > ArrayCollection(MaskedArray(array)), and only takes about 30 > lines of > > boilerplate. That's the backwards encapsulation order from > what you want > > though. > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > > mask=[True, False, False])` does indeed not have a `.unit` attribute > > (and cannot represent itself...); I'm not at all sure that my > method of > > just creating a mixed class is anything but a recipe for disaster, > though! > > Based on your suggestion I worked on this a little today, and now my > MaskedArray more easily encapsulates both ducktypes and ndarray > subclasses (pushed to repo). Here's an example I got working with masked > units using unyt: > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > [2]: from unyt import m, km > > [3]: import numpy as np > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > [5]: uarr > > MaskedArray([1., X , 3.]) > [6]: uarr + 1*m > > MaskedArray([1.001, X , 3.001]) > [7]: uarr.filled() > > unyt_array([1., 0., 3.], 'km') > [8]: np.concatenate([uarr, 2*uarr]).filled() > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > The catch is the ducktype/subclass has to rigorously follow numpy's > indexing rules, including distinguishing 0d arrays from scalars. For now > only I used unyt in the example above since it happens to be less strict > about dimensionless operations than astropy.units which trips up my > repr code. (see below for example with astropy.units). Note in the last > line I lost the dimensions, but that is because unyt does not handle > np.concatenate. To get that to work we need a true ducktype for units. > > The example above doesn't expose the ".units" attribute outside the > MaskedArray, and it doesn't print the units in the repr. But you can > access them using "filled". > > While I could make MaskedArray forward unknown attribute accesses to the > encapsulated array, that seems a bit dangerous/bug-prone at first > glance, so probably I want to require the user to make a MaskedArray > subclass to do so. I've just started playing with that (probably buggy), > and Ive attached subclass examples for astropy.unit and unyt, with some > example output below. > > Cheers, > Allan > > > > Example using the attached astropy unit subclass: > > >>> from astropy.units import m, km, s > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > >>> uarr > MaskedQ([1., X , 1.], units=km) > >>> uarr.units > km > >>> uarr + (1*m) > MaskedQ([1.001, X , 1.001], units=km) > >>> uarr/(1*s) > MaskedQ([1., X , 1.], units=km / s) > >>> (uarr*(1*m))[1:] > MaskedQ([X , 1.], units=km m) > >>> np.add.outer(uarr, uarr) > MaskedQ([[2., X , 2.], > [X , X , X ], > [2., X , 2.]], units=km) > >>> print(uarr) > [1. X 1.] km m > > Cheers, > Allan > > > > > Even if this impossible, I think it is conceptually useful > to think > > > about what the masking class should do. My sense is that, > e.g., it > > > should not attempt to decide when an operation succeeds or not, > > but just > > > "or together" input masks for regular, multiple-input functions, > > and let > > > the underlying arrays skip elements for reductions by using > `where` > > > (hey, I did implement that for a reason... ;-). In > particular, it > > > suggests one should not have things like domains and all > that (I never > > > understood why `MaskedArray` did that). If one wants more, > the class > > > should provide a method that updates the mask (a sensible > default > > might > > > be `mask |= ~np.isfinite(result)` - here, the class being masked > > should > > > logically support ufuncs and functions, so it can decide what > > "isfinite" > > > means). > > > > I agree it would be nice to remove domains. It would make life > easier, > > and I could remove a lot of twiddly code! I kept it in for now to > > minimize the behavior changes from the old MaskedArray. > > > > > > That makes sense. Could be separated out to a backwards-compatibility > > class later. > > > > > > > In any case, I would think that a basic truth should be that > > everything > > > has a mask with a shape consistent with the data, so > > > 1. Each complex numbers has just one mask, and setting > `a.imag` with a > > > masked array should definitely propagate the mask. > > > 2. For a masked array with structured dtype, I'd similarly say > > that the > > > default is for a mask to have the same shape as the array. > But that > > > something like your collection makes sense for the case > where one > > wants > > > to mask items in a structure. > > > > Agreed that we should have a single bool per complex or structured > > element, and the mask shape is the same as the array shape. > That's how I > > implemented it. But there is still a problem with complex.imag > > assignment: > > > > >>> a = MaskedArray([1j, 2, X]) > > >>> i = a.imag > > >>> i[:] = MaskedArray([1, X, 1]) > > > > If we make the last line copy the mask to the original array, what > > should the real part of a[2] be? Conversely, if we don't copy > the mask, > > what should the imag part of a[1] be? It seems like we might > "want" the > > masks to be OR'd instead, but then should i[2] be masked after > we just > > set it to 1? > > > > Ah, I see the issue now... Easiest to implement and closest in analogy > > to a regular view would be to just let it unmask a[2] (with > whatever is > > in real; user beware!). > > > > Perhaps better would be to special-case such that `imag` returns a > > read-only view of the mask. Making `imag` itself read-only would > prevent > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but > there is > > no reason this should update the mask. > > > > Still, neither is really satisfactory... > > > > > > > > > p.s. I started trying to implement the above "Mixin" class; will > > try to > > > clean that up a bit so that at least it uses `where` and > push it up. > > > > I played with "where", but didn't include it since 1.17 is not > released. > > To avoid duplication of effort, I've attached a diff of what I > tried. I > > actually get a slight slowdown of about 10% by using where... > > > > > > Your implementation is indeed quite similar to what I got in > > __array_ufunc__ (though one should "&" the where with ~mask). > > > > I think the main benefit is not to presume that whatever is underneath > > understands 0 or 1, i.e., avoid filling. > > > > > > If you make progress with the mixin, a push is welcome. I > imagine a > > problem is going to be that np.isscalar doesn't work to detect > duck > > scalars. > > > > I fear that in my attempts I've simply decided that only array scalars > > exist... > > > > -- Marten > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

11:50 a.m.

Hi Allan, I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work. I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?). With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user. As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not. All the best, Marten On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com> wrote:

...

On 6/21/19 2:37 PM, Benjamin Root wrote:

...
Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers! Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do

>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs.

In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind.

Best, Allan

...
On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > Hi Allan, > > This is very impressive! I could get the tests that I wrote for my class > pass with yours using Quantity with what I would consider very minimal > changes. I only could not find a good way to unmask data (I like the > idea of setting the mask on some elements via `ma[item] = X`); is this > on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

> Anyway, it would seem easily at the point where I should comment on your > repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

> > All the best, > > Marten > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > <snip> > > > > > This may be too much to ask from the initializer, but, if > so, it still > > > seems most useful if it is made as easy as possible to do, > say, `class > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > Currently MaskedArray does not accept ducktypes as underlying > arrays, > > but I think it shouldn't be too hard to modify it to do so. > Good idea! > > > > > > Looking back at my trial, I see that I also never got to duck arrays - > > only ndarray subclasses - though I tried to make the code as > agnostic as > > possible. > > > > (Trial at > > >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > > > I already partly navigated this mixin-issue in the > > "MaskedArrayCollection" class, which essentially does > > ArrayCollection(MaskedArray(array)), and only takes

about 30

...
> lines of > > boilerplate. That's the backwards encapsulation order

from

...
> what you want > > though. > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > > mask=[True, False, False])` does indeed not have a `.unit` attribute > > (and cannot represent itself...); I'm not at all sure that my > method of > > just creating a mixed class is anything but a recipe for disaster, > though! > > Based on your suggestion I worked on this a little today, and now my > MaskedArray more easily encapsulates both ducktypes and ndarray > subclasses (pushed to repo). Here's an example I got working with masked > units using unyt: > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > [2]: from unyt import m, km > > [3]: import numpy as np > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > [5]: uarr > > MaskedArray([1., X , 3.]) > [6]: uarr + 1*m > > MaskedArray([1.001, X , 3.001]) > [7]: uarr.filled() > > unyt_array([1., 0., 3.], 'km') > [8]: np.concatenate([uarr, 2*uarr]).filled() > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > The catch is the ducktype/subclass has to rigorously follow numpy's > indexing rules, including distinguishing 0d arrays from scalars. For now > only I used unyt in the example above since it happens to be less strict > about dimensionless operations than astropy.units which trips up my > repr code. (see below for example with astropy.units). Note in the last > line I lost the dimensions, but that is because unyt does not handle > np.concatenate. To get that to work we need a true ducktype for units. > > The example above doesn't expose the ".units" attribute outside the > MaskedArray, and it doesn't print the units in the repr. But you can > access them using "filled". > > While I could make MaskedArray forward unknown attribute accesses to the > encapsulated array, that seems a bit dangerous/bug-prone at

first

...
> glance, so probably I want to require the user to make a MaskedArray > subclass to do so. I've just started playing with that (probably buggy), > and Ive attached subclass examples for astropy.unit and unyt, with some > example output below. > > Cheers, > Allan > > > > Example using the attached astropy unit subclass: > > >>> from astropy.units import m, km, s > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > >>> uarr > MaskedQ([1., X , 1.], units=km) > >>> uarr.units > km > >>> uarr + (1*m) > MaskedQ([1.001, X , 1.001], units=km) > >>> uarr/(1*s) > MaskedQ([1., X , 1.], units=km / s) > >>> (uarr*(1*m))[1:] > MaskedQ([X , 1.], units=km m) > >>> np.add.outer(uarr, uarr) > MaskedQ([[2., X , 2.], > [X , X , X ], > [2., X , 2.]], units=km) > >>> print(uarr) > [1. X 1.] km m > > Cheers, > Allan > > > > > Even if this impossible, I think it is conceptually

useful

...
> to think > > > about what the masking class should do. My sense is

that,

...
> e.g., it > > > should not attempt to decide when an operation succeeds or not, > > but just > > > "or together" input masks for regular, multiple-input functions, > > and let > > > the underlying arrays skip elements for reductions by using > `where` > > > (hey, I did implement that for a reason... ;-). In > particular, it > > > suggests one should not have things like domains and

all

...
> that (I never > > > understood why `MaskedArray` did that). If one wants

more,

...
> the class > > > should provide a method that updates the mask (a

sensible

...
> default > > might > > > be `mask |= ~np.isfinite(result)` - here, the class being masked > > should > > > logically support ufuncs and functions, so it can decide what > > "isfinite" > > > means). > > > > I agree it would be nice to remove domains. It would make life > easier, > > and I could remove a lot of twiddly code! I kept it in for now to > > minimize the behavior changes from the old MaskedArray. > > > > > > That makes sense. Could be separated out to a backwards-compatibility > > class later. > > > > > > > In any case, I would think that a basic truth should be that > > everything > > > has a mask with a shape consistent with the data, so > > > 1. Each complex numbers has just one mask, and setting > `a.imag` with a > > > masked array should definitely propagate the mask. > > > 2. For a masked array with structured dtype, I'd similarly say > > that the > > > default is for a mask to have the same shape as the

array.

...
> But that > > > something like your collection makes sense for the case > where one > > wants > > > to mask items in a structure. > > > > Agreed that we should have a single bool per complex or structured > > element, and the mask shape is the same as the array

shape.

...
> That's how I > > implemented it. But there is still a problem with complex.imag > > assignment: > > > > >>> a = MaskedArray([1j, 2, X]) > > >>> i = a.imag > > >>> i[:] = MaskedArray([1, X, 1]) > > > > If we make the last line copy the mask to the original array, what > > should the real part of a[2] be? Conversely, if we don't copy > the mask, > > what should the imag part of a[1] be? It seems like we

might

...
> "want" the > > masks to be OR'd instead, but then should i[2] be masked after > we just > > set it to 1? > > > > Ah, I see the issue now... Easiest to implement and closest in analogy > > to a regular view would be to just let it unmask a[2] (with > whatever is > > in real; user beware!). > > > > Perhaps better would be to special-case such that `imag` returns a > > read-only view of the mask. Making `imag` itself read-only

would

...
> prevent > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -

but

...
> there is > > no reason this should update the mask. > > > > Still, neither is really satisfactory... > > > > > > > > > p.s. I started trying to implement the above "Mixin" class; will > > try to > > > clean that up a bit so that at least it uses `where`

and

...
> push it up. > > > > I played with "where", but didn't include it since 1.17 is not > released. > > To avoid duplication of effort, I've attached a diff of what I > tried. I > > actually get a slight slowdown of about 10% by using where... > > > > > > Your implementation is indeed quite similar to what I got in > > __array_ufunc__ (though one should "&" the where with ~mask). > > > > I think the main benefit is not to presume that whatever is underneath > > understands 0 or 1, i.e., avoid filling. > > > > > > If you make progress with the mixin, a push is welcome. I > imagine a > > problem is going to be that np.isscalar doesn't work to detect > duck > > scalars. > > > > I fear that in my attempts I've simply decided that only array scalars > > exist... > > > > -- Marten > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Benjamin Root

9:21 p.m.

"""Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c""" My use-cases don't involve changing the mask of "c". It would involve changing the mask of "a" or "b" after I have calculated "c", so that I could calculate "d". As a fairly simple example, I frequently work with satellite data. We have multiple masks, such as water, vegetation, sandy loam, bare rock, etc. The underlying satellite data in any of these places isn't bad, they just need to be dealt with differently. I wouldn't want the act of applying a mask for a set of calculations on things that aren't bare rock to mess up my subsequent calculation on things that aren't water. Right now, I have to handle this explicitly with flattened sparse arrays, which makes visualization and conception difficult. Ben Root On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com> wrote:

...
On 6/21/19 2:37 PM, Benjamin Root wrote:

...
Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers! Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do

>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs.

In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind.

Best, Allan

...
On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > Hi Allan, > > This is very impressive! I could get the tests that I wrote for my class > pass with yours using Quantity with what I would consider very minimal > changes. I only could not find a good way to unmask data (I like the > idea of setting the mask on some elements via `ma[item] = X`); is this > on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

> Anyway, it would seem easily at the point where I should comment on your > repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

> > All the best, > > Marten > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > <snip> > > > > > This may be too much to ask from the initializer, but, if > so, it still > > > seems most useful if it is made as easy as possible to do, > say, `class > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > Currently MaskedArray does not accept ducktypes as underlying > arrays, > > but I think it shouldn't be too hard to modify it to do so. > Good idea! > > > > > > Looking back at my trial, I see that I also never got to duck arrays - > > only ndarray subclasses - though I tried to make the code as > agnostic as > > possible. > > > > (Trial at > > >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > > > I already partly navigated this mixin-issue in the > > "MaskedArrayCollection" class, which essentially does > > ArrayCollection(MaskedArray(array)), and only takes

about 30

...
> lines of > > boilerplate. That's the backwards encapsulation order

from

...
> what you want > > though. > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > > mask=[True, False, False])` does indeed not have a `.unit` attribute > > (and cannot represent itself...); I'm not at all sure that

my

...
> method of > > just creating a mixed class is anything but a recipe for disaster, > though! > > Based on your suggestion I worked on this a little today, and now my > MaskedArray more easily encapsulates both ducktypes and

ndarray

...
> subclasses (pushed to repo). Here's an example I got working with masked > units using unyt: > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > [2]: from unyt import m, km > > [3]: import numpy as np > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > [5]: uarr > > MaskedArray([1., X , 3.]) > [6]: uarr + 1*m > > MaskedArray([1.001, X , 3.001]) > [7]: uarr.filled() > > unyt_array([1., 0., 3.], 'km') > [8]: np.concatenate([uarr, 2*uarr]).filled() > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > The catch is the ducktype/subclass has to rigorously follow numpy's > indexing rules, including distinguishing 0d arrays from scalars. For now > only I used unyt in the example above since it happens to be less strict > about dimensionless operations than astropy.units which trips up my > repr code. (see below for example with astropy.units). Note in the last > line I lost the dimensions, but that is because unyt does not handle > np.concatenate. To get that to work we need a true ducktype for units. > > The example above doesn't expose the ".units" attribute outside the > MaskedArray, and it doesn't print the units in the repr. But you can > access them using "filled". > > While I could make MaskedArray forward unknown attribute accesses to the > encapsulated array, that seems a bit dangerous/bug-prone at

first

...
> glance, so probably I want to require the user to make a MaskedArray > subclass to do so. I've just started playing with that (probably buggy), > and Ive attached subclass examples for astropy.unit and unyt, with some > example output below. > > Cheers, > Allan > > > > Example using the attached astropy unit subclass: > > >>> from astropy.units import m, km, s > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > >>> uarr > MaskedQ([1., X , 1.], units=km) > >>> uarr.units > km > >>> uarr + (1*m) > MaskedQ([1.001, X , 1.001], units=km) > >>> uarr/(1*s) > MaskedQ([1., X , 1.], units=km / s) > >>> (uarr*(1*m))[1:] > MaskedQ([X , 1.], units=km m) > >>> np.add.outer(uarr, uarr) > MaskedQ([[2., X , 2.], > [X , X , X ], > [2., X , 2.]], units=km) > >>> print(uarr) > [1. X 1.] km m > > Cheers, > Allan > > > > > Even if this impossible, I think it is conceptually

useful

...
> to think > > > about what the masking class should do. My sense is

that,

...
> e.g., it > > > should not attempt to decide when an operation succeeds or not, > > but just > > > "or together" input masks for regular, multiple-input functions, > > and let > > > the underlying arrays skip elements for reductions by using > `where` > > > (hey, I did implement that for a reason... ;-). In > particular, it > > > suggests one should not have things like domains and

all

...
> that (I never > > > understood why `MaskedArray` did that). If one wants

more,

...
> the class > > > should provide a method that updates the mask (a

sensible

...
> default > > might > > > be `mask |= ~np.isfinite(result)` - here, the class being masked > > should > > > logically support ufuncs and functions, so it can decide what > > "isfinite" > > > means). > > > > I agree it would be nice to remove domains. It would make life > easier, > > and I could remove a lot of twiddly code! I kept it in for now to > > minimize the behavior changes from the old MaskedArray. > > > > > > That makes sense. Could be separated out to a backwards-compatibility > > class later. > > > > > > > In any case, I would think that a basic truth should be that > > everything > > > has a mask with a shape consistent with the data, so > > > 1. Each complex numbers has just one mask, and setting > `a.imag` with a > > > masked array should definitely propagate the mask. > > > 2. For a masked array with structured dtype, I'd similarly say > > that the > > > default is for a mask to have the same shape as the

array.

...
> But that > > > something like your collection makes sense for the

case

...
> where one > > wants > > > to mask items in a structure. > > > > Agreed that we should have a single bool per complex or structured > > element, and the mask shape is the same as the array

shape.

...
> That's how I > > implemented it. But there is still a problem with complex.imag > > assignment: > > > > >>> a = MaskedArray([1j, 2, X]) > > >>> i = a.imag > > >>> i[:] = MaskedArray([1, X, 1]) > > > > If we make the last line copy the mask to the original array, what > > should the real part of a[2] be? Conversely, if we don't copy > the mask, > > what should the imag part of a[1] be? It seems like we

might

...
> "want" the > > masks to be OR'd instead, but then should i[2] be masked after > we just > > set it to 1? > > > > Ah, I see the issue now... Easiest to implement and closest in analogy > > to a regular view would be to just let it unmask a[2] (with > whatever is > > in real; user beware!). > > > > Perhaps better would be to special-case such that `imag` returns a > > read-only view of the mask. Making `imag` itself read-only

would

...
> prevent > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -

but

...
> there is > > no reason this should update the mask. > > > > Still, neither is really satisfactory... > > > > > > > > > p.s. I started trying to implement the above "Mixin" class; will > > try to > > > clean that up a bit so that at least it uses `where`

and

...
> push it up. > > > > I played with "where", but didn't include it since 1.17 is not > released. > > To avoid duplication of effort, I've attached a diff of what I > tried. I > > actually get a slight slowdown of about 10% by using where... > > > > > > Your implementation is indeed quite similar to what I got in > > __array_ufunc__ (though one should "&" the where with

~mask).

...
> > > > I think the main benefit is not to presume that whatever is underneath > > understands 0 or 1, i.e., avoid filling. > > > > > > If you make progress with the mixin, a push is welcome.

I

...
> imagine a > > problem is going to be that np.isscalar doesn't work to detect > duck > > scalars. > > > > I fear that in my attempts I've simply decided that only array scalars > > exist... > > > > -- Marten > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Stephan Hoyer

3:03 a.m.

On Sat, Jun 22, 2019 at 6:50 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

+1, this sounds like the right model to me. That said, I would still not guarantee values under the mask as part of NumPy's API. The result of computations under the mask should be considered an undefined implementation detail, sort of like integer overflow or dict iteration order pre-Python 3.7. The values may even be entirely arbitrary, e.g., in cases where the result is preallocated with empty(). I'm less confident about the right way to handle missing elements in reductions. For example: - Should median() also skip missing elements, even though there is no identity element? - If reductions/aggregations default to skipping missing elements, how is it be possible to express "NA propagating" versions, which are also useful, if slightly less common? We may want to add a standard "skipna" argument on NumPy aggregations, solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere. With this mental picture, the underlying data are always have well-defined

...

meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com> wrote:

...
On 6/21/19 2:37 PM, Benjamin Root wrote:

...
Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers! Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do

>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs.

In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind.

Best, Allan

...
On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > Hi Allan, > > This is very impressive! I could get the tests that I wrote for my class > pass with yours using Quantity with what I would consider very minimal > changes. I only could not find a good way to unmask data (I like the > idea of setting the mask on some elements via `ma[item] = X`); is this > on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

> Anyway, it would seem easily at the point where I should comment on your > repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

> > All the best, > > Marten > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > <snip> > > > > > This may be too much to ask from the initializer, but, if > so, it still > > > seems most useful if it is made as easy as possible to do, > say, `class > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > Currently MaskedArray does not accept ducktypes as underlying > arrays, > > but I think it shouldn't be too hard to modify it to do so. > Good idea! > > > > > > Looking back at my trial, I see that I also never got to duck arrays - > > only ndarray subclasses - though I tried to make the code as > agnostic as > > possible. > > > > (Trial at > > >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > > > I already partly navigated this mixin-issue in the > > "MaskedArrayCollection" class, which essentially does > > ArrayCollection(MaskedArray(array)), and only takes

about 30

...
> lines of > > boilerplate. That's the backwards encapsulation order

from

...
> what you want > > though. > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > > mask=[True, False, False])` does indeed not have a `.unit` attribute > > (and cannot represent itself...); I'm not at all sure that

my

...
> method of > > just creating a mixed class is anything but a recipe for disaster, > though! > > Based on your suggestion I worked on this a little today, and now my > MaskedArray more easily encapsulates both ducktypes and

ndarray

...
> subclasses (pushed to repo). Here's an example I got working with masked > units using unyt: > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > [2]: from unyt import m, km > > [3]: import numpy as np > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > [5]: uarr > > MaskedArray([1., X , 3.]) > [6]: uarr + 1*m > > MaskedArray([1.001, X , 3.001]) > [7]: uarr.filled() > > unyt_array([1., 0., 3.], 'km') > [8]: np.concatenate([uarr, 2*uarr]).filled() > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > The catch is the ducktype/subclass has to rigorously follow numpy's > indexing rules, including distinguishing 0d arrays from scalars. For now > only I used unyt in the example above since it happens to be less strict > about dimensionless operations than astropy.units which trips up my > repr code. (see below for example with astropy.units). Note in the last > line I lost the dimensions, but that is because unyt does not handle > np.concatenate. To get that to work we need a true ducktype for units. > > The example above doesn't expose the ".units" attribute outside the > MaskedArray, and it doesn't print the units in the repr. But you can > access them using "filled". > > While I could make MaskedArray forward unknown attribute accesses to the > encapsulated array, that seems a bit dangerous/bug-prone at

first

...
> glance, so probably I want to require the user to make a MaskedArray > subclass to do so. I've just started playing with that (probably buggy), > and Ive attached subclass examples for astropy.unit and unyt, with some > example output below. > > Cheers, > Allan > > > > Example using the attached astropy unit subclass: > > >>> from astropy.units import m, km, s > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > >>> uarr > MaskedQ([1., X , 1.], units=km) > >>> uarr.units > km > >>> uarr + (1*m) > MaskedQ([1.001, X , 1.001], units=km) > >>> uarr/(1*s) > MaskedQ([1., X , 1.], units=km / s) > >>> (uarr*(1*m))[1:] > MaskedQ([X , 1.], units=km m) > >>> np.add.outer(uarr, uarr) > MaskedQ([[2., X , 2.], > [X , X , X ], > [2., X , 2.]], units=km) > >>> print(uarr) > [1. X 1.] km m > > Cheers, > Allan > > > > > Even if this impossible, I think it is conceptually

useful

...
> to think > > > about what the masking class should do. My sense is

that,

...
> e.g., it > > > should not attempt to decide when an operation succeeds or not, > > but just > > > "or together" input masks for regular, multiple-input functions, > > and let > > > the underlying arrays skip elements for reductions by using > `where` > > > (hey, I did implement that for a reason... ;-). In > particular, it > > > suggests one should not have things like domains and

all

...
> that (I never > > > understood why `MaskedArray` did that). If one wants

more,

...
> the class > > > should provide a method that updates the mask (a

sensible

...
> default > > might > > > be `mask |= ~np.isfinite(result)` - here, the class being masked > > should > > > logically support ufuncs and functions, so it can decide what > > "isfinite" > > > means). > > > > I agree it would be nice to remove domains. It would make life > easier, > > and I could remove a lot of twiddly code! I kept it in for now to > > minimize the behavior changes from the old MaskedArray. > > > > > > That makes sense. Could be separated out to a backwards-compatibility > > class later. > > > > > > > In any case, I would think that a basic truth should be that > > everything > > > has a mask with a shape consistent with the data, so > > > 1. Each complex numbers has just one mask, and setting > `a.imag` with a > > > masked array should definitely propagate the mask. > > > 2. For a masked array with structured dtype, I'd similarly say > > that the > > > default is for a mask to have the same shape as the

array.

...
> But that > > > something like your collection makes sense for the

case

...
> where one > > wants > > > to mask items in a structure. > > > > Agreed that we should have a single bool per complex or structured > > element, and the mask shape is the same as the array

shape.

...
> That's how I > > implemented it. But there is still a problem with complex.imag > > assignment: > > > > >>> a = MaskedArray([1j, 2, X]) > > >>> i = a.imag > > >>> i[:] = MaskedArray([1, X, 1]) > > > > If we make the last line copy the mask to the original array, what > > should the real part of a[2] be? Conversely, if we don't copy > the mask, > > what should the imag part of a[1] be? It seems like we

might

...
> "want" the > > masks to be OR'd instead, but then should i[2] be masked after > we just > > set it to 1? > > > > Ah, I see the issue now... Easiest to implement and closest in analogy > > to a regular view would be to just let it unmask a[2] (with > whatever is > > in real; user beware!). > > > > Perhaps better would be to special-case such that `imag` returns a > > read-only view of the mask. Making `imag` itself read-only

would

...
> prevent > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -

but

...
> there is > > no reason this should update the mask. > > > > Still, neither is really satisfactory... > > > > > > > > > p.s. I started trying to implement the above "Mixin" class; will > > try to > > > clean that up a bit so that at least it uses `where`

and

...
> push it up. > > > > I played with "where", but didn't include it since 1.17 is not > released. > > To avoid duplication of effort, I've attached a diff of what I > tried. I > > actually get a slight slowdown of about 10% by using where... > > > > > > Your implementation is indeed quite similar to what I got in > > __array_ufunc__ (though one should "&" the where with

~mask).

...
> > > > I think the main benefit is not to presume that whatever is underneath > > understands 0 or 1, i.e., avoid filling. > > > > > > If you make progress with the mixin, a push is welcome.

I

...
> imagine a > > problem is going to be that np.isscalar doesn't work to detect > duck > > scalars. > > > > I fear that in my attempts I've simply decided that only array scalars > > exist... > > > > -- Marten > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

9:07 a.m.

...

I think a sensible alternative mental model for the MaskedArray class is

...
that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

+1, this sounds like the right model to me.

One small worry is about name clashes - ideally one wants the masked array to be somewhat of a drop-in for whatever class it is masking (independently of whether it is an actual subclass of it). In this respect, `.data` is a pretty terrible name (and, yes, the cause of lots of problems for astropy's MaskedColumn - not my fault, that one!). In my own trials, thinking that names that include "mask" are fair game, I've been considering a function `.unmask(fill_value=None)` which would replace both `.filled(fill_value)` and `.data` by having the default be not to fill anything (I don't see why a masked array should carry a fill value along; one might use specific strings such as 'minmax' for auto-generated cases). If wanted, one could then add `unmasked = property(unmask)`. Aside: my sense is to, at first at least, feel as unbound as possible from the current MaskedArray - one could then use whatever it is to try to create something that comes close to reproducing it, but only for ease of transition.

...

That said, I would still not guarantee values under the mask as part of NumPy's API. The result of computations under the mask should be considered an undefined implementation detail, sort of like integer overflow or dict iteration order pre-Python 3.7. The values may even be entirely arbitrary, e.g., in cases where the result is preallocated with empty().

I think that is reasonable. The use cases Ben and I described both are ones where the array is being used as input for a set of computations which differ only in their mask. (Admittedly, in both our cases one could just reinitialize a masked array with the new mask; but I think we share the mental model of that if I don't operate on the masked array, the data doesn't change, so I should just be able to change the mask.)

...

I'm less confident about the right way to handle missing elements in reductions. For example: - Should median() also skip missing elements, even though there is no identity element?

I think so. If for mean(), std(), etc., the number of unmasked elements comes into play, I don't see why it wouldn't for median(). - If reductions/aggregations default to skipping missing elements, how is

...

it be possible to express "NA propagating" versions, which are also useful, if slightly less common?

I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`. A larger issue is the accumulations. Personally, I think those are basically meaningless for masked arrays, as to me logically the result on the position of any masked item should be masked. But, if so, I do not see how the ones "beyond" it could not be masked as well. Since here the right answers seems at least unclear, my sense would be to refuse the temptation to guess (i.e., the user should just explicitly fill with ufunc.identity if this is the right thing to do in their case). I should add that I'm slightly torn about a similar, somewhat related issue: what should `np.minimum(a, b)` do for the case where either a or b is masked? Currently, one just treats this as a bin-op, so the result is masked, but one could argue that this ufunc is a bit like a 2-element reduction, and thus that the unmasked item should "win by default". Possibly, the answer should be different between `np.minimum` and `np.fmin` (since the two differ in how they propagate `NaN` as well - note that you don't include `fmin` and `fmax` in your coverage). We may want to add a standard "skipna" argument on NumPy aggregations,

...

solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.

It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable, which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful). All the best, Marten

Stephan Hoyer

12:05 p.m.

On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

- If reductions/aggregations default to skipping missing elements, how is

...
it be possible to express "NA propagating" versions, which are also useful, if slightly less common?

I have been playing with using a new `Mask(np.ndarray)` class for the mask, which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`.

I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc. Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

...

We may want to add a standard "skipna" argument on NumPy aggregations,

...
solely for the benefit of duck arrays (and dtypes with missing values). But that could also be a source of confusion, especially if skipna=True refers only "true NA" values, not including NaN, which is used as an alias for NA in pandas and elsewhere.

It does seem `where` should suffice, no? If one wants to be super-fancy, we could allow it to be a callable, which, if a ufunc, gets used inside the loop (`where=np.isfinite` would be particularly useful).

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get: 1. The sum ignoring masked values, i.e., 3. 2. The sum that is tainted by masked values, i.e., NA. Here's how this works with existing array libraries: - With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.sum and (2) with np.nansum. - With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False). - With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought? Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

Marten van Kerkwijk

4:54 p.m.

Hi Stephan, In slightly changed order: Let me try to make the API issue more concrete. Suppose we have a

...

MaskedArray with values [1, 2, NA]. How do I get: 1. The sum ignoring masked values, i.e., 3. 2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries: - With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.sum and (2) with np.nansum. - With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False). - With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?

I think this is all correct.

...

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on

the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask. I have been playing with using a new `Mask(np.ndarray)` class for the mask,

...

...
which does the actual mask propagation (i.e., all single-operand ufuncs just copy the mask, binary operations do `logical_or` and reductions do `logical.and.reduce`). This way the `Masked` class itself can generally apply a given operation on the data and the masks separately and then combine the two results (reductions are the exception in that `where` has to be set). Your particular example here could be solved with a different `Mask` class, for which reductions do `logical.or.reduce`.

I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.

Implicitly in the above, I agree with having the mask not necessarily be a plain ndarray, but something that can determine part of the action. Makes sense to generalize that to duck arrays for the reasons you give. Indeed, if we let the mask do the mask propagation as well, it might help make the implementation substantially easier (e.g., `logical_and.reduce` and `logical_or.reduce` can be super-fast on a bitmask!).

...

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument). It may be good to collect a few more test cases... E.g., I'd like to mask some of the astropy classes that are only very partial duck arrays, in that they cover only the shape aspect, and which do have some operators and for which it would be nice not to feel forced to use __array_ufunc__. All the best, Marten

Stephan Hoyer

6:25 p.m.

On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

...
where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on

Your proposal would be something like np.sum(array, the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False). I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types. Are there use-cases for propagating masks separately from data? If not, it

...

...
might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument).

You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions. For example, compare these two implementations of add: def add1(x, y): return MaskedArray(x.data + y.data, x.mask | y.mask) def add2(x, y): return MaskedArray(x.data + y.data, x.mask + y.mask) The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects.

Eric Wieser

6:58 p.m.

I think we’d need to consider separately the operation on the mask and on the data. In my proposal, the data would always do np.sum(array, where=~mask), while how the mask would propagate might depend on the mask itself, I quite like this idea, and I think Stephan’s strawman design is actually plausible, where MaskedArray.mask is either an InvalidMask or a IgnoreMask instance to pick between the different propagation types. Both classes could simply have an underlying ._array attribute pointing to a duck-array of some kind that backs their boolean data. The second version requires that you *also* know how Mask classes work, and how they implement + I remain unconvinced that Mask classes should behave differently on different ufuncs. I don’t think np.minimum(ignore_na, b) is any different to np.add(ignore_na, b) - either both should produce b, or both should produce ignore_na. I would lean towards produxing ignore_na, and propagation behavior differing between “ignore” and “invalid” only for reduce / accumulate operations, where the concept of skipping an application is well-defined. Some possible follow-up questions that having two distinct masked types raise: - what if I want my data to support both invalid and skip fields at the same time? sum([invalid, skip, 1]) == invalid - is there a use case for more that these two types of mask? invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting things to track through a calculation, possibly a dictionary of named masks. Eric On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <shoyer@gmail.com> wrote:

...

On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...
Your proposal would be something like np.sum(array,

...
where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False).

I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types.

Are there use-cases for propagating masks separately from data? If not, it

...
...
might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at call with a masked index argument).

You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions.

For example, compare these two implementations of add:

def add1(x, y): return MaskedArray(x.data + y.data, x.mask | y.mask)

def add2(x, y): return MaskedArray(x.data + y.data, x.mask + y.mask)

The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

10:04 p.m.

Hi Stephan, Eric perhaps explained my concept better than I could! I do agree that, as written, your example would be clearer, but Allan's code and the current MaskedArray code do have not that much semblance to it, and mine even less, as they deal with operators as whole groups. For mine, it may be useful to tell that this quite possibly crazy idea came from considering how one would quite logically implement all shape operations, which effect the data and mask the same way, so one soon writes down something where (as in my ShapedLikeNDArray mixin; https://github.com/astropy/astropy/blob/master/astropy/utils/misc.py#L858) all methods pass on to `return self._apply(method, <arguments>), with `_apply` looking like: ``` def _apply(self, method, *args, **kwargs): # For use with NDArrayShapeMethods. data = getattr(self._data, method)(*args, **kwargs) mask = getattr(self._mask, method)(*args, **kwargs) return self.__class__(data, mask=mask) ``` (Or the same is auto-generated for all those methods.) Now considering this, for `__array_ufunc__`, one might similarly do (__call__ only, ignoring outputs) ``` def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): data = [] masks = [] for input_ in inputs: d, m = self._data_mask(input_) data.append(d) masks.append(m) result_mask = getattr(ufunc, method)(*masks, **mask_kwargs) if result_mask is NotImplemented: return NotImplemented result_data = getattr(ufunc, method)(*data, **kwargs) return self._masked_result(result_data, result_mask, out) ``` The beauty here is that the Masked class itself needs to know very little about how to do masking, or how to implement it; it is almost like a collection of arrays: the only thing that separates `__array_ufunc__` from `_apply` above is that, since other objects are involved, the data and mask have to be separated out for those. (And perhaps for people trying to change it, it actually helps that the special-casing of single-, double-, and triple-input ufuncs can all be done inside the mask class, and adjustments can be made there without having to understand the machinery for getting masks out of data, etc.?) But what brought me to this was not __array_ufunc__ itself, but rather the next step, that I really would like to have masked version of classes that are array-like in shape but do not generally work with numpy ufuncs or functions, and where I prefer not to force the use of numpy ufuncs. So, inside the masked class I have to override the operators. Obviously, there again I could do it with functions like you describe, but I can also just split off data and masks as above, and just call the same operator on both. Then, the code could basically use that of `__array_ufunc__` above when called as `self.__array_ufunc__(operator, '__add__', self, other)`. I think (but am not sure, not actually having tried it yet), that this would also make it easier to mostly auto-generated masked classes: those supported by both the mask and the data are relatively easy; those separate for the data need some hints from the data class (e.g., `.unit` on a `Quantity` is independent of the mask). Anyway, just to be clear, I am *not* sure that this is the right way forward. I think the more important really was your suggestion to take mask duck types into account a priori (for bit-packed ones, etc.). But it seems worth thinking it through now, so we don't repeat the things that ended up making `MaskedArray` quite annoying. All the best, Marten

Marten van Kerkwijk

10:27 p.m.

Hi Eric, On your other points: I remain unconvinced that Mask classes should behave differently on

...

different ufuncs. I don’t think np.minimum(ignore_na, b) is any different to np.add(ignore_na, b) - either both should produce b, or both should produce ignore_na. I would lean towards produxing ignore_na, and propagation behavior differing between “ignore” and “invalid” only for reduce / accumulate operations, where the concept of skipping an application is well-defined.

I think I mostly agree - this is really about reductions. And this fact that there are apparently only two choices weakens the case for pushing the logic into the mask class itself. But the one case that still tempts me to break with the strict rule for ufunc.__call__ is `fmin, fmax` vs `minimum, maximum`... What do you think?

...

Some possible follow-up questions that having two distinct masked types raise:

- what if I want my data to support both invalid and skip fields at the same time? sum([invalid, skip, 1]) == invalid

Have a triple-valued mask? Would be easy to implement if all the logic is in the mask...

(Indeed, for all I care it could implement weighting! That would actually care about the actual operation, so would be a real example. Though of course it also does need access to the actual data, so perhaps best not to go there...)

...

- is there a use case for more that these two types of mask? invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting things to track through a calculation, possibly a dictionary of named masks.

For astropy's NDData, there has been quite a bit of discussion of a

`Flags` object, which works exactly as you describe, an OR together of different reasons for why data is invalid (HST uses this, though the discussion was for the Large Synodic Survey Telescope data pipeline). Those flags are propagated like masks. I think in most cases, these examples would not require more than allowing the mask to be a duck type. Though perhaps for some reductions, it might matter what the reduction of the data is actually doing (e.g., `np.minimum.reduce` might need different rules than `np.add.reduce`). And, of course, one can argue that for such a case it might be best to subclass MaskedArray itself, and do the logic there. All the best, Marten p.s. For accumulations, I'm still not sure I find them well-defined. I could see that np.add.reduce([0, 1, 1, --, 3])` could lead to `[0, 1, 2, 5]`, i.e., a shorter sequence, but this doesn't work on arrays where different rows can have different numbers of masked elements. It then perhaps suggests `[0, 1, 2, --, 5]` is OK, but the annoyance I have is that there is nothing that tells me what the underlying data should be, i.e., this is truly different from having a `where` keyword in `np.add.reduce`. But perhaps it is that I do not see much use for accumulation beyond changing a histogram into its cumulative version - for which masked elements really make no sense - one somehow has to interpolate over, not just discount the masked one.

Allan Haldane

11:59 a.m.

On 6/23/19 6:58 PM, Eric Wieser wrote:

...

I think we’d need to consider separately the operation on the mask and on the data. In my proposal, the data would always do |np.sum(array, where=~mask)|, while how the mask would propagate might depend on the mask itself,

I quite like this idea, and I think Stephan’s strawman design is actually plausible, where |MaskedArray.mask| is either an |InvalidMask| or a |IgnoreMask| instance to pick between the different propagation types. Both classes could simply have an underlying |._array| attribute pointing to a duck-array of some kind that backs their boolean data.

The second version requires that you /also/ know how Mask classes work, and how they implement +

I remain unconvinced that Mask classes should behave differently on different ufuncs. I don’t think |np.minimum(ignore_na, b)| is any different to |np.add(ignore_na, b)| - either both should produce |b|, or both should produce |ignore_na|. I would lean towards produxing |ignore_na|, and propagation behavior differing between “ignore” and “invalid” only for |reduce| / |accumulate| operations, where the concept of skipping an application is well-defined.

Some possible follow-up questions that having two distinct masked types raise:

* what if I want my data to support both invalid and skip fields at the same time? sum([invalid, skip, 1]) == invalid * is there a use case for more that these two types of mask? |invalid_due_to_reason_A|, |invalid_due_to_reason_B| would be interesting things to track through a calculation, possibly a dictionary of named masks.

Eric

General comments on the last few emails: For now I intentionally decided not to worry about NA masks in my implementation. I want to get a first working implementation finished for IGNORE style. I agree it would be nice to have some support for NA style later, either by a new MaskedArray subclass, a ducktype'd .mask attribute, or by some other modification. In the latter category, consider that currently the mask is stored as a boolean (1 byte) mask. One idea I have not put much thought into is that internally we could make the mask a uint8, so unmasked would be "0" IGNORE mask would be 1, and NA mask would be 2. That allows mixing of mask types. Not sure how messy it would be to implement. For the idea of replacing the mask by a ducktype for NA style, my instinct is that would be tricky. Many of the MaskedArray __array_function__ method implementations, like sort and dot and many others, do "complicated" computations using the mask that I don't think you could easily get to work right by simply substituting a mask ducktype. I think those would need to be reimplemented for NA style, in other words you would need to make a MaskedArray subclass anyway. Cheers, Allan

...

On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer <shoyer@gmail.com <mailto:shoyer@gmail.com>> wrote:

On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com <mailto:m.h.vankerkwijk@gmail.com>> wrote:

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)

I think we'd need to consider separately the operation on the mask and on the data. In my proposal, the data would always do `np.sum(array, where=~mask)`, while how the mask would propagate might depend on the mask itself, i.e., we'd have different mask types for `skipna=True` (default) and `False` ("contagious") reductions, which differed in doing `logical_and.reduce` or `logical_or.reduce` on the mask.

OK, I think I finally understand what you're getting at. So suppose this this how we implement it internally. Would we really insist on a user creating a new MaskedArray with a new mask object, e.g., with a GreedyMask? We could add sugar for this, but certainly array.greedy_masked().sum() is significantly less clear than array.sum(skipna=False).

I'm also a little concerned about a proliferation of MaskedArray/Mask types. New types are significantly harder to understand than new functions (or new arguments on existing functions). I don't know if we have enough distinct use cases for this many types.

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

I had only thought about separating out the concern of mask propagation from the "MaskedArray" class to the mask proper, but it might indeed make things easier if the mask also did any required preparation for passing things on to the data (such as adjusting the "where" argument in a reduction). I also like that this way the mask can determine even before the data what functionality is available (i.e., it could be the place from which to return `NotImplemented` for a ufunc.at <http://ufunc.at> call with a masked index argument).

You're going to have to come up with something more compelling than "separation of concerns" to convince me that this extra Mask abstraction is worthwhile. On its own, I think a separate Mask class would only obfuscate MaskedArray functions.

For example, compare these two implementations of add:

def add1(x, y): return MaskedArray(x.data + y.data, x.mask | y.mask)

def add2(x, y): return MaskedArray(x.data + y.data, x.mask + y.mask)

The second version requires that you *also* know how Mask classes work, and how they implement +. So now you need to look in at least twice as many places to understand add() for MaskedArray objects. _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Aldcroft, Thomas

9:52 a.m.

On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask,

I'm generally on-board with that mental picture, and agree that the use-case described by Ben (different layers of satellite imagery) is important. Same thing happens in astronomy data, e.g. you have a CCD image of the sky and there are cosmic rays that contaminate the image. Those are not garbage data, just pixels that one wants to ignore in some, but not all, contexts. However, it's worth noting that one cannot blindly forward any operations to the data it holds since the operation may be illegal on that data. The simplest example is dividing `a / b` where `b` has data values of 0 but they are masked. That operation should succeed with no exception, and here the resultant value under the mask is genuinely garbage. The current MaskedArray seems a bit inconsistent in dealing with invalid calcuations. Dividing by 0 (if masked) is no problem and returns the numerator. Taking the log of a masked 0 gives the usual divide by zero RuntimeWarning and puts a 1.0 under the mask of the output. Perhaps the expression should not even be evaluated on elements where the output mask is True, and all the masked output data values should be set to a predictable value (e.g. zero for numerical, zero-length string for string, or maybe a default fill value). That at least provides consistent and predictable behavior that is simple to explain. Otherwise the story is that the data under the mask *might* be OK, unless for a particular element the computation was invalid in which case it is filled with some arbitrary value. I think that is actually an error-prone behavior that should be avoided. - Tom

...

ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com> wrote:

...
On 6/21/19 2:37 PM, Benjamin Root wrote:

...
Just to note, data that is masked isn't always garbage. There are plenty of use-cases where one may want to temporarily apply a mask for a set of computation, or possibly want to apply a series of different masks to the data. I haven't read through this discussion deeply enough, but is this new class going to destroy underlying masked data? and will it be possible to swap out masks?

Cheers! Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do

>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs.

In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind.

Best, Allan

...
On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > Hi Allan, > > This is very impressive! I could get the tests that I wrote for my class > pass with yours using Quantity with what I would consider very minimal > changes. I only could not find a good way to unmask data (I like the > idea of setting the mask on some elements via `ma[item] = X`); is this > on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

> Anyway, it would seem easily at the point where I should comment on your > repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

Best, Allan

> > All the best, > > Marten > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > <snip> > > > > > This may be too much to ask from the initializer, but, if > so, it still > > > seems most useful if it is made as easy as possible to do, > say, `class > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > Currently MaskedArray does not accept ducktypes as underlying > arrays, > > but I think it shouldn't be too hard to modify it to do so. > Good idea! > > > > > > Looking back at my trial, I see that I also never got to duck arrays - > > only ndarray subclasses - though I tried to make the code as > agnostic as > > possible. > > > > (Trial at > > >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > > > I already partly navigated this mixin-issue in the > > "MaskedArrayCollection" class, which essentially does > > ArrayCollection(MaskedArray(array)), and only takes

about 30

...
> lines of > > boilerplate. That's the backwards encapsulation order

from

...
> what you want > > though. > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > > mask=[True, False, False])` does indeed not have a `.unit` attribute > > (and cannot represent itself...); I'm not at all sure that

my

...
> method of > > just creating a mixed class is anything but a recipe for disaster, > though! > > Based on your suggestion I worked on this a little today, and now my > MaskedArray more easily encapsulates both ducktypes and

ndarray

...
> subclasses (pushed to repo). Here's an example I got working with masked > units using unyt: > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > [2]: from unyt import m, km > > [3]: import numpy as np > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > [5]: uarr > > MaskedArray([1., X , 3.]) > [6]: uarr + 1*m > > MaskedArray([1.001, X , 3.001]) > [7]: uarr.filled() > > unyt_array([1., 0., 3.], 'km') > [8]: np.concatenate([uarr, 2*uarr]).filled() > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > The catch is the ducktype/subclass has to rigorously follow numpy's > indexing rules, including distinguishing 0d arrays from scalars. For now > only I used unyt in the example above since it happens to be less strict > about dimensionless operations than astropy.units which trips up my > repr code. (see below for example with astropy.units). Note in the last > line I lost the dimensions, but that is because unyt does not handle > np.concatenate. To get that to work we need a true ducktype for units. > > The example above doesn't expose the ".units" attribute outside the > MaskedArray, and it doesn't print the units in the repr. But you can > access them using "filled". > > While I could make MaskedArray forward unknown attribute accesses to the > encapsulated array, that seems a bit dangerous/bug-prone at

first

...
> glance, so probably I want to require the user to make a MaskedArray > subclass to do so. I've just started playing with that (probably buggy), > and Ive attached subclass examples for astropy.unit and unyt, with some > example output below. > > Cheers, > Allan > > > > Example using the attached astropy unit subclass: > > >>> from astropy.units import m, km, s > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > >>> uarr > MaskedQ([1., X , 1.], units=km) > >>> uarr.units > km > >>> uarr + (1*m) > MaskedQ([1.001, X , 1.001], units=km) > >>> uarr/(1*s) > MaskedQ([1., X , 1.], units=km / s) > >>> (uarr*(1*m))[1:] > MaskedQ([X , 1.], units=km m) > >>> np.add.outer(uarr, uarr) > MaskedQ([[2., X , 2.], > [X , X , X ], > [2., X , 2.]], units=km) > >>> print(uarr) > [1. X 1.] km m > > Cheers, > Allan > > > > > Even if this impossible, I think it is conceptually

useful

...
> to think > > > about what the masking class should do. My sense is

that,

...
> e.g., it > > > should not attempt to decide when an operation succeeds or not, > > but just > > > "or together" input masks for regular, multiple-input functions, > > and let > > > the underlying arrays skip elements for reductions by using > `where` > > > (hey, I did implement that for a reason... ;-). In > particular, it > > > suggests one should not have things like domains and

all

...
> that (I never > > > understood why `MaskedArray` did that). If one wants

more,

...
> the class > > > should provide a method that updates the mask (a

sensible

...
> default > > might > > > be `mask |= ~np.isfinite(result)` - here, the class being masked > > should > > > logically support ufuncs and functions, so it can decide what > > "isfinite" > > > means). > > > > I agree it would be nice to remove domains. It would make life > easier, > > and I could remove a lot of twiddly code! I kept it in for now to > > minimize the behavior changes from the old MaskedArray. > > > > > > That makes sense. Could be separated out to a backwards-compatibility > > class later. > > > > > > > In any case, I would think that a basic truth should be that > > everything > > > has a mask with a shape consistent with the data, so > > > 1. Each complex numbers has just one mask, and setting > `a.imag` with a > > > masked array should definitely propagate the mask. > > > 2. For a masked array with structured dtype, I'd similarly say > > that the > > > default is for a mask to have the same shape as the

array.

...
> But that > > > something like your collection makes sense for the

case

...
> where one > > wants > > > to mask items in a structure. > > > > Agreed that we should have a single bool per complex or structured > > element, and the mask shape is the same as the array

shape.

...
> That's how I > > implemented it. But there is still a problem with complex.imag > > assignment: > > > > >>> a = MaskedArray([1j, 2, X]) > > >>> i = a.imag > > >>> i[:] = MaskedArray([1, X, 1]) > > > > If we make the last line copy the mask to the original array, what > > should the real part of a[2] be? Conversely, if we don't copy > the mask, > > what should the imag part of a[1] be? It seems like we

might

...
> "want" the > > masks to be OR'd instead, but then should i[2] be masked after > we just > > set it to 1? > > > > Ah, I see the issue now... Easiest to implement and closest in analogy > > to a regular view would be to just let it unmask a[2] (with > whatever is > > in real; user beware!). > > > > Perhaps better would be to special-case such that `imag` returns a > > read-only view of the mask. Making `imag` itself read-only

would

...
> prevent > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` -

but

...
> there is > > no reason this should update the mask. > > > > Still, neither is really satisfactory... > > > > > > > > > p.s. I started trying to implement the above "Mixin" class; will > > try to > > > clean that up a bit so that at least it uses `where`

and

...
> push it up. > > > > I played with "where", but didn't include it since 1.17 is not > released. > > To avoid duplication of effort, I've attached a diff of what I > tried. I > > actually get a slight slowdown of about 10% by using where... > > > > > > Your implementation is indeed quite similar to what I got in > > __array_ufunc__ (though one should "&" the where with

~mask).

...
> > > > I think the main benefit is not to presume that whatever is underneath > > understands 0 or 1, i.e., avoid filling. > > > > > > If you make progress with the mixin, a push is welcome.

I

...
> imagine a > > problem is going to be that np.isscalar doesn't work to detect > duck > > scalars. > > > > I fear that in my attempts I've simply decided that only array scalars > > exist... > > > > -- Marten > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

11:11 a.m.

Hi Tom, I think a sensible alternative mental model for the MaskedArray class is

...

...
that all it does is forward any operations to the data it holds and separately propagate a mask,

I'm generally on-board with that mental picture, and agree that the use-case described by Ben (different layers of satellite imagery) is important. Same thing happens in astronomy data, e.g. you have a CCD image of the sky and there are cosmic rays that contaminate the image. Those are not garbage data, just pixels that one wants to ignore in some, but not all, contexts.

However, it's worth noting that one cannot blindly forward any operations to the data it holds since the operation may be illegal on that data. The simplest example is dividing `a / b` where `b` has data values of 0 but they are masked. That operation should succeed with no exception, and here the resultant value under the mask is genuinely garbage.

Even in the present implementation, the operation is just forwarded, with numpy errstate set to ignore all errors. And then after the fact some half-hearted remediation is done.

...

The current MaskedArray seems a bit inconsistent in dealing with invalid calcuations. Dividing by 0 (if masked) is no problem and returns the numerator. Taking the log of a masked 0 gives the usual divide by zero RuntimeWarning and puts a 1.0 under the mask of the output.

Perhaps the expression should not even be evaluated on elements where the output mask is True, and all the masked output data values should be set to a predictable value (e.g. zero for numerical, zero-length string for string, or maybe a default fill value). That at least provides consistent and predictable behavior that is simple to explain. Otherwise the story is that the data under the mask *might* be OK, unless for a particular element the computation was invalid in which case it is filled with some arbitrary value. I think that is actually an error-prone behavior that should be avoided.

I think I agree with Allan here, that after a computation, one generally simply cannot safely assume anything for masked elements. But it is reasonable for subclasses to define what they want to do "post-operation"; e.g., for numerical arrays, it might make generally make sense to do ``` notok = ~np.isfinite(result) mask |= notok ``` and one could then also do ``` result[notok] = fill_value ``` But I think one might want to leave that to the user. All the best, Marten

Allan Haldane

11:46 a.m.

On 6/22/19 11:50 AM, Marten van Kerkwijk wrote:

...

Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string? > With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not.

All the best,

Marten

I want to distinguish 3 different behaviors we are discussing: 1. Making a "no-clobber" guarantee on the underlying data 2. whether the data under the mask should have "sensible" values 3. whether to allow "unmasking" 1. For "no clobber" =================== I agree this would be a nice guarantee to make, as long as it does not impose too much burden on us. Sometimes it is better not to chain ourselves down. By using "where", I can indeed make many api-functions and ufuncs guarantee no-clobber. There are still a bunch of functions that seem tricky to implement currently without either clobbering or making a copy: dot, cross, outer, sort/argsort/searchsorted, correlate, convolve, nonzero, and similar functions. I'll think about how to implement those in a no-clobber way. Any suggestions welcome, eg for np.dot/outer. A no-clobber guarantee makes your "iterative mask" example solvable in an efficient (no-copy) way: mask, last_mask = False while True: dat_mean = np.mean(MaskedArray(data, mask)) mask, last_mask = np.abs(data - mask) > cutoff, mask if np.all(mask == last_mask): break The MaskedArray constructor should have pretty minimal overhead. 2. Whether we can make "masked" data keep sensible values ========================================================= I'm not confident this is a good guarantee to make. Certainly in a simple case like c = a + b we can make masked values in c contain the correct sum of the masked data in a and b. But I foresee complications in other cases. For instance, MaskedArray([4,4,4])/MaskedArray([0, 1, 2], mask=[1,0,1]) If we use the "where" ufunc argument to evaluate the division operation, a division-by-0 warning is not output which is good. However, if we use "where" index 2 does not get updated correctly and will contain "nonsense". If we use "where" a lot (which I think we should) we can expect a lot of uninitialized masked values to commonly appear. So my interpretation is that this comment:

...

I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask

should only roughly be true, in the sense that we will not simply "forward any operations" but we will also use "where" arguments which produce nonsense masked values. 3. Whether to allow unmasking ============================= If we agree that masked values will contain nonsense, it seems like a bad idea for those values to be easily exposed. Further, in all the comments so far I have not seen an example of a need for unmasking that is not more easily, efficiently and safely achieved by simply creating a new MaskedArray with a different mask. If super-users want to access the ._data attribute they can, but I don't think it should be recommended. Normal users can use the ".filled" method, which by the way I implemented to optionally support returning a readonly view rather than a copy (the default). Cheers, Allan

...

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/21/19 2:37 PM, Benjamin Root wrote: > Just to note, data that is masked isn't always garbage. There are plenty > of use-cases where one may want to temporarily apply a mask for a set of > computation, or possibly want to apply a series of different masks to > the data. I haven't read through this discussion deeply enough, but is > this new class going to destroy underlying masked data? and will it be > possible to swap out masks? > > Cheers! > Ben Root

Indeed my implementation currently feels free to clobber the data at masked positions and makes no guarantees not to.

I'd like to try to support reasonable use-cases like yours though. A few thoughts:

First, the old np.ma.MaskedArray explicitly does not promise to preserve masked values, with a big warning in the docs. I can't recall the examples, but I remember coming across cases where clobbering happens. So arguably your behavior was never supported, and perhaps this means that no-clobber behavior is difficult to reasonably support.

Second, the old np.ma.MaskedArray avoids frequent clobbering by making lots of copies. Therefore, in most cases you will not lose any performance in my new MaskedArray relative to the old one by making an explicit copy yourself. I.e, is it problematic to have to do

>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask >>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often "effectively" clobbered, in the sense that they are not computed. For example, if you do "c = a+b", and then change the mask of c, the values at masked position of the result of (a+b) do not correspond to the sum of the masked values in a and b. Thus, by "unmasking" c you are exposing nonsense values, which to me seems likely to cause heisenbugs.

In summary, by not making no-clobber guarantees and by strictly preventing exposure of nonsense values, I suspect that: 1. my new code is simpler and faster by avoiding lots of copies, and forces copies to be explicit in user code. 2. disallowing direct modification of the mask lowers the "API surface area" making people's MaskedArray code less buggy and easier to read: Exposure of nonsense values by "unmasking" is one less possibility to keep in mind.

Best, Allan

> On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > > On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > > Hi Allan, > > > > This is very impressive! I could get the tests that I wrote for my > class > > pass with yours using Quantity with what I would consider very minimal > > changes. I only could not find a good way to unmask data (I like the > > idea of setting the mask on some elements via `ma[item] = X`); is this > > on purpose? > > Yes, I want to make it difficult for the user to access the garbage > values under the mask, which are often clobbered values. The only way to > "remove" a masked value is by replacing it with a new non-masked value. > > > > Anyway, it would seem easily at the point where I should comment > on your > > repository rather than in the mailing list! > > To make further progress on this encapsulation idea I need a more > complete ducktype to pass into MaskedArray to test, so that's what I'll > work on next, when I have time. I'll either try to finish my > ArrayCollection type, or try making a simple NDunit ducktype > piggybacking on astropy's Unit. > > Best, > Allan > > > > > > All the best, > > > > Marten > > > > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>> > wrote: > > > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > > <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> > > > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>>>> > > wrote: > > > <snip> > > > > > > > This may be too much to ask from the initializer, but, if > > so, it still > > > > seems most useful if it is made as easy as possible to do, > > say, `class > > > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > > > > > Currently MaskedArray does not accept ducktypes as > underlying > > arrays, > > > but I think it shouldn't be too hard to modify it to do so. > > Good idea! > > > > > > > > > Looking back at my trial, I see that I also never got to > duck arrays - > > > only ndarray subclasses - though I tried to make the code as > > agnostic as > > > possible. > > > > > > (Trial at > > > > > > https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?...) > > > > > > I already partly navigated this mixin-issue in the > > > "MaskedArrayCollection" class, which essentially does > > > ArrayCollection(MaskedArray(array)), and only takes about 30 > > lines of > > > boilerplate. That's the backwards encapsulation order from > > what you want > > > though. > > > > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * > u.m, > > > mask=[True, False, False])` does indeed not have a `.unit` > attribute > > > (and cannot represent itself...); I'm not at all sure that my > > method of > > > just creating a mixed class is anything but a recipe for > disaster, > > though! > > > > Based on your suggestion I worked on this a little today, and > now my > > MaskedArray more easily encapsulates both ducktypes and ndarray > > subclasses (pushed to repo). Here's an example I got working > with masked > > units using unyt: > > > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > > > [2]: from unyt import m, km > > > > [3]: import numpy as np > > > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > > > [5]: uarr > > > > MaskedArray([1., X , 3.]) > > [6]: uarr + 1*m > > > > MaskedArray([1.001, X , 3.001]) > > [7]: uarr.filled() > > > > unyt_array([1., 0., 3.], 'km') > > [8]: np.concatenate([uarr, 2*uarr]).filled() > > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > > > The catch is the ducktype/subclass has to rigorously follow > numpy's > > indexing rules, including distinguishing 0d arrays from > scalars. For now > > only I used unyt in the example above since it happens to be > less strict > > about dimensionless operations than astropy.units which trips > up my > > repr code. (see below for example with astropy.units). Note in > the last > > line I lost the dimensions, but that is because unyt does not > handle > > np.concatenate. To get that to work we need a true ducktype > for units. > > > > The example above doesn't expose the ".units" attribute > outside the > > MaskedArray, and it doesn't print the units in the repr. But > you can > > access them using "filled". > > > > While I could make MaskedArray forward unknown attribute > accesses to the > > encapsulated array, that seems a bit dangerous/bug-prone at first > > glance, so probably I want to require the user to make a > MaskedArray > > subclass to do so. I've just started playing with that > (probably buggy), > > and Ive attached subclass examples for astropy.unit and unyt, > with some > > example output below. > > > > Cheers, > > Allan > > > > > > > > Example using the attached astropy unit subclass: > > > > >>> from astropy.units import m, km, s > > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > > >>> uarr > > MaskedQ([1., X , 1.], units=km) > > >>> uarr.units > > km > > >>> uarr + (1*m) > > MaskedQ([1.001, X , 1.001], units=km) > > >>> uarr/(1*s) > > MaskedQ([1., X , 1.], units=km / s) > > >>> (uarr*(1*m))[1:] > > MaskedQ([X , 1.], units=km m) > > >>> np.add.outer(uarr, uarr) > > MaskedQ([[2., X , 2.], > > [X , X , X ], > > [2., X , 2.]], units=km) > > >>> print(uarr) > > [1. X 1.] km m > > > > Cheers, > > Allan > > > > > > > > Even if this impossible, I think it is conceptually useful > > to think > > > > about what the masking class should do. My sense is that, > > e.g., it > > > > should not attempt to decide when an operation > succeeds or not, > > > but just > > > > "or together" input masks for regular, multiple-input > functions, > > > and let > > > > the underlying arrays skip elements for reductions by > using > > `where` > > > > (hey, I did implement that for a reason... ;-). In > > particular, it > > > > suggests one should not have things like domains and all > > that (I never > > > > understood why `MaskedArray` did that). If one wants more, > > the class > > > > should provide a method that updates the mask (a sensible > > default > > > might > > > > be `mask |= ~np.isfinite(result)` - here, the class > being masked > > > should > > > > logically support ufuncs and functions, so it can > decide what > > > "isfinite" > > > > means). > > > > > > I agree it would be nice to remove domains. It would > make life > > easier, > > > and I could remove a lot of twiddly code! I kept it in > for now to > > > minimize the behavior changes from the old MaskedArray. > > > > > > > > > That makes sense. Could be separated out to a > backwards-compatibility > > > class later. > > > > > > > > > > In any case, I would think that a basic truth should > be that > > > everything > > > > has a mask with a shape consistent with the data, so > > > > 1. Each complex numbers has just one mask, and setting > > `a.imag` with a > > > > masked array should definitely propagate the mask. > > > > 2. For a masked array with structured dtype, I'd > similarly say > > > that the > > > > default is for a mask to have the same shape as the array. > > But that > > > > something like your collection makes sense for the case > > where one > > > wants > > > > to mask items in a structure. > > > > > > Agreed that we should have a single bool per complex or > structured > > > element, and the mask shape is the same as the array shape. > > That's how I > > > implemented it. But there is still a problem with > complex.imag > > > assignment: > > > > > > >>> a = MaskedArray([1j, 2, X]) > > > >>> i = a.imag > > > >>> i[:] = MaskedArray([1, X, 1]) > > > > > > If we make the last line copy the mask to the original > array, what > > > should the real part of a[2] be? Conversely, if we don't > copy > > the mask, > > > what should the imag part of a[1] be? It seems like we might > > "want" the > > > masks to be OR'd instead, but then should i[2] be masked > after > > we just > > > set it to 1? > > > > > > Ah, I see the issue now... Easiest to implement and closest > in analogy > > > to a regular view would be to just let it unmask a[2] (with > > whatever is > > > in real; user beware!). > > > > > > Perhaps better would be to special-case such that `imag` > returns a > > > read-only view of the mask. Making `imag` itself read-only would > > prevent > > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but > > there is > > > no reason this should update the mask. > > > > > > Still, neither is really satisfactory... > > > > > > > > > > > > > p.s. I started trying to implement the above "Mixin" > class; will > > > try to > > > > clean that up a bit so that at least it uses `where` and > > push it up. > > > > > > I played with "where", but didn't include it since 1.17 > is not > > released. > > > To avoid duplication of effort, I've attached a diff of > what I > > tried. I > > > actually get a slight slowdown of about 10% by using > where... > > > > > > > > > Your implementation is indeed quite similar to what I got in > > > __array_ufunc__ (though one should "&" the where with ~mask). > > > > > > I think the main benefit is not to presume that whatever is > underneath > > > understands 0 or 1, i.e., avoid filling. > > > > > > > > > If you make progress with the mixin, a push is welcome. I > > imagine a > > > problem is going to be that np.isscalar doesn't work to > detect > > duck > > > scalars. > > > > > > I fear that in my attempts I've simply decided that only > array scalars > > > exist... > > > > > > -- Marten > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>>> > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> <mailto:NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org>> > https://mail.python.org/mailman/listinfo/numpy-discussion > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Allan Haldane

12:05 p.m.

On 6/24/19 11:46 AM, Allan Haldane wrote:

...

A no-clobber guarantee makes your "iterative mask" example solvable in an efficient (no-copy) way:

mask, last_mask = False while True: dat_mean = np.mean(MaskedArray(data, mask)) mask, last_mask = np.abs(data - mask) > cutoff, mask if np.all(mask == last_mask): break

Whoops, that should read "np.abs(data - dat_mean)" in there. Allan

Stephan Hoyer

12:16 p.m.

On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane <allanhaldane@gmail.com> wrote:

...

1. Making a "no-clobber" guarantee on the underlying data

Hi Allan -- could kindly clarify what you mean by "no-clobber"? Is this referring to allowing masked arrays to mutate masked data values in-place, even on apparently non-in-place operators? If so, that definitely seems like a bad idea to me. I would much rather do an unnecessary copy than have surprisingly non-thread-safe operations.

...

If we agree that masked values will contain nonsense, it seems like a bad idea for those values to be easily exposed.

Further, in all the comments so far I have not seen an example of a need for unmasking that is not more easily, efficiently and safely achieved by simply creating a new MaskedArray with a different mask.

My understanding is that essentially every low-level MaskedArray function is implemented by looking at the data and mask separately. If so, we should definitely expose this API directly to users (as part of the public API for MaskedArray), so they can write their own efficient algorithms. As a concrete example, suppose I wanted to implement a low-level "grouped mean" operation for masked arrays like that found in pandas. This isn't a builtin NumPy function, so I would need to write this myself. This would be relatively straightforward to do in Numba or Cython with raw NumPy arrays (e.g., see my example here for a NaN skipping version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py), but to do it efficiently you definitely don't want to make an unnecessary copy. The usual reason for hiding implementation details is when we want to reserve the right to change them. But if we're sure about the data model (which I think we are for MaskedArray) then I think there's a lot of value in exposing it directly to users, even if it's lower level than it appropriate to use in most cases.

Allan Haldane

1:38 p.m.

On 6/24/19 12:16 PM, Stephan Hoyer wrote:

...

On Mon, Jun 24, 2019 at 8:46 AM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

1. Making a "no-clobber" guarantee on the underlying data

Hi Allan -- could kindly clarify what you mean by "no-clobber"?

Is this referring to allowing masked arrays to mutate masked data values in-place, even on apparently non-in-place operators? If so, that definitely seems like a bad idea to me. I would much rather do an unnecessary copy than have surprisingly non-thread-safe operations.

Yes. In my current implementation, the operation: >>> a = np.arange(6) >>> m = MaskedArray(a, mask=a < 3) >>> res = np.dot(m, m) will clobber elements of a. It appears that to avoid clobbering we will need to have np.dot make a copy. I also discuss how my implementation clobbers views in the docs: https://github.com/ahaldane/ndarray_ducktypes/blob/master/doc/MaskedArray.md... I expect I could be convinced to make a no-clobber guarantee, if others agree it is better to accept the performance loss by making a copy internally. I just still have a hard time thinking of cases where clobbering is really that confusing, or easily avoidable by the user making an explicit copy. I like giving the user control over whether a copy is made or not, since I expect in the vast majority of cases a copy is unnecessary. I think it will be rare usage for people to hold on to the data array ("a" in the example above). Most of the time you create the MaskedArray on data created on the spot which you never touch directly again. We are all already used to numpy's "view" behavior (eg, for the np.array function), where if you don't explicitly make a copy of your orginal array you can expect further operations to modify it. Admittedly for MaskedArray it's a bit different since apparently readonly operations like np.dot can clobber, but again, it doesn't seem hard to know about or burdensome to avoid by explicit copy, and can give big performance advantages.

...

If we agree that masked values will contain nonsense, it seems like a bad idea for those values to be easily exposed.

Further, in all the comments so far I have not seen an example of a need for unmasking that is not more easily, efficiently and safely achieved by simply creating a new MaskedArray with a different mask.

My understanding is that essentially every low-level MaskedArray function is implemented by looking at the data and mask separately. If so, we should definitely expose this API directly to users (as part of the public API for MaskedArray), so they can write their own efficient algorithms.> As a concrete example, suppose I wanted to implement a low-level "grouped mean" operation for masked arrays like that found in pandas. This isn't a builtin NumPy function, so I would need to write this myself. This would be relatively straightforward to do in Numba or Cython with raw NumPy arrays (e.g., see my example here for a NaN skipping version: https://github.com/shoyer/numbagg/blob/v0.1.0/numbagg/grouped.py), but to do it efficiently you definitely don't want to make an unnecessary copy.

The usual reason for hiding implementation details is when we want to reserve the right to change them. But if we're sure about the data model (which I think we are for MaskedArray) then I think there's a lot of value in exposing it directly to users, even if it's lower level than it appropriate to use in most cases.

Fair enough, I think it is all right to allow people access to ._data and make some guarantees about it if they are implementing subclasses or defining new ducktypes. There should be a section in the documentation describing what guanantees we make about ._data (or ._array if we change the name) and how/when to use it. Best, Allan

...

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Marten van Kerkwijk

3:09 p.m.

Hi Allan, Thanks for bringing up the noclobber explicitly (and Stephan for asking for clarification; I was similarly confused). It does clarify the difference in mental picture. In mine, the operation would indeed be guaranteed to be done on the underlying data, without copy and without `.filled(...)`. I should clarify further that I use `where` only to skip reading elements (i.e., in reductions), not writing elements (as you mention, the unwritten element will often be nonsense - e.g., wrong units - which to me is worse than infinity or something similar; I've not worried at all about runtime warnings). Note that my main reason here is not that I'm against filling with numbers for numerical arrays, but rather wanting to make minimal assumptions about the underlying data itself. This may well be a mistake (but I want to find out where it breaks). Anyway, it would seem in many ways all the better that our models are quite different. I definitely see the advantages of your choice to decide one can do with masked data elements whatever is logical ahead of an operation! Thanks also for bringing up a useful example with `np.dot(m, m)` - clearly, I didn't yet get beyond overriding ufuncs! In my mental model, where I'd apply `np.dot` on the data and the mask separately, the result will be wrong, so the mask has to be set (which it would be). For your specific example, that might not be the best solution, but when using `np.dot(matrix_shaped, matrix_shaped)`, I think it does give the correct masking: any masked element in a matrix better propagate to all parts that it influences, even if there is a reduction of sorts happening. So, perhaps a price to pay for a function that tries to do multiple things. The alternative solution in my model would be to replace `np.dot` with a masked-specific implementation of what `np.dot` is supposed to stand for (in your simple example, `np.add.reduce(np.multiply(m, m))` - more generally, add relevant `outer` and `axes`). This would be similar to what I think all implementations do for `.mean()` - we cannot calculate that from the data using any fill value or skipping, so rather use a more easily cared-for `.sum()` and divide by a suitable number of elements. But in both examples the disadvantage is that we took away the option to use the underlying class's `.dot()` or `.mean()` implementations. (Aside: considerations such as these underlie my longed-for exposure of standard implementations of functions in terms of basic ufunc calls.) Another example of a function for which I think my model is not particularly insightful (and for which it is difficult to know what to do generally) is `np.fft.fft`. Since an fft is equivalent to a sine/cosine fits to data points, the answer for masked data is in principle quite well-defined. But much less easy to implement! All the best, Marten

Eric Firing

4:40 p.m.

On 2019/06/24 9:09 AM, Marten van Kerkwijk wrote:

...

Another example of a function for which I think my model is not particularly insightful (and for which it is difficult to know what to do generally) is `np.fft.fft`. Since an fft is equivalent to a sine/cosine fits to data points, the answer for masked data is in principle quite well-defined. But much less easy to implement!

How is it well-defined? If you have N points of which M are masked, you have a vector with N-M pieces of information in the time domain. That can't be *uniquely* mapped to N points in the frequency domain. I think fft applied to a MaskedArray input should raise an exception, at least if there are any masked points. It must be left to the user to decide how to handle the masked points, e.g. fill with zero, or with the mean, or with linear interpolation, etc. Eric

Marten van Kerkwijk

5:39 p.m.

Hi Eric, The easiest definitely is for the mask to just propagate, which that even if just one point is masked, all points in the fft will be masked. On the direct point I made, I think it is correct that since one can think of the Fourier transform of a sine/cosine fit, then there is a solution even in the presence of some masked data, and this solution is distinct from that for a specific choice of fill value. But of course it is also true that the solution will be at least partially degenerate in its result and possibly indeterminate (e.g., for the extreme example of a real transform for which all but the first point are masked, all cosine term amplitudes are equal to the value of the first term, and are completely degenerate with each other, and all sine term amplitudes are indeterminate; one has only one piece of information, after all). Yet the inverse of any of those choices reproduces the input. That said, clearly there is a choice to be made whether this solution is at all interesting, which means that you are right that it needs an explicit user decision. All the best, Marten

Warren Weckesser

5:50 p.m.

On 6/24/19, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:

...

Hi Eric,

The easiest definitely is for the mask to just propagate, which that even if just one point is masked, all points in the fft will be masked.

On the direct point I made, I think it is correct that since one can think of the Fourier transform of a sine/cosine fit, then there is a solution even in the presence of some masked data, and this solution is distinct from that for a specific choice of fill value. But of course it is also true that the solution will be at least partially degenerate in its result and possibly indeterminate (e.g., for the extreme example of a real transform for which all but the first point are masked, all cosine term amplitudes are equal to the value of the first term, and are completely degenerate with each other, and all sine term amplitudes are indeterminate; one has only one piece of information, after all). Yet the inverse of any of those choices reproduces the input. That said, clearly there is a choice to be made whether this solution is at all interesting, which means that you are right that it needs an explicit user decision.

FWIW: The discrete Fourier transform is equivalent to a matrix multiplication (https://en.wikipedia.org/wiki/DFT_matrix, https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.dft.html), so whatever behavior you define for a nonmasked array times a masked vector also applies to the FFT of a masked vector. Warren

...

All the best,

Marten

Eric Firing

5:52 p.m.

On 2019/06/24 11:39 AM, Marten van Kerkwijk wrote:

...

Hi Eric,

The easiest definitely is for the mask to just propagate, which that even if just one point is masked, all points in the fft will be masked.

This is perfectly reasonable, and consistent with what happens with nans, of course. My suggestion of raising an Exception is probably not a good idea, as I realized shortly after sending the message. As a side note, I am happy to see the current burst of effort toward improved MaskedArray functionality, and very grateful for the work you, Allan, and others are doing in that regard. Eric

...

On the direct point I made, I think it is correct that since one can think of the Fourier transform of a sine/cosine fit, then there is a solution even in the presence of some masked data, and this solution is distinct from that for a specific choice of fill value. But of course it is also true that the solution will be at least partially degenerate in its result and possibly indeterminate (e.g., for the extreme example of a real transform for which all but the first point are masked, all cosine term amplitudes are equal to the value of the first term, and are completely degenerate with each other, and all sine term amplitudes are indeterminate; one has only one piece of information, after all). Yet the inverse of any of those choices reproduces the input. That said, clearly there is a choice to be made whether this solution is at all interesting, which means that you are right that it needs an explicit user decision.

All the best,

Marten

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Charles R Harris

5:57 p.m.

On Mon, Jun 24, 2019 at 3:40 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

Hi Eric,

The easiest definitely is for the mask to just propagate, which that even if just one point is masked, all points in the fft will be masked.

On the direct point I made, I think it is correct that since one can think of the Fourier transform of a sine/cosine fit, then there is a solution even in the presence of some masked data, and this solution is distinct from that for a specific choice of fill value. But of course it is also true that the solution will be at least partially degenerate in its result and possibly indeterminate (e.g., for the extreme example of a real transform for which all but the first point are masked, all cosine term amplitudes are equal to the value of the first term, and are completely degenerate with each other, and all sine term amplitudes are indeterminate; one has only one piece of information, after all). Yet the inverse of any of those choices reproduces the input. That said, clearly there is a choice to be made whether this solution is at all interesting, which means that you are right that it needs an explicit user decision.

Might be a good fit with the NUFFT <https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-n...> . Chuck

Allan Haldane

6:55 p.m.

On 6/24/19 3:09 PM, Marten van Kerkwijk wrote:

...

Hi Allan,

Thanks for bringing up the noclobber explicitly (and Stephan for asking for clarification; I was similarly confused).

It does clarify the difference in mental picture. In mine, the operation would indeed be guaranteed to be done on the underlying data, without copy and without `.filled(...)`. I should clarify further that I use `where` only to skip reading elements (i.e., in reductions), not writing elements (as you mention, the unwritten element will often be nonsense - e.g., wrong units - which to me is worse than infinity or something similar; I've not worried at all about runtime warnings). Note that my main reason here is not that I'm against filling with numbers for numerical arrays, but rather wanting to make minimal assumptions about the underlying data itself. This may well be a mistake (but I want to find out where it breaks).

Anyway, it would seem in many ways all the better that our models are quite different. I definitely see the advantages of your choice to decide one can do with masked data elements whatever is logical ahead of an operation!

Thanks also for bringing up a useful example with `np.dot(m, m)` - clearly, I didn't yet get beyond overriding ufuncs!

In my mental model, where I'd apply `np.dot` on the data and the mask separately, the result will be wrong, so the mask has to be set (which it would be). For your specific example, that might not be the best solution, but when using `np.dot(matrix_shaped, matrix_shaped)`, I think it does give the correct masking: any masked element in a matrix better propagate to all parts that it influences, even if there is a reduction of sorts happening. So, perhaps a price to pay for a function that tries to do multiple things.

The alternative solution in my model would be to replace `np.dot` with a masked-specific implementation of what `np.dot` is supposed to stand for (in your simple example, `np.add.reduce(np.multiply(m, m))` - more generally, add relevant `outer` and `axes`). This would be similar to what I think all implementations do for `.mean()` - we cannot calculate that from the data using any fill value or skipping, so rather use a more easily cared-for `.sum()` and divide by a suitable number of elements. But in both examples the disadvantage is that we took away the option to use the underlying class's `.dot()` or `.mean()` implementations.

Just to note, my current implementation uses the IGNORE style of mask, so does not propagate the mask in np.dot: >>> a = MaskedArray([[1,1,1], [1,X,1], [1,1,1]]) >>> np.dot(a, a) MaskedArray([[3, 2, 3], [2, 2, 2], [3, 2, 3]]) I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior. The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation. best, Allan

...

(Aside: considerations such as these underlie my longed-for exposure of standard implementations of functions in terms of basic ufunc calls.)

Another example of a function for which I think my model is not particularly insightful (and for which it is difficult to know what to do generally) is `np.fft.fft`. Since an fft is equivalent to a sine/cosine fits to data points, the answer for masked data is in principle quite well-defined. But much less easy to implement!

All the best,

Marten

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

Stephan Hoyer

7:21 p.m.

On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...

I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.

The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.

This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks. I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than the data itself. There are a spectrum of ways to handle missing data, and the right way to propagating missing values is often highly context dependent. The right way to set this is in functions where operations are defined, not on classes that may be defined far away from where the computation happen. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count." Are there examples of existing projects that define separate user-facing types for different styles of masks?

Marten van Kerkwijk

8:34 p.m.

On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer@gmail.com> wrote:

...

On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...
I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.

The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.

This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks.

I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than the data itself. There are a spectrum of ways to handle missing data, and the right way to propagating missing values is often highly context dependent. The right way to set this is in functions where operations are defined, not on classes that may be defined far away from where the computation happen. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count."

Anything that specific like that is probably indeed outside of the purview of a MaskedArray class. But your general point is well taken: we really need to ask clearly what the mask means not in terms of operations but conceptually. Personally, I guess like Benjamin I have mostly thought of it as "data here is bad" (because corrupted, etc.) or "data here is irrelevant" (because of sea instead of land in a map). And I would like to proceed nevertheless with calculating things on the remainder. For an expectation value (or, less obviously, a minimum or maximum), this is mostly OK: just ignore the masked elements. But even for something as simple as a sum, what is correct is not obvious: if I ignore the count, I'm effectively assuming the expectation is symmetric around zero (this is why `vector.dot(vector)` fails); a better estimate would be `np.add.reduce(data, where=~mask) * N(total) / N(unmasked)`. Of course, the logical conclusion would be that this is not possible to do without guidance from the user, or, thinking more, that really a masked array is not at all what I want for this problem; really I am just using (1-mask) as a weight, and the sum of the weights matters, so I should have a WeightArray class where that is returned along with the sum of the data (or, a bit less extreme, a `CountArray` class, or, more extreme, a value and its uncertainty - heck, sounds a lot like my Variable class from 4 years ago, https://github.com/astropy/astropy/pull/3715, which even takes care of covariance [following the Uncertainty package]). OK, it seems I've definitely worked myself in a corner tonight where I'm not sure any more what a masked array is good for in the first place... I'll stop for the day! All the best, Marten

Stephan Hoyer

9:03 p.m.

On Mon, Jun 24, 2019 at 5:36 PM Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:

...

On Mon, Jun 24, 2019 at 7:21 PM Stephan Hoyer <shoyer@gmail.com> wrote:

...
On Mon, Jun 24, 2019 at 3:56 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...
I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.

The behavior you described further above where the output row/col would be masked corresponds better to "NA" (propagating) mask behavior, which I am leaving for later implementation.

This does seem like a clean way to *implement* things, but from a user perspective I'm not sure I would want separate classes for "IGNORE" vs "NA" masks.

I tend to think of "IGNORE" vs "NA" as descriptions of particular operations rather than the data itself. There are a spectrum of ways to handle missing data, and the right way to propagating missing values is often highly context dependent. The right way to set this is in functions where operations are defined, not on classes that may be defined far away from where the computation happen. For example, pandas has a "min_count" parameter in functions for intermediate use-cases between "IGNORE" and "NA" semantics, e.g., "take an average, unless the number of data points is fewer than min_count."

Anything that specific like that is probably indeed outside of the purview of a MaskedArray class.

I agree that it doesn't make much sense to have a "min_count" attribute on a MaskedArray class, but certainly it makes sense for operations on MaskedArray objects, e.g., to write something like masked_array.mean(min_count=10). This is what users do in pandas today.

...

But your general point is well taken: we really need to ask clearly what the mask means not in terms of operations but conceptually.

Personally, I guess like Benjamin I have mostly thought of it as "data here is bad" (because corrupted, etc.) or "data here is irrelevant" (because of sea instead of land in a map). And I would like to proceed nevertheless with calculating things on the remainder. For an expectation value (or, less obviously, a minimum or maximum), this is mostly OK: just ignore the masked elements. But even for something as simple as a sum, what is correct is not obvious: if I ignore the count, I'm effectively assuming the expectation is symmetric around zero (this is why `vector.dot(vector)` fails); a better estimate would be `np.add.reduce(data, where=~mask) * N(total) / N(unmasked)`.

I think it's fine and logical to define default semantics for operations on MaskedArray objects. Much of the time, replacing masked values with 0 is the right thing to do for sum. Certainly IGNORE semantics are more useful overall than the NA semantics. But even if a MaskedArray conceptually always represents "bad" or "irrelevant" data, the way to handle those missing values will differ based on the use case, and not everything will fall cleanly into either IGNORE or NA buckets. I think it makes sense to provide users with functions/methods that expose these options, rather than requiring that they convert their data into a different type MaskedArray. "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." —Alan Perlis https://stackoverflow.com/questions/6016271/why-is-it-better-to-have-100-fun...

Marten van Kerkwijk

7:54 p.m.

Hi Allan,

...

The alternative solution in my model would be to replace `np.dot` with a

...
masked-specific implementation of what `np.dot` is supposed to stand for (in your simple example, `np.add.reduce(np.multiply(m, m))` - more generally, add relevant `outer` and `axes`). This would be similar to what I think all implementations do for `.mean()` - we cannot calculate that from the data using any fill value or skipping, so rather use a more easily cared-for `.sum()` and divide by a suitable number of elements. But in both examples the disadvantage is that we took away the option to use the underlying class's `.dot()` or `.mean()` implementations.

Just to note, my current implementation uses the IGNORE style of mask, so does not propagate the mask in np.dot:

>>> a = MaskedArray([[1,1,1], [1,X,1], [1,1,1]]) >>> np.dot(a, a)

MaskedArray([[3, 2, 3], [2, 2, 2], [3, 2, 3]])

I'm not at all set on that behavior and we can do something else. For now, I chose this way since it seemed to best match the "IGNORE" mask behavior.

It is a nice example, I think. In terms of action on the data, one would get this result as well in my pseudo-representation of `np.add.reduce(np.multiply(m, m))` - as long as the multiply is taken as outer product along the relevant axes (which does not ignore the mask, i.e., if either element is masked, the product is too), and subsequently a sum which works like other reductions and skips masked elements.

...

From the FFT array multiplication analogy, though, it is not clear that, effectively, replacing masked elements by 0 is reasonable.

Equivalently, thinking of `np.dot` in its 1-D form as presenting the length of the projection of one vector along another, it is not clear what a single masked element is supposed to mean. In a way, masking just one element of a vector or of a matrix makes vector or matrix operations meaningless. I thought fitting data with a mask might give a counterexample, but in that one usually calculates at some point r = y - A x, so no masking of the matrix, and subtraction y-Ax passing on a mask, and summing of r ignoring masked elements does just the right thing. All the best, Marten

Stephan Hoyer

2:29 a.m.

On Thu, Jun 20, 2019 at 7:44 PM Allan Haldane <allanhaldane@gmail.com> wrote:

...

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:

...
Hi Allan,

This is very impressive! I could get the tests that I wrote for my class pass with yours using Quantity with what I would consider very minimal changes. I only could not find a good way to unmask data (I like the idea of setting the mask on some elements via `ma[item] = X`); is this on purpose?

Yes, I want to make it difficult for the user to access the garbage values under the mask, which are often clobbered values. The only way to "remove" a masked value is by replacing it with a new non-masked value.

I think we should make it possible to access (and even mutate) data under the mask directly, while noting the lack of any guarantees about what those values are. MaskedArray has a minimal and transparent data model, consisting of data and mask arrays. There are plenty of use cases where it is convenient to access the underlying arrays directly, e.g., for efficient implementation of low-level MaskedArray algorithms. NumPy itself does a similar thing on ndarray by exposing data/strides. Advanced users who learn the details of the data model find them useful, and everyone else ignores them.

...

...
Anyway, it would seem easily at the point where I should comment on your repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more complete ducktype to pass into MaskedArray to test, so that's what I'll work on next, when I have time. I'll either try to finish my ArrayCollection type, or try making a simple NDunit ducktype piggybacking on astropy's Unit.

dask.array would be another good example to try. I think it already should support __array_function__ (and if not it should be very easy to add).

...

Best, Allan

...
All the best,

Marten

On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:

On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com> > <mailto:allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>>> wrote: > <snip> > > > This may be too much to ask from the initializer, but, if so, it still > > seems most useful if it is made as easy as possible to do, say, `class > > MaskedQuantity(Masked, Quantity): <very few overrides>`. > > Currently MaskedArray does not accept ducktypes as underlying arrays, > but I think it shouldn't be too hard to modify it to do so. Good idea! > > > Looking back at my trial, I see that I also never got to duck

arrays -

...
> only ndarray subclasses - though I tried to make the code as agnostic as > possible. > > (Trial at >

https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?... )

...
> > I already partly navigated this mixin-issue in the > "MaskedArrayCollection" class, which essentially does > ArrayCollection(MaskedArray(array)), and only takes about 30 lines of > boilerplate. That's the backwards encapsulation order from what you want > though. > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m, > mask=[True, False, False])` does indeed not have a `.unit`

attribute

...
> (and cannot represent itself...); I'm not at all sure that my method of > just creating a mixed class is anything but a recipe for disaster, though!

Based on your suggestion I worked on this a little today, and now my MaskedArray more easily encapsulates both ducktypes and ndarray subclasses (pushed to repo). Here's an example I got working with

masked

...
units using unyt:

[1]: from MaskedArray import X, MaskedArray, MaskedScalar

[2]: from unyt import m, km

[3]: import numpy as np

[4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])

[5]: uarr

MaskedArray([1., X , 3.]) [6]: uarr + 1*m

MaskedArray([1.001, X , 3.001]) [7]: uarr.filled()

unyt_array([1., 0., 3.], 'km') [8]: np.concatenate([uarr, 2*uarr]).filled() unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')

The catch is the ducktype/subclass has to rigorously follow numpy's indexing rules, including distinguishing 0d arrays from scalars. For

now

...
only I used unyt in the example above since it happens to be less

strict

...
about dimensionless operations than astropy.units which trips up my repr code. (see below for example with astropy.units). Note in the

last

...
line I lost the dimensions, but that is because unyt does not handle np.concatenate. To get that to work we need a true ducktype for

units.

...
The example above doesn't expose the ".units" attribute outside the MaskedArray, and it doesn't print the units in the repr. But you can access them using "filled".

While I could make MaskedArray forward unknown attribute accesses to

the

...
encapsulated array, that seems a bit dangerous/bug-prone at first glance, so probably I want to require the user to make a MaskedArray subclass to do so. I've just started playing with that (probably

buggy),

...
and Ive attached subclass examples for astropy.unit and unyt, with

some

...
example output below.

Cheers, Allan

Example using the attached astropy unit subclass:

>>> from astropy.units import m, km, s >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) >>> uarr MaskedQ([1., X , 1.], units=km) >>> uarr.units km >>> uarr + (1*m) MaskedQ([1.001, X , 1.001], units=km) >>> uarr/(1*s) MaskedQ([1., X , 1.], units=km / s) >>> (uarr*(1*m))[1:] MaskedQ([X , 1.], units=km m) >>> np.add.outer(uarr, uarr) MaskedQ([[2., X , 2.], [X , X , X ], [2., X , 2.]], units=km) >>> print(uarr) [1. X 1.] km m

Cheers, Allan

> > Even if this impossible, I think it is conceptually useful to think > > about what the masking class should do. My sense is that, e.g., it > > should not attempt to decide when an operation succeeds or

not,

...
> but just > > "or together" input masks for regular, multiple-input

functions,

...
> and let > > the underlying arrays skip elements for reductions by using `where` > > (hey, I did implement that for a reason... ;-). In particular, it > > suggests one should not have things like domains and all that (I never > > understood why `MaskedArray` did that). If one wants more, the class > > should provide a method that updates the mask (a sensible default > might > > be `mask |= ~np.isfinite(result)` - here, the class being

masked

...
> should > > logically support ufuncs and functions, so it can decide what > "isfinite" > > means). > > I agree it would be nice to remove domains. It would make life easier, > and I could remove a lot of twiddly code! I kept it in for now

to

...
> minimize the behavior changes from the old MaskedArray. > > > That makes sense. Could be separated out to a

backwards-compatibility

...
> class later. > > > > In any case, I would think that a basic truth should be that > everything > > has a mask with a shape consistent with the data, so > > 1. Each complex numbers has just one mask, and setting `a.imag` with a > > masked array should definitely propagate the mask. > > 2. For a masked array with structured dtype, I'd similarly

say

...
> that the > > default is for a mask to have the same shape as the array. But that > > something like your collection makes sense for the case where one > wants > > to mask items in a structure. > > Agreed that we should have a single bool per complex or

structured

...
> element, and the mask shape is the same as the array shape. That's how I > implemented it. But there is still a problem with complex.imag > assignment: > > >>> a = MaskedArray([1j, 2, X]) > >>> i = a.imag > >>> i[:] = MaskedArray([1, X, 1]) > > If we make the last line copy the mask to the original array,

what

...
> should the real part of a[2] be? Conversely, if we don't copy the mask, > what should the imag part of a[1] be? It seems like we might "want" the > masks to be OR'd instead, but then should i[2] be masked after we just > set it to 1? > > Ah, I see the issue now... Easiest to implement and closest in

analogy

...
> to a regular view would be to just let it unmask a[2] (with whatever is > in real; user beware!). > > Perhaps better would be to special-case such that `imag` returns a > read-only view of the mask. Making `imag` itself read-only would prevent > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - but there is > no reason this should update the mask. > > Still, neither is really satisfactory... > > > > > p.s. I started trying to implement the above "Mixin" class;

will

...
> try to > > clean that up a bit so that at least it uses `where` and push it up. > > I played with "where", but didn't include it since 1.17 is not released. > To avoid duplication of effort, I've attached a diff of what I tried. I > actually get a slight slowdown of about 10% by using where... > > > Your implementation is indeed quite similar to what I got in > __array_ufunc__ (though one should "&" the where with ~mask). > > I think the main benefit is not to presume that whatever is

underneath

...
> understands 0 or 1, i.e., avoid filling. > > > If you make progress with the mixin, a push is welcome. I imagine a > problem is going to be that np.isscalar doesn't work to detect duck > scalars. > > I fear that in my attempts I've simply decided that only array

scalars

...
> exist... > > -- Marten > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > https://mail.python.org/mailman/listinfo/numpy-discussion >

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion

2062

Age (days ago)

2070

Last active (days ago)

List overview

Download

37 comments

9 participants

participants (9)

Aldcroft, Thomas
Allan Haldane
Benjamin Root
Charles R Harris
Eric Firing
Eric Wieser
Marten van Kerkwijk
Stephan Hoyer
Warren Weckesser

new MaskedArray class

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

Marten van Kerkwijk

tags

participants (9)