asarray/anyarray; matrix/subclass
Begin forwarded message:
From: Stephan Hoyer
Date: Friday, Nov 09, 2018 at 3:19 PM
To: Hameer Abbasi
Cc: Stefan van der Walt, Marten van Kerkwijk
Subject: asarray/anyarray; matrix/subclass
This is a great discussion, but let's try to have it in public (e.g., on the NumPy mailing list).
On Fri, Nov 9, 2018 at 8:42 AM Hameer Abbasi <einstein.edison@gmail.com> wrote:
Hi Stephan,
The issue I have with writing another function is that asarray/asanyarray are so widely used that it’d be a huge maintenance task to update them throughout NumPy, not to mention in other codebases, which would also have to rely on newer NumPy versions for this. In short, it would dramatically reduce adoption of this function.
One path we can take is to allow asarray/asanyarray to be overridable via __array_function__ (the former is debatable). This solves most of our duck-array related issues without introducing another protocol.
Regardless of what path we choose, I would recommend changing asanyarray to not pass through np.matrix, and to pass through mat.view(type=np.ndarray) instead, which has O(1) cost in time and memory. In the vast majority of contexts, asanyarray is used to ensure an array-ish structure for another operation, and usually there’s no guarantee that what comes out will be a matrix anyway. I suggest we raise a FutureWarning and then change this behaviour.
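As a rough illustration of the view mentioned above (an editorial sketch against current NumPy behaviour, not code from the original message; the names `m` and `a` are only illustrative), the following shows that `view(type=np.ndarray)` drops the matrix subclass without copying data:

```python
import warnings

import numpy as np

# np.matrix emits PendingDeprecationWarning; silence it for this demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

m = np.matrix([[1, 2], [3, 4]])

# Today, asanyarray passes the matrix subclass through unchanged.
print(type(np.asanyarray(m)).__name__)  # matrix

# The suggested replacement: a plain ndarray view onto the same buffer.
a = m.view(type=np.ndarray)
print(type(a).__name__)        # ndarray
print(np.shares_memory(m, a))  # True

# Writes through the view are visible in the original matrix.
a[0, 0] = 99
print(m[0, 0])  # 99
```

Because the view shares memory with the matrix, code that mutates the coerced array in place keeps working as before.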
There have been a number of discussions about deprecating np.matrix (and a few about MaskedArray as well, though there are less compelling reasons for that one). I suggest we start down that path as soon as possible. The biggest (only?) user I know of blocking that is scipy.sparse, and we’re on our way to replacing that with PyData/Sparse.
Best Regards, Hameer Abbasi
On Friday, Nov 09, 2018 at 1:26 AM, Stephan Hoyer <shoyer@gmail.com> wrote:
Hi Hameer,
I'd love to talk about this in more detail. I agree that something like this is needed.
The challenge with reusing an existing function like asanyarray() is that there is at least one (somewhat?) widely used ndarray subclass that badly violates the Liskov Substitution Principle: np.matrix.
NumPy can't really use np.asanyarray() widely for internal purposes until we don't have to worry about np.matrix. We might special case np.matrix in some way, but then asanyarray() would do totally opposite things on different versions of NumPy. It's almost certainly a better idea to just write a new function with the desired semantics, and "soft deprecate" asanyarray(). The new function can explicitly blacklist np.matrix, as well as any other subclasses we know of that badly violate LSP.
Cheers, Stephan
On Thu, Nov 8, 2018 at 5:06 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
No, Stefan, I’ll do that now. Putting you in the cc.
It slipped my mind among the million other things I had in mind — Namely: My job visa. It was only done this Monday.
Hi, Marten, Stephan:
Stefan wants me to write up a NEP that allows a given object to specify that it is a duck array — Namely, that it follows duck-array semantics.
We were thinking of changing asanyarray to pass through anything that implements the duck-array protocol, along with ndarray subclasses. I’m sure this would help XArray and Quantity work better with existing codebases, along with PyData/Sparse arrays.
Would you be interested?
Best Regards, Hameer Abbasi
On Thursday, Nov 08, 2018 at 9:09 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
Hi Hameer,
In last week's meeting, we had the following in the notes:
Hameer is contacting Marten & Stephan and will write up a draft NEP for clarifying the asarray/asanyarray and matrix/subclass path forward.
Did any of that happen that you could share?
Thanks and best regards, Stéfan
Hello, everyone,
Me, Stefan van der Walt, Stephan Hoyer and Marten van Kerkwijk were having a discussion about the state of matrix, asarray and asanyarray. Our thoughts are summarised above (in the quoted text that I’m forwarding).
Basically, this grew out of a discussion relating to asanyarray/asarray inconsistencies in NumPy about which to use where. Historically, asarray was used in many libraries/places instead of asanyarray, usually because np.matrix caused problems due to its special behaviour with regard to indexing (it always returns a 2-D object when eliminating one dimension, but a 0-D one when eliminating both), its behaviour regarding __mul__ (the multiplication operator represents matrix multiplication rather than element-wise multiplication) and its fixed dimensionality (matrix is 2D only). Because of these three things, as Stephan accurately pointed out, it violates the Liskov Substitution Principle.
Because of this behaviour, many libraries switched from using asanyarray to asarray, as np.matrix wouldn’t work with their code. This shut out other ndarray subclasses from being used as well, such as MaskedArray and astropy.Quantity. Even if asanyarray is used, there is usually no guarantee that a matrix will be returned instead of an array.
The changes I’m proposing are twofold, but simple:
- asanyarray should return mat.view(type=np.ndarray) instead of matrices, after a FutureWarning and the usual grace period. This allows us to preserve the performance (creating a view is O(1) in both memory and time) and the mutability of the original matrix.
- In the spirit of allowing duck arrays to work with existing NumPy code, asanyarray should be overridable via __array_function__, so that duck arrays can decide whether to pass themselves through. If subclasses are allowed, duck arrays should be as well.
This is a part of a larger effort to deprecate np.matrix. As far as I’m aware, it has one big customer (scipy.sparse). The effort to replace that is already underway at PyData/Sparse.
Best Regards, Hameer Abbasi
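The three np.matrix behaviours listed in this message can be reproduced with a short script. This is an editorial sketch, not part of the original email; the variable names are illustrative and the comments describe what current NumPy releases print.

```python
import warnings

import numpy as np

# np.matrix emits PendingDeprecationWarning; silence it for this demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)

a = np.array([[1, 2], [3, 4]])
m = np.matrix([[1, 2], [3, 4]])

# 1. Indexing: dropping one dimension still yields a 2-D object,
#    while dropping both yields a 0-D scalar.
print(a[0].shape)        # (2,)
print(m[0].shape)        # (1, 2)
print(np.ndim(m[0, 0]))  # 0

# 2. __mul__: element-wise for ndarray, matrix product for matrix.
print((a * a).tolist())  # [[1, 4], [9, 16]]
print((m * m).tolist())  # [[7, 10], [15, 22]]

# 3. Fixed dimensionality: a matrix cannot become 1-D.
print(a.ravel().shape)   # (4,)
print(m.ravel().shape)   # (1, 4)
```

Any function written for ndarray that relies on one of these three behaviours will give different results, or fail, when handed a matrix, which is the substitution problem described above.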
I’m still not sure I agree with the advantages of reusing asanyarray(), even if matrix did not exist. Yes, asanyarray will exist in old NumPy versions, but you can’t use it with sparse arrays anyways because it will have the wrong semantics. I expect this would be a bug magnet, with inadvertent loading of sparse arrays into memory if you’re accidentally using old NumPy.
With regards to the protocol, I would suggest a dedicated method, e.g., __asanyarray__ (or something similar based on the final chosen name of the function). Coercing to arrays is special enough to have its own dedicated protocol, and it could be useful for libraries like xarray to check for __asanyarray__ attributes before deciding which coercion mechanism to use.
On Fri, Nov 9, 2018 at 10:17 AM Hameer Abbasi <einstein.edison@gmail.com> wrote:
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
But matrix isn't the only problem with asanyarray. np.ma also violates Liskov. No doubt there are other problematic ndarray subclasses out there too...
If we were going to try to reuse asanyarray through some deprecation mechanism, I think we'd need to deprecate allowing asanyarray to return *any* ndarray subclass, unless they explicitly provided an __asanyarray__ dunder. But at that point I'm not sure what the point would be of reusing it.
On Fri, Nov 9, 2018 at 7:15 AM, Hameer Abbasi <einstein.edison@gmail.com> wrote:
-- Nathaniel J. Smith -- https://vorpus.org
On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith <njs@pobox.com> wrote:
But matrix isn't the only problem with asanyarray. np.ma also violates Liskov. No doubt there are other problematic ndarray subclasses out there too...
Please forgive my ignorance (I don't really use masked arrays), but how specifically do masked arrays violate Liskov? In most cases shouldn't they work the same as base numpy arrays, except with operations keeping track of masks?
I'm sure there are some cases where masked arrays have different semantics than NumPy arrays, but are any of these intentional? I would guess that the worst current violation is that there is a risk of losing mask information in some operations, but implementing __array_function__ would presumably make it possible to fix most of these.
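To make the `__array_function__` idea concrete, here is an editorial sketch (not from the original message). `TrackedMask` is a hypothetical toy class, not `np.ma.MaskedArray`; it only shows how a type can intercept a NumPy function, `np.concatenate` in this case, so that its mask is carried along rather than dropped. It assumes a NumPy version in which the `__array_function__` protocol is enabled.

```python
import numpy as np


class TrackedMask:
    """Hypothetical duck array: values plus a boolean mask (True = missing)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data, dtype=float)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        # Handle only np.concatenate; decline everything else.
        if func is np.concatenate:
            pieces = args[0]
            data = np.concatenate([p.data for p in pieces], **kwargs)
            mask = np.concatenate([p.mask for p in pieces], **kwargs)
            return TrackedMask(data, mask)
        return NotImplemented


left = TrackedMask([1.0, 2.0], [False, True])
right = TrackedMask([3.0, 4.0], [True, False])

# NumPy dispatches to TrackedMask.__array_function__ here.
joined = np.concatenate([left, right])
print(type(joined).__name__)  # TrackedMask
print(joined.mask.tolist())   # [False, True, True, False]
```

Functions the class does not handle return `NotImplemented`, so NumPy raises a `TypeError` instead of silently computing on unmasked data.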
On Fri, Nov 9, 2018 at 4:59 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith <njs@pobox.com> wrote:
But matrix isn't the only problem with asanyarray. np.ma also violates Liskov. No doubt there are other problematic ndarray subclasses out there too...
Please forgive my ignorance (I don't really use mask arrays), but how specifically do masked arrays violate Liskov? In most cases shouldn't they work the same as base numpy arrays, except with operations keeping track of masks?
Since many operations silently skip over masked values, the computation semantics are different. For example, in a regular array, sum()/size() == mean(), but with a masked array these are totally different operations. So if you have code that was written for regular arrays, but pass in a masked array, there's a solid chance that it will silently return nonsensical results.
(This is why it's better for NAs to propagate by default.)
-n
-- Nathaniel J. Smith -- https://vorpus.org
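The sum/size example above can be checked directly. This is an editorial sketch; `mean_via_sum` is a hypothetical helper standing in for any code written with plain ndarrays in mind.

```python
import numpy as np


def mean_via_sum(x):
    """Valid for plain ndarrays, where mean == sum / size."""
    return x.sum() / x.size


plain = np.array([1.0, 2.0, 3.0, 4.0])
masked = np.ma.masked_array([1.0, 2.0, 3.0, 4.0],
                            mask=[False, False, True, True])

# For a regular array the two computations agree.
print(mean_via_sum(plain), plain.mean())    # 2.5 2.5

# For a masked array, sum() skips masked entries but size still counts them,
# so the helper silently disagrees with mean().
print(mean_via_sum(masked), masked.mean())  # 0.75 1.5
```

No error or warning is raised in the masked case, which is what makes this kind of substitution failure hard to notice.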
On 9/11/18 5:09 pm, Nathaniel Smith wrote:
On Fri, Nov 9, 2018 at 4:59 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Fri, Nov 9, 2018 at 6:46 PM Nathaniel Smith <njs@pobox.com> wrote:
But matrix isn't the only problem with asanyarray. np.ma also violates Liskov. No doubt there are other problematic ndarray subclasses out there too...
Please forgive my ignorance (I don't really use mask arrays), but how specifically do masked arrays violate Liskov? In most cases shouldn't they work the same as base numpy arrays, except with operations keeping track of masks?
Since many operations silently skip over masked values, the computation semantics are different. For example, in a regular array, sum()/size() == mean(), but with a masked array these are totally different operations. So if you have code that was written for regular arrays, but pass in a masked array, there's a solid chance that it will silently return nonsensical results.
(This is why it's better for NAs to propagate by default.)
-n
Echoes of the discussions in NEPs 12, 24, 25, 26. http://www.numpy.org/neps
Matti
Hi Hameer,
I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).
I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
All the best, Marten
p.s. Note that we are already giving PendingDeprecationWarning for matrix; https://github.com/numpy/numpy/pull/10142.
On Sat, Nov 10, 2018 at 11:02 AM Matti Picus <matti.picus@gmail.com> wrote:
On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
Hi Hameer,
I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).
I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.
Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write a new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).
Changing the default aggregation behavior for the existing MaskedArray is also an option, but that would be a serious annoyance to users and a backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.asanyarray() or any new coercion function throughout the internal NumPy code base.
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
P.S. I'm just glad pandas stopped subclassing ndarray a while ago -- there's no way pandas.Series() could be fixed up to not violate Liskov :).
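The `__anyarray__` protocol discussed in this message does not exist in NumPy, so the following is only an editorial sketch of how the lookup might behave. `coerce_anyarray` and `Duck` are hypothetical names. Since methods cannot be attached to `np.ndarray` or `np.matrix` from Python, the two `isinstance` branches stand in for the default `ndarray.__anyarray__` (return self) and the `matrix` override (return an ndarray view) suggested by Marten. Under that default a MaskedArray passes through unchanged, which is exactly the case this message says would still need separate handling.

```python
import warnings

import numpy as np

# np.matrix emits PendingDeprecationWarning; silence it for this demo.
warnings.simplefilter("ignore", PendingDeprecationWarning)


def coerce_anyarray(obj):
    """Hypothetical coercion function; not a NumPy API."""
    method = getattr(type(obj), "__anyarray__", None)
    if method is not None:
        # The object opts in and decides what to return.
        return method(obj)
    if isinstance(obj, np.matrix):
        # Stand-in for a matrix.__anyarray__ override.
        return obj.view(type=np.ndarray)
    if isinstance(obj, np.ndarray):
        # Stand-in for the default ndarray.__anyarray__: return self.
        return obj
    # Everything else is coerced to a base ndarray.
    return np.asarray(obj)


class Duck:
    """Hypothetical duck array that passes itself through."""

    def __init__(self, values):
        self.values = list(values)

    def __anyarray__(self):
        return self


duck = Duck([1, 2, 3])
mat = np.matrix([[1, 2], [3, 4]])
masked = np.ma.masked_array([1, 2], mask=[False, True])

print(coerce_anyarray(duck) is duck)            # True
print(type(coerce_anyarray(mat)).__name__)      # ndarray
print(coerce_anyarray(masked) is masked)        # True
print(type(coerce_anyarray([1, 2])).__name__)   # ndarray
```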
If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable
One of the ways to fix these Liskov substitution problems is just to introduce more base classes - for instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact Liskov substitution, but np.matrix would not.
Eric
On Sat, 10 Nov 2018 at 12:17 Stephan Hoyer <shoyer@gmail.com> wrote:
On Sat, Nov 10, 2018 at 2:15 PM Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable
One of the ways to fix these liskov substitution problems is just to introduce more base classes - for instance, if we had an `NDContainer` base class with only slicing support, then masked arrays would be an exact liskov substitution, but np.matrix would not.
Eric
I've had the same thought and wouldn't be surprised if others have considered that possibility. Travis would be a good guy to ask about that.
<snip>
Chuck
On Saturday, Nov 10, 2018 at 9:16 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Sat, Nov 10, 2018 at 9:49 AM Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
Hi Hameer,
I do not think we should change `asanyarray` itself to special-case matrix; rather, we could start converting `asarray` to `asanyarray` and solve the problems that produces for matrices in `matrix` itself (e.g., by overriding the relevant function with `__array_function__`).
I think the idea of providing an `__anyarray__` method (in analogy with `__array__`) might work. Indeed, the default in `ndarray` (and thus all its subclasses) could be to let it return `self` and to override it for `matrix` to return an ndarray view.
Yes, we certainly would rather implement a matrix.__anyarray__ method (if we're already doing a new protocol) rather than special case np.matrix explicitly.
Unfortunately, per Nathaniel's comments about NA skipping behavior, it seems like we will also need MaskedArray.__anyarray__ to return something other than itself. In principle, we should probably write new version of MaskedArray that doesn't deviate from ndarray semantics, but that's a rather large project (we'd also probably want to stop subclassing ndarray).
Changing the default aggregation behavior for the existing MaskedArray is also an option but that would be a serious annoyance to users and backwards compatibility break. If the only way MaskedArray violates Liskov is in terms of NA skipping aggregations by default, then this might be viable. In practice, this would require adding an explicit skipna argument so FutureWarnings could be silenced. The plus side of this option is that it would make it easier to use np.anyarray() or any new coercion function throughout the internal NumPy code base.
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue for PyData/Sparse depending on recent-ish NumPy versions — It’ll need a lot of the recent protocols anyway, although I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
P.S. I'm just glad pandas stopped subclassing ndarray a while ago -- there's no way pandas.Series() could be fixed up to not violate Liskov :).
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue for PyData/Sparse depending on recent-ish NumPy versions — It’ll need a lot of the recent protocols anyway, although I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray. I don't know how people who currently use MaskedArray would feel about that. I would love to hear their thoughts.
On Sat, Nov 10, 2018 at 5:39 PM Stephan Hoyer <shoyer@gmail.com> wrote:
On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue for PyData/Sparse depending on recent-ish NumPy versions — It’ll need a lot of the recent protocols anyway, although I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.
Might be good to try before worrying too much - MaskedArray already overrides *a lot*; it is not at all obvious to me that things wouldn't "just work" if we bulk-replaced `asarray` with `asanyarray`. And with `__array_function__` we now have the option to fix code paths that do not work immediately. -- Marten
On 2018/11/10 12:39 PM, Stephan Hoyer wrote:
On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue for PyData/Sparse depending on recent-ish NumPy versions — It’ll need a lot of the recent protocols anyway, although I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.
I don't know how people who currently use MaskedArray would feel about that. I would love to hear their thoughts.
Thank you. I am a user of masked arrays, and have been since pre-numpy days. I introduced their extensive use in matplotlib long ago. I have been a bit concerned, indeed, that all of the discussion of modifying masked arrays seems to be by people who don't actually use them explicitly (though they might be using them without knowing it via internal operations in matplotlib, or they might be quickly getting rid of them after they are yielded by netCDF4.Dataset()).
I think that those of us who do use masked arrays recognize that they are not perfect; they have some quirks and gotchas, and one has to be careful to use numpy.ma functions instead of numpy functions in most cases. But we use them because they have real advantages over the alternatives, which are using nans and/or manually tracking independent masks throughout calculations. These advantages are largely because masked values *don't* behave like nan, *don't* propagate. This is fundamental to the design, and motivated by real-life use cases.
The proposal to add a skipna kwarg to MaskedArray looks to me like it is giving purity priority over practicality. It will force ma users to insert skipna kwargs all over the place--because the default will be contrary to the primary purposes of using masked arrays, in most cases. How many people will it actually benefit? How many people are being bitten, and how badly, by masked array behavior?
If there were a prospect of truly integrating missing/masked value handling into numpy, simplifying or phasing out numpy.ma, I would be delighted--I think it is the biggest single fundamental improvement that could be made, from the user's standpoint. I was sad to see Mark Wiebe's work in that direction come to grief.
If there are ways of gradually improving numpy.ma and its interoperability with the rest of numpy and with the proliferation of duck arrays, I'm all in favor--so long as they don't effectively wreck numpy.ma for its present intended purposes.
Eric
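The difference between masked values and nan described above can be seen in a few lines. This is only an illustrative sketch with made-up data; the calls used (`np.ma.masked_array`, `np.nanmean`) are standard NumPy.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Alternative 1: mark the bad sample with nan. It propagates through mean().
with_nan = data.copy()
with_nan[1] = np.nan
nan_mean = with_nan.mean()

# Alternative 2: mask the bad sample. Aggregations skip it by default.
masked = np.ma.masked_array(data, mask=[False, True, False, False])
masked_mean = masked.mean()  # (1 + 3 + 4) / 3

# With plain arrays the skipping has to be requested function by function.
nan_skipped = np.nanmean(with_nan)

# Elementwise operations carry the mask along instead of poisoning the data.
shifted = masked + 1
```

`nan_mean` is nan, while `masked_mean` and `nan_skipped` agree on the mean of the three valid samples. The mask on `shifted` is the same as on `masked`.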
Hi Eric,
Thanks very much for the detailed response; it is good to be reminded that `MaskedArray` is used in a package that, indeed, (nearly?) all of us use!
But I do think that those of us who have been trying to change MaskedArray are generally good at making sure the tests continue to pass, i.e., that the behaviour does not change (the main exception in the last few years was that views should be taken of masks too, not just the data).
I also think that between __array_ufunc__ and __array_function__, it has become quite easy to ensure that one no longer has to rely on `np.ma` functions, i.e., that the regular numpy functions will do the right thing. But it will need work to actually implement that.
All the best,
Marten
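As a sketch of how `__array_function__` lets a regular NumPy function do the right thing for a mask-carrying object, consider the toy class below. `MaskedDuck` is hypothetical and handles only `np.mean`; it is not how `np.ma` is implemented, and it assumes a NumPy version where the protocol is enabled by default (1.17 or later).

```python
import numpy as np


class MaskedDuck:
    """Toy duck array holding data plus a boolean mask (hypothetical)."""

    def __init__(self, data, mask):
        self.data = np.asarray(data, dtype=float)
        self.mask = np.asarray(mask, dtype=bool)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook when a public function receives a MaskedDuck.
        if func is np.mean:
            # Skip masked entries, mirroring numpy.ma semantics.
            return self.data[~self.mask].mean()
        # Anything we do not handle is reported as unsupported.
        return NotImplemented


duck = MaskedDuck([1.0, 100.0, 3.0], mask=[False, True, False])
result = np.mean(duck)  # dispatched to MaskedDuck.__array_function__
```

The caller writes plain `np.mean(duck)` and never needs a separate `np.ma`-style function; the object itself decides how the aggregation treats masked entries.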
On Sat, Nov 10, 2018 at 10:45 PM Eric Firing <efiring@hawaii.edu> wrote:
On 2018/11/10 12:39 PM, Stephan Hoyer wrote:
On Sat, Nov 10, 2018 at 2:22 PM Hameer Abbasi <einstein.edison@gmail.com> wrote:
To summarize, I think these are our options:
1. Change the behavior of np.asanyarray() to check for an __anyarray__() protocol. Change np.matrix.__anyarray__() to return a base numpy array (this is a minor backwards compatibility break, but probably for the best). Start issuing a FutureWarning for any MaskedArray operations that violate Liskov and add a skipna argument that in the future will default to skipna=False.
2. Introduce a new coercion function, e.g., np.duckarray(). This is the easiest option because we don't need to clean up NumPy's existing ndarray subclasses.
My vote is still for 1. I don’t have an issue for PyData/Sparse depending on recent-ish NumPy versions — It’ll need a lot of the recent protocols anyway, although I could be convinced otherwise if major package devs (scikits, SciPy, Dask) were to weigh in and say they’ll jump on it (which seems unlikely given SciPy’s policy to support old NumPy versions).
I agree that option (1) is fine for PyData/sparse. The bigger issue is that this change should be conditional on making breaking changes (at least raising FutureWarning for now) to np.ma.MaskedArray.
I don't know how people who currently use MaskedArray would feel about that. I would love to hear their thoughts.
Thank you. I am a user of masked arrays, and have been since pre-numpy days. I introduced their extensive use in matplotlib long ago. I have been a bit concerned, indeed, that all of the discussion of modifying masked arrays seems to be by people who don't actually use them explicitly (though they might be using them without knowing it via internal operations in matplotlib, or they might be quickly getting rid of them after they are yielded by netCDF4.Dataset()).
I think that those of us who do use masked arrays recognize that they are not perfect; they have some quirks and gotchas, and one has to be careful to use numpy.ma functions instead of numpy functions in most cases. But we use them because they have real advantages over the alternatives, which are using nans and/or manually tracking independent masks throughout calculations. These advantages are largely because masked values *don't* behave like nan, *don't* propagate. This is fundamental to the design, and motivated by real-life use cases.
The proposal to add a skipna kwarg to MaskedArray looks to me like it is giving purity priority over practicality. It will force ma users to insert skipna kwargs all over the place--because the default will be contrary to the primary purposes of using masked arrays, in most cases. How many people will it actually benefit? How many people are being bitten, and how badly, by masked array behavior?
If there were a prospect of truly integrating missing/masked value handling into numpy, simplifying or phasing out numpy.ma, I would be delighted--I think it is the biggest single fundamental improvement that could be made, from the user's standpoint. I was sad to see Mark Wiebe's work in that direction come to grief.
If there are ways of gradually improving numpy.ma and its interoperability with the rest of numpy and with the proliferation of duck arrays, I'm all in favor--so long as they don't effectively wreck numpy.ma for its present intended purposes.
Eric -- thank you for sharing your perspective!
I guess it should not be surprising that the semantics of MaskedArray intentionally deviate from the semantics of base NumPy arrays. This deviation is fortunately less severe than the deviations in the behavior of np.matrix, but it still presents some difficulties for duck typing. We're in a position to reduce (but still not eliminate) these differences with new protocols like __array_function__.
I think Nathaniel actually summarized these issues pretty well in NEP 16 (http://www.numpy.org/neps/nep-0016-abstract-array.html). If we want a coercion function that guarantees an object is a "full duck array", then it can't pass on either np.matrix or MaskedArray in their current state. Anything less than full compatibility provides a shaky foundation for use in downstream projects or inside NumPy itself.
In theory (certainly if we were starting from scratch) it would make sense to make asabstractarray() pass on any ndarray subclass, but this would require willingness to make breaking changes to both np.matrix and MaskedArray.
I would suggest adopting a variation of the proposal in NEP 16, except using a protocol rather than an abstract base class per NEP 22, e.g.,
    # names still to be determined
    from numpy import asarray

    def asabstractarray(array, dtype):
        if hasattr(array, '__abstractarray__'):
            # bound method call: the instance is passed implicitly
            return array.__abstractarray__(dtype=dtype)
        return asarray(array, dtype)
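The protocol-based coercion suggested above could be exercised roughly as follows. The names `asabstractarray` and `__abstractarray__` are the tentative ones from the message; the `DuckArray` class and the `dtype=None` default are assumptions added for this sketch, and nothing here is part of NumPy.

```python
import numpy as np


def asabstractarray(array, dtype=None):
    """Tentative coercion function; the name is still to be determined."""
    if hasattr(array, "__abstractarray__"):
        return array.__abstractarray__(dtype=dtype)
    # No protocol: fall back to a base ndarray, dropping any subclass.
    return np.asarray(array, dtype=dtype)


class DuckArray:
    """Minimal hypothetical duck array wrapping an ndarray."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __abstractarray__(self, dtype=None):
        # Return self when no conversion is needed, else a converted copy.
        if dtype is None or np.dtype(dtype) == self._data.dtype:
            return self
        return DuckArray(self._data.astype(dtype))


duck = DuckArray([1, 2, 3])
same = asabstractarray(duck)
from_matrix = asabstractarray(np.matrix([[1, 2]]))
from_masked = asabstractarray(np.ma.masked_array([1, 2], mask=[False, True]))
```

Objects that implement the protocol pass through, while `np.matrix` and `MaskedArray`, which do not opt in here, are coerced to base ndarrays. That matches the point that neither can be passed on in its current state.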
participants (8)
- Charles R Harris
- Eric Firing
- Eric Wieser
- Hameer Abbasi
- Marten van Kerkwijk
- Matti Picus
- Nathaniel Smith
- Stephan Hoyer