Re: [Numpy-discussion] is __array_ufunc__ ready for prime-time?
Thank you all kindly for your responses! Based on your encouragement, I will pursue an ndarray subclass / __array_ufunc__ implementation. I had been toying with np.set_numeric_ops, which is less than ideal (for example, np.ndarray.around segfaults if I use set_numeric_ops in any way). A second question: very broadly speaking, how much 'pain' can I expect trying to use an np.ndarray subclass in the broader python scientific computing ecosystem, and is there general consensus that projects 'should' support ndarray subclasses? Will
We spent a *long time* sorting out the messy details of __array_ufunc__ [1], especially for handling interactions between different types, e.g., between numpy arrays, non-numpy array-like objects, builtin Python objects, objects that override arithmetic to act in non-numpy-like ways, and of course subclasses of all the above.
We hope that we have it right this time, but as we wrote in the NumPy 1.13 release notes "The API is provisional, we do not yet guarantee backward compatibility as modifications may be made pending feedback." That said, let's give it a try!
If any changes are necessary, I expect it would likely relate to how we handle interactions between different types. That's where we spent the majority of the design effort, but debate is a poor substitute for experience. I would be very surprised if the basic cases (one argument or two arguments of the same type) need any changes.
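For reference, a minimal sketch of the basic single-type case -- an ndarray subclass that carries a label through ufunc calls (illustrative only, not a recommended production implementation):
```
import numpy as np

class LabeledArray(np.ndarray):
    """Illustrative subclass that carries a text label through ufunc calls."""

    def __new__(cls, data, label=''):
        obj = np.asarray(data).view(cls)
        obj.label = label
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.label = getattr(obj, 'label', '')

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap our subclass, let the ufunc do its work on plain ndarrays,
        # then re-wrap the result and re-attach the label.
        raw = [np.asarray(x) if isinstance(x, LabeledArray) else x for x in inputs]
        result = getattr(ufunc, method)(*raw, **kwargs)
        if isinstance(result, np.ndarray):
            result = result.view(LabeledArray)
            result.label = self.label
        return result

a = LabeledArray([1.0, 2.0, 3.0], label='flux')
print((a + 1).label, np.exp(a).label)   # flux flux
```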
Best, Stephan
On Tue, Oct 31, 2017 at 3:15 PM, William Sheffler <willsheffler@gmail.com> wrote:
A second question: very broadly speaking, how much 'pain' can I expect trying to use an np.ndarray subclass in the broader python scientific computing ecosystem, and is there general consensus that projects 'should' support ndarray subclasses?
That depends on what the ndarray subclass does, which methods it overrides, and what the function uses. My guess is that most general code uses np.asarray and then assumes it behaves like an ndarray; the actual behavior will be whatever the non-ufunc operations do with it, e.g. what does np.linalg.pinv(my_array) @ np.ones(len(my_array)) return? Josef
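As a concrete illustration of that point (the subclass and the function here are hypothetical, purely for demonstration):
```
import numpy as np

class MyArray(np.ndarray):
    """Hypothetical ndarray subclass carrying some metadata."""
    pass

my_array = np.eye(3).view(MyArray)

def library_func(x):
    x = np.asarray(x)      # the subclass is gone from here on
    return x.sum(axis=0)

print(type(library_func(my_array)))   # <class 'numpy.ndarray'>

# What a non-ufunc such as np.linalg.pinv returns for a subclass depends
# entirely on its internal implementation (asarray vs asanyarray, wrapping, ...).
result = np.linalg.pinv(my_array) @ np.ones(len(my_array))
print(type(result))
```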
From my experience with Quantity, routines that properly ducktype work well, those that feel the need to accept list and blatantly do `asarray` do not - even if in many cases they would have worked if they used `asanyarray`... But there are lots of nice surprises, with, e.g., `np.fft.fftfreq` just working as one would hope. Anyway, bottom line, I think you should let this stop you from trying only if you know something important does not work. -- Marten
I think the biggest issues could be resolved if __array_concatenate__ were finished. Unfortunately I don't feel like I can take that on right now. See Ryan May's talk at scipy about using an ndarray subclass for units and the issues he's run into: https://www.youtube.com/watch?v=qCo9bkT9sow On Wed, Nov 1, 2017 at 5:50 PM, Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:
From my experience with Quantity, routines that properly ducktype work well, those that feel the need to accept list and blatantly do `asarray` do not - even if in many cases they would have worked if they used `asanyarray`... But there are lots of nice surprises, with, e.g., `np.fft.fftfreq` just working as one would hope. Anyway, bottom line, I think you should let this stop you from trying only if you know something important does not work. -- Marten
On Wed, Nov 1, 2017 at 6:55 PM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
I think the biggest issues could be resolved if __array_concatenate__ were finished. Unfortunately I don't feel like I can take that on right now.
See Ryan May's talk at scipy about using an ndarray subclass for units and the issues he's run into:
Interesting talk, but I don't see how general library code should know what units the output has. For example, if the units are some flows per unit of time and we average, sum or integrate over time, then what are the new units? (e.g. pandas time aggregation) What are the units of covariance or correlation between two variables with the same units, and what are they between variables with different units?

How do you concatenate and operate arrays with different units?

Interpolation or prediction would work using the existing units.

Partially related: statsmodels uses a wrapper for pandas Series and DataFrames and tries to preserve the index when possible, making up a new DataFrame or Series if the existing index doesn't apply. E.g. predicted values and residuals are in terms of the originally provided index, and could also get the original units assigned. That would also be possible for prediction confidence intervals. But for the rest, see above.
Using pint:

>>> x
<Quantity([0 1 2 3 4], 'meter')>
>>> x / x
<Quantity([ nan 1. 1. 1. 1.], 'dimensionless')>

>>> x / (1 + x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 669, in __add__
    return self._add_sub(other, operator.add)
  File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 580, in _add_sub
    raise DimensionalityError(self._units, 'dimensionless')
pint.errors.DimensionalityError: Cannot convert from 'meter' to 'dimensionless'

>>> np.exp(x)
raises pint.errors.DimensionalityError: Cannot convert from 'meter' ([length]) to 'dimensionless' (dimensionless)
Josef
Hi Josef,

astropy's Quantity is well developed and would give similar results to pint; all those results make sense if one wants to have consistent units. General library code will actually do the right thing as long as it just uses normal mathematical operations with ufuncs - and as long as it just duck types! - the unit code will then override and properly propagate units to outputs, as can be seen in this example:
```
import astropy.units as u
np.fft.fftfreq(8, 1*u.min)
# <Quantity [ 0. , 0.125, 0.25 , 0.375,-0.5 ,-0.375,-0.25 ,-0.125] 1 / min>
np.fft.fftfreq(8, 1*u.min).var()
# <Quantity 0.08203125 1 / min2>
```
for example if units are some flows per unit of time and we average, sum or integrate over time, then what are the new units? (e.g. pandas time aggregation)
The units module will force you to take into account `dt`! This is in fact one reason why it is so powerful. So, your example might go something like:
```
flow = [1., 1.5, 1.5] * u.g / u.s
dt = [0.5, 0.5, 1.] * u.hr
np.sum(flow * dt)
# <Quantity 2.75 g h / s>
np.sum(flow * dt).to(u.kg)
# <Quantity 9.9 kg>
```
How do you concatenate and operate arrays with different units?
This is where Nathaniel's `__array_concatenate__` would come in. For regular arrays it is fine to just concatenate, but for almost anything else you need a different approach. For quantities, the most logical one would be to first create an empty array of the right size with the unit of, e.g., the first part to be concatenated, and then set sections to the input quantities (where the setter does unit conversion and will fail if that is not possible).

All the best,

Marten

p.s. A fun subject is what to do with logarithmic units, such as the magnitudes in astronomy... We have a module for that as well: http://docs.astropy.org/en/latest/units/logarithmic_units.html
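A rough sketch of that allocate-then-assign idea (illustrative only: the helper name is invented here, and astropy's Quantity stands in for any unit-aware ndarray subclass):
```
import numpy as np
import astropy.units as u

def quantity_concatenate(arrays, axis=0):
    """Sketch: concatenate unit-aware arrays by allocating the result in the
    unit of the first input and letting __setitem__ convert (or fail)."""
    first = arrays[0]
    out_shape = list(first.shape)
    out_shape[axis] = sum(a.shape[axis] for a in arrays)
    result = np.empty(out_shape) * first.unit   # result carries the first unit
    start = 0
    for a in arrays:
        stop = start + a.shape[axis]
        index = [slice(None)] * result.ndim
        index[axis] = slice(start, stop)
        # Quantity.__setitem__ converts to result's unit, or raises if it can't
        result[tuple(index)] = a
        start = stop
    return result

print(quantity_concatenate([[1., 2.] * u.m, [30., 40.] * u.cm]))
# [1.  2.  0.3 0.4] m  -- the cm values were converted on assignment
```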
For example, "will fail if that is not possible" rules out inhomogeneous arrays (analogous to structure dtypes) How to you get a vander matrix for something simple like a polynomial fit? x[:, None] ** np.arange(3)
Similar: scipy.special has ufuncs; what units are those?

Most code that I know (i.e. scipy.stats and statsmodels) does not use only "normal mathematical operations with ufuncs". I guess there are a lot of "abnormal" mathematical operations where just simply propagating the units will not work.

Aside: the problem is more general, also for other data structures. E.g. statsmodels for most parts uses only numpy ndarrays inside the algorithms and computations because that provides well defined behavior (e.g. pandas behaved too differently in many cases). I don't have much of an idea yet about how to change the infrastructure to allow the use of dask arrays, sparse matrices and similar, and possibly automatic differentiation.

Josef
On Thu, Nov 2, 2017 at 9:45 AM <josef.pktd@gmail.com> wrote:
similar, scipy.special has ufuncs what units are those?
Most code that I know (i.e. scipy.stats and statsmodels) does not use only "normal mathematical operations with ufuncs" I guess there are a lot of "abnormal" mathematical operations where just simply propagating the units will not work.
Aside: The problem is more general also for other datastructures. E.g. statsmodels for most parts uses only numpy ndarrays inside the algorithm and computations because that provides well defined behavior. (e.g. pandas behaved too differently in many cases) I don't have much idea yet about how to change the infrastructure to allow the use of dask arrays, sparse matrices and similar and possibly automatic differentiation.
This is the exact same reason why pandas and xarray do not support wrapping arbitrary ndarray subclasses or duck array types. The operations we use internally (on numpy.ndarray objects) may not be what you would expect externally, and may even be implementation details not considered part of the public API. For example, in xarray we use numpy.nanmean() or bottleneck.nanmean() instead of numpy.mean().

For NumPy and xarray, I think we could (and should) define an interface to support subclasses and duck types for generic operations for core use-cases. My main concern with subclasses / duck-arrays is undefined/untested behavior, especially where we might silently give the wrong answer or trigger some undesired operation (e.g., loading a lazily computed array into memory) rather than raising an informative error. Leaking implementation details is another concern: we have already had several cases in NumPy where a function only worked on a subclass if a particular method was called internally, and broke when that was changed.
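A stripped-down illustration of that kind of internal dispatch (not xarray's actual code; bottleneck is optional here) -- whatever a subclass or duck array does to its own mean() never enters the picture once the library has chosen its internal routine:
```
import numpy as np

def _mean(values, axis=None):
    # Simplified stand-in for library-internal dispatch: prefer bottleneck's
    # nanmean when available, otherwise fall back to numpy's.
    try:
        import bottleneck as bn
        return bn.nanmean(values, axis=axis)
    except ImportError:
        return np.nanmean(values, axis=axis)

data = np.array([[1.0, np.nan], [3.0, 4.0]])
print(_mean(data, axis=0))   # [2. 4.] -- NaN-aware, regardless of what .mean() would do
```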
On Thu, Nov 2, 2017 at 2:37 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Thu, Nov 2, 2017 at 9:45 AM <josef.pktd@gmail.com> wrote:
similar, scipy.special has ufuncs what units are those?
Most code that I know (i.e. scipy.stats and statsmodels) does not use only "normal mathematical operations with ufuncs" I guess there are a lot of "abnormal" mathematical operations where just simply propagating the units will not work.
Aside: The problem is more general also for other datastructures. E.g. statsmodels for most parts uses only numpy ndarrays inside the algorithm and computations because that provides well defined behavior. (e.g. pandas behaved too differently in many cases) I don't have much idea yet about how to change the infrastructure to allow the use of dask arrays, sparse matrices and similar and possibly automatic differentiation.
This is the exact same reason why pandas and xarray do not support wrapping arbitrary ndarray subclasses or duck array types. The operations we use internally (on numpy.ndarray objects) may not be what you would expect externally, and may even be implementation details not considered part of the public API. For example, in xarray we use numpy.nanmean() or bottleneck.nanmean() instead of numpy.mean().
For NumPy and xarray, I think we could (and should) define an interface to support subclasses and duck types for generic operations for core use-cases. My main concern with subclasses / duck-arrays is undefined/untested behavior, especially where we might silently give the wrong answer or trigger some undesired operation (e.g., loading a lazily computed into memory) rather than raising an informative error. Leaking implementation details is another concern: we have already had several cases in NumPy where a function only worked on a subclass if a particular method was called internally, and broke when that was changed.
Would this issue be ameliorated given Nathaniel's proposal to try to move away from subclasses and towards storing data in dtypes? Or would that just mean that xarray would need to ban dtypes it doesn't know about?
Numpy already does support a specific unit, datetime64 and timedelta64, I think through that very mechanism. It's also probably the most complicated unit, since at least there is no such thing as leap meters. And it works well and is very useful IMHO.
My 2¢ here is that all code should feel free to assume a certain type of input, as long as it is documented properly, but there is no reason to enforce that by, e.g., putting `asarray` everywhere. Then, for some pieces, ducktypes and subclasses will just work like magic, and uses you might never have foreseen become possible. For others, whoever wants to use them has to do the work (and it is up to package maintainers to decide whether or not to accept PRs that implement hooks, etc.)

I do see the argument that this way one becomes constrained in the internal implementation, as a change may break an outward-looking function, but while at times this may be inconvenient, in my experience at others it may just make one realize an even better implementation is possible. But then, I really like duck-typing...

-- Marten
Duck typing is great and all for classes that implement some or all of the ndarray interface... but remember the main reason for asarray() and asanyarray(): to automatically promote lists and tuples and other "array-likes" to ndarrays. Ignoring the use-case of lists of lists is problematic at best.

Ben Root
On Thu, Nov 2, 2017 at 5:09 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
Duck typing is great and all for classes that implement some or all of the ndarray interface.... but remember what the main reason for asarray() and asanyarray(): to automatically promote lists and tuples and other "array-likes" to ndarrays. Ignoring the use-case of lists of lists is problematic at best.
How I wish numpy had never gone there! Convenience for what, exactly? For the user not having to put `array()` around the list themselves? We slow down everything for that? And even now we're trying to remove some of the cases where both tuples and lists are allowed. Grrrrrr. Of course, we are well and truly stuck with it - now it is one of the main reasons to subclass rather than duck-type... Anyway, water under the bridge... -- Marten
One problem in general is that there is no protocol about which operations are implemented in a numpy-ndarray-equivalent way in those ducks, i.e. whether they quack in a compatible way.

One small example: pandas' standard deviation, std, used ddof=1 by default and didn't have an option to override it, instead of the ddof=0 that numpy uses. So even though we could call the std method of the ducks, the t-test results would be a bit different, and visibly different in small samples, depending on the type of the data. A possible alternative would be to compute std from scratch and forgo the available function or method.

I tried once in the scipy.stats.zscore function to be agnostic about the type and not use asarray; it's a simple operation, but still it required special handling of numpy matrices because they preserve the number of dimensions in reduce operations. After more than a few lines it is difficult to keep track of what type is now used.

Another subclass that is often broken by default code is masked arrays, because asarray throws away the mask. But asanyarray wouldn't always work either, because the mask needs code for handling the masked values. For example, scipy.stats ended up with separate functions for masked arrays.

Josef
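A quick illustration of the ddof mismatch (pandas may since have grown an option to change the default):
```
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]

print(np.std(data))            # 1.118... (ddof=0, population standard deviation)
print(pd.Series(data).std())   # 1.290... (ddof=1, sample standard deviation)
print(np.std(data, ddof=1))    # 1.290... matches pandas once ddof is aligned
```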
On Thu, Nov 2, 2017 at 12:42 PM Nathan Goldbaum <nathan12343@gmail.com> wrote:
Would this issue be ameliorated given Nathaniel's proposal to try to move away from subclasses and towards storing data in dtypes? Or would that just mean that xarray would need to ban dtypes it doesn't know about?
Yes, I think custom dtypes would definitely help. Custom dtypes have a well contained interface, so lots of operations (e.g., concatenate, reshaping, indexing) are guaranteed to work in a dtype independent way. If you try to do an unsupported operation for such a dtype (e.g., np.datetime64), you will generally get a good error message about an invalid dtype.

In contrast, you can overload a subclass with totally arbitrary semantics (e.g., np.matrix) and of course for duck-types as well.

This makes a big difference for libraries like dask or xarray, which need a standard interface to guarantee they do the right thing. I'm pretty sure we can wrap a custom dtype ndarray with units, but there's no way we're going to support np.matrix without significant work. It's hard to know which is which without well defined interfaces.
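The existing datetime64 dtype already behaves this way: dtype-independent operations pass through untouched, while unsupported ones fail with a clear dtype-related error. A small demonstration:
```
import numpy as np

t = np.array(['2017-11-01', '2017-11-02'], dtype='datetime64[D]')

# dtype-independent operations just work and preserve the dtype
print(np.concatenate([t, t]).dtype)   # datetime64[D]
print(t.reshape(2, 1).dtype)          # datetime64[D]

# unsupported operations fail loudly on the dtype instead of guessing
try:
    t * 2
except TypeError as err:
    print(err)
```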
Ah, but what if the dtype modifies the interface? That might sound evil, but it's something that's been proposed. For example, if I wanted to replace yt's YTArray in a backward compatible way with a dtype and just use plain ndarrays everywhere, the dtype would need to *at least* modify ndarray's API, adding e.g. to(), convert_to_unit(), a units attribute, and several other things. Of course if I don't care about backward compatibility I can just do all of these operations on the dtype object itself.

However, I suspect whatever implementation of custom dtypes gets added to numpy, it will have the property that it can act like an arbitrary ndarray subclass; otherwise libraries like yt, Pint, metpy, and astropy won't be able to switch to it.

-Nathan
I guess my argument boils down to it being better to state that a function only accepts arrays and happily let it break on, e.g., matrix, than use `asarray` to make a matrix into an array even though it really isn't. I do like the dtype ideas, but think I'd agree they're likely to come with their own problems. But just making new numerical types possible is interesting. -- Marten
Maybe the best of both worlds would require explicit opt-in for classes that shouldn't be coerced, e.g., xarray.register_data_type(MyArray) or maybe better yet ;) xarray.void_my_nonexistent_warranty_its_my_fault_if_my_buggy_duck_array_breaks_everything(MyArray)
On 2 Nov 2017, at 22:39, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
I guess my argument boils down to it being better to state that a function only accepts arrays and happily let it break on, e.g., matrix, than use `asarray` to make a matrix into an array even though it really isn’t.
I would support this attitude: the user can always call `asarray` when passing their data into the function if necessary, and then they know upfront what the consequences will be.

For my own ndarray subclass, I want it to behave exactly as a standard ndarray, but in addition I add some metadata and some functions that act on it, for example an affine transform and functions to convert between coordinate systems. The current numpy system of overriding __array_wrap__, __array_finalize__ and __new__ is great for letting the subclass and metadata propagate through most basic operations. The problem is that many functions using `asarray` strip out all of this metadata and return a bare ndarray. My current solution is to implement an `inherit` method on my subclass which converts an ndarray and copies back all the metadata, which often looks like this:

spec_data = data.inherit(np.fft.fft(data))

To use composition instead of inheritance would require me to forward every part of the ndarray API as-is, which would be a great deal of work and in nearly every case would only achieve the same results as replacing `asarray` by `asanyarray` in various library functions. I don’t want to change the behaviour of the existing class, just to add some data and methods, and I can’t imagine I am alone in that.

Ben
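A bare-bones sketch of that kind of subclass (the class name, the metadata field and the `inherit` helper below are illustrative, not the actual code):
```
import numpy as np

class MetaArray(np.ndarray):
    """Illustrative ndarray subclass that carries a metadata dict."""

    def __new__(cls, input_array, meta=None):
        obj = np.asarray(input_array).view(cls)
        obj.meta = dict(meta or {})
        return obj

    def __array_finalize__(self, obj):
        # Called for views, slices and ufunc outputs: propagate metadata if present.
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', {})

    def inherit(self, arr):
        """Re-attach this array's metadata to a bare ndarray result."""
        out = np.asarray(arr).view(type(self))
        out.meta = self.meta
        return out

data = MetaArray(np.arange(8.0), meta={'affine': 'identity'})
spec_data = data.inherit(np.fft.fft(data))   # fft coerces with asarray, so metadata is lost
print(type(spec_data).__name__, spec_data.meta)
```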
Hi Ben,

You just summarized excellently why I'm on a quest to change `asarray` to `asanyarray` within numpy (or at least add a `subok` keyword for things like `broadcast_arrays`)! Obviously, this covers only ndarray subclasses, not duck types, though I guess in principle one could use the ABC registration mechanism mentioned above to let those types pass through.

Returning to the original topic of the thread: with `__array_ufunc__` it now is even easier to keep track of your metadata for ufuncs, and it has become possible to massage input data before the ufunc is called (rather than just the output).

All the best,

Marten
On Sat, Nov 4, 2017 at 6:47 AM, Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:
You just summarized excellently why I'm on a quest to change `asarray` to `asanyarray` within numpy
+1 -- we should all be using asanyarray() most of the time. However, a couple of notes: asarray() pre-dates asanyarray() by a LOT. asanyarray() was added to better handle subclasses, but there is a lot of legacy code out there. And legacy coders -- I know that I still usually use asarray without thinking about it -- sorry!

Obviously, this covers only ndarray subclasses, not duck types, though I guess in principle one could use the ABC registration mechanism mentioned above to let those types pass through.

The trick there is: what does it mean to be duck-typed to an ndarray? For many applications it's critical that the C API be the same, so duck-typing doesn't really apply. And in other cases, it only needs to support a small portion of the numpy API. In essence, there are an almost infinite number of possible ABCs for an ndarray...

For my part, I've been known to write custom "array_like" code -- it checks for the handful of methods I know I need to use, and I test it against the small handful of duck-typed arrays that I know I want my code to work with. Klunky, and maybe we could come up with a standard way to do it and include that in numpy, but I'm not sure that ABCs are the way to do it.

-CHB
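The pattern Chris describes might look something like this (purely illustrative; the helper and the toy class are made up):
```
import numpy as np

def is_arraylike_enough(obj, required=('shape', 'dtype', 'reshape', '__getitem__')):
    """Klunky duck-type check: does obj expose the handful of attributes we use?"""
    return all(hasattr(obj, name) for name in required)

class FakeDuck:
    shape = (3,)
    dtype = np.dtype(float)
    def reshape(self, *shape):
        return self
    def __getitem__(self, idx):
        return 0.0

print(is_arraylike_enough(np.ones(3)))   # True
print(is_arraylike_enough(FakeDuck()))   # True
print(is_arraylike_enough([1, 2, 3]))    # False -- a plain list has no .shape
```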
On Mon, Nov 6, 2017 at 3:18 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Klunky, and maybe we could come up with a standard way to do it and include that in numpy, but I'm not sure that ABCs are the way to do it.
ABCs are *absolutely* the way to go about it. It's the only way baked into the Python language itself that allows you to register a class for purposes of `isinstance` without needing to subclass--i.e. duck-typing.

What's needed, though, is not just a single ABC. Some thought and design needs to go into segmenting the ndarray API to declare certain behaviors, just like was done for collections:

https://docs.python.org/3/library/collections.abc.html

You don't just have a single ABC declaring a collection, but rather "I am a mapping" or "I am a mutable sequence". It's more of a pain for developers to properly specify things, but this is not a bad thing to actually give code some thought.

Ryan

-- Ryan May
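As a sketch of what such segmented ABCs could look like (all of the names here are invented for illustration):
```
from abc import ABC, abstractmethod

import numpy as np

class ShapedArray(ABC):
    """Hypothetical ABC: 'I expose array-like shape attributes.'"""
    @property
    @abstractmethod
    def shape(self): ...

    @property
    @abstractmethod
    def ndim(self): ...

class ReshapableArray(ShapedArray):
    """Hypothetical ABC: 'I also support reshaping.'"""
    @abstractmethod
    def reshape(self, *shape): ...

# Third-party array types opt in by registration -- no subclassing required.
ReshapableArray.register(np.ndarray)

print(isinstance(np.ones((2, 3)), ShapedArray))       # True, via registration
print(isinstance(np.ones((2, 3)), ReshapableArray))   # True
```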
On Mon, Nov 6, 2017 at 2:29 PM Ryan May <rmay31@gmail.com> wrote:
On Mon, Nov 6, 2017 at 3:18 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Klunky, and maybe we could come up with a standard way to do it and include that in numpy, but I'm not sure that ABCs are the way to do it.
ABCs are *absolutely* the way to go about it. It's the only way baked into the Python language itself that allows you to register a class for purposes of `isinstance` without needing to subclass--i.e. duck-typing.
What's needed, though, is not just a single ABC. Some thought and design needs to go into segmenting the ndarray API to declare certain behaviors, just like was done for collections:
https://docs.python.org/3/library/collections.abc.html
You don't just have a single ABC declaring a collection, but rather "I am a mapping" or "I am a mutable sequence". It's more of a pain for developers to properly specify things, but this is not a bad thing to actually give code some thought.
I agree, it would be nice to nail down a hierarchy of duck-arrays, if possible. Although, there are quite a few options, so I don't know how doable this is. Any interest in opening up an issue on GitHub to discuss?
Well, to get the ball rolling a bit, the key thing that matplotlib needs to know is whether `shape`, `reshape`, `size`, broadcasting, and logical indexing are respected. So, I see three possible ABCs here: one for attribute access (things like `shape` and `size`), another for shape manipulations (broadcasting and reshape, and assignment to .shape), and then a third ABC for indexing support, although I am not sure how that could get implemented...

Cheers!
Ben Root
Hi Benjamin,

For the shapes and reshaping, I wrote a ShapedLikeNDArray mixin/ABC for astropy, which may be a useful starting point as it also provides a way to implement the methods ndarray uses to reshape and get elements: see https://github.com/astropy/astropy/blob/master/astropy/utils/misc.py#L863

All the best,

Marten
On Mon, Nov 6, 2017 at 4:28 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
What's needed, though, is not just a single ABC. Some thought and design needs to go into segmenting the ndarray API to declare certain behaviors, just like was done for collections:
https://docs.python.org/3/library/collections.abc.html
You don't just have a single ABC declaring a collection, but rather "I am a mapping" or "I am a mutable sequence". It's more of a pain for developers to properly specify things, but this is not a bad thing to actually give code some thought.
I agree, it would be nice to nail down a hierarchy of duck-arrays, if possible. Although, there are quite a few options, so I don't know how doable this is.
Exactly -- there are an exponential number of options...
Well, to get the ball rolling a bit, the key thing that matplotlib needs to know is if `shape`, `reshape`, 'size', broadcasting, and logical indexing is respected. So, I see three possible abc's here: one for attribute access (things like `shape` and `size`) and another for shape manipulations (broadcasting and reshape, and assignment to .shape).
I think we're going to get into a string of ABCs: ArrayLikeForMPL_ABC, etc, etc.....
And then a third abc for indexing support, although, I am not sure how that could get implemented...
This is the really tricky one -- all ABCs really check is the existence of methods -- making sure they behave the same way is up to the developer of the ducktype. Which is OK, but will require discipline.

But indexing, specifically fancy indexing, is another matter -- I'm not sure if there is even a way with an ABC to check for what types of indexing are supported, but we'd still have the problem of whether the semantics are the same!

For example, I work with netcdf variable objects, which are partly duck-typed as ndarrays, but I think n-dimensional fancy indexing works differently... how in the world do you detect that with an ABC???

For the shapes and reshaping, I wrote a ShapedLikeNDArray mixin/ABC for astropy, which may be a useful starting point as it also provides a way to implement the methods ndarray uses to reshape and get elements: see https://github.com/astropy/astropy/blob/master/astropy/utils/misc.py#L863

Sounds like a good starting point for discussion.

-CHB
On Tue, Nov 7, 2017 at 1:20 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Exactly -- there are an exponential amount of options...
Well, to get the ball rolling a bit, the key thing that matplotlib needs to know is if `shape`, `reshape`, 'size', broadcasting, and logical indexing is respected. So, I see three possible abc's here: one for attribute access (things like `shape` and `size`) and another for shape manipulations (broadcasting and reshape, and assignment to .shape).
I think we're going to get into an string of ABCs:
ArrayLikeForMPL_ABC
etc, etc.....
Only if you try to provide perfectly-sized options for every occasion--but that's not how we do things in (sane) software development. You provide a few options that optimize the common use cases, and you don't try to cover everything--let client code figure out the right combination from the primitives you provide. One can always just inherit/register *all* the ABCs if need be.

The status quo is that we have 1 interface that covers everything from multiple dims and shape to math and broadcasting to the entire __array__ interface. Even breaking that up into the 3 "obvious" chunks would be a massive improvement.

I just don't want to see this effort bog down into "this is so hard". Getting it perfect is hard; getting it useful is much easier. It's important to note that we can always break up/combine existing ABCs into other ones later.
And then a third abc for indexing support, although, I am not sure how
that could get implemented...
This is the really tricky one -- all ABCs really check is the existence of methods -- making sure they behave the same way is up to the developer of the ducktype.
which is K, but will require discipline.
But indexing, specifically fancy indexing, is another matter -- I'm not sure if there even a way with an ABC to check for what types of indexing are support, but we'd still have the problem with whether the semantics are the same!
For example, I work with netcdf variable objects, which are partly duck-typed as ndarrays, but I think n-dimensional fancy indexing works differently... how in the world do you detect that with an ABC???
Even documenting expected behavior as part of these ABCs would go a long way towards helping standardize behavior. Another idea would be to put together a conformance test suite as part of this effort, in lieu of some kind of run-time checking of behavior (which would be terrible). That would help developers of other "ducks" check that they're doing the right things. I'd imagine the existing NumPy test suite would largely cover this. Ryan -- Ryan May
On Tue, Nov 7, 2017 at 12:23 PM Chris Barker <chris.barker@noaa.gov> wrote:
And then a third abc for indexing support, although, I am not sure how
that could get implemented...
This is the really tricky one -- all ABCs really check is the existence of methods -- making sure they behave the same way is up to the developer of the ducktype.
which is K, but will require discipline.
But indexing, specifically fancy indexing, is another matter -- I'm not sure if there even a way with an ABC to check for what types of indexing are support, but we'd still have the problem with whether the semantics are the same!
For example, I work with netcdf variable objects, which are partly duck-typed as ndarrays, but I think n-dimensional fancy indexing works differently... how in the world do you detect that with an ABC???
We recently worked out a hierarchy of indexing types for xarray. To a crude approximation, we have:

- "Basic" indexing support for slices and integers. Nearly every array type satisfies this.
- "Outer" or "orthogonal" indexing with slices, integers and 1D arrays. This is what netCDF4-Python and Fortran/MATLAB support.
- "Vectorized" indexing with broadcasting and multi-dimensional indexers. NumPy supports a generalization of this, but I would not wish the edge cases involving mixed slices/arrays upon anyone.
- "Logical" indexing by a boolean array with the same shape.
- "Exactly like NumPy" for subclasses or wrappers around NumPy arrays.

There are some ambiguities in this, but that's what specs are for. For most applications, we probably don't need most of these: ABCs for "Basic", "Logical" and "Exactly like NumPy" would go a long way.
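For a concrete sense of how "outer" and "vectorized" indexing differ (plain NumPy, shown only to illustrate the two semantics):
```
import numpy as np

a = np.arange(12).reshape(3, 4)

# "Outer"/orthogonal indexing: rows [0, 2] crossed with columns [1, 3]
print(a[np.ix_([0, 2], [1, 3])])
# [[ 1  3]
#  [ 9 11]]

# "Vectorized" (NumPy fancy) indexing: the index arrays are broadcast
# against each other, picking the elements (0, 1) and (2, 3)
print(a[[0, 2], [1, 3]])
# [ 1 11]

# "Logical" indexing with a boolean mask of the same shape
print(a[a % 2 == 0])
# [ 0  2  4  6  8 10]
```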
On Nov 6, 2017 4:19 PM, "Chris Barker" <chris.barker@noaa.gov> wrote: On Sat, Nov 4, 2017 at 6:47 AM, Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:
You just summarized excellently why I'm on a quest to change `asarray` to `asanyarray` within numpy
+1 -- we should all be using asanyarray() most of the time.

The problem is that if you use 'asanyarray', then you're claiming that your code works correctly for:
- regular ndarrays
- np.matrix
- np.ma masked arrays
- and every third party subclass, regardless of their semantics, regardless of whether you've heard of them or not

If subclasses followed the Liskov substitution principle, and had different internal implementations but the same public ("duck") API, then this would be fine. But in practice, numpy limitations mean that ndarray subclasses have to have the same internal implementation, so the only reason to make an ndarray subclass is if you want to make something with a different public API. Basically the whole system is designed for subclasses to be incompatible.

The end result is that if you use asanyarray, your code is definitely wrong, because there's no way you're actually doing the right thing for arbitrary ndarray subclasses. But if you don't use asanyarray, then yeah, that's also wrong, because it won't work on mostly-compatible subclasses like astropy's. Given this, different projects reasonably make different choices -- it's not just legacy code that uses asarray.

In the long run we obviously need to come up with new options that don't have these tradeoffs (that's why we want to let units go to dtypes, implement methods like __array_ufunc__ to enable duck arrays, etc.); let's try to be sympathetic to other projects that are doing their best :-).

-n
Hi Nathaniel,

You're right, I shouldn't be righteous. Though I do think the advantage of `asanyarray` inside numpy remains that it is easy for a user to add `asarray` to their input to a numpy function, and not easy for a happily compatible subclass to avoid an `asarray` inside a numpy function! I.e., coerce as little as you can get away with...

All the best,

Marten
On Thu, Nov 2, 2017 at 3:35 PM Nathan Goldbaum <nathan12343@gmail.com> wrote:
Ah, but what if the dtype modifies the interface? That might sound evil, but it's something that's been proposed. For example, if I wanted to replace yt's YTArray in a backward compatibile way with a dtype and just use plain ndarrays everywhere, the dtype would need to *at least* modify ndarray's API, adding e.g. to(), convert_to_unit(), a units attribute, and several other things.
I suppose we'll need to sort this out. But adding new methods/properties feels pretty safe to me, as long as existing ones are guaranteed to work in the same way.
Yes, I like the idea of, effectively, creating an ABC for ndarray - with which one can register. -- Marten
On Thu, Nov 2, 2017 at 6:56 AM, <josef.pktd@gmail.com> wrote:
using pint:

>>> x
<Quantity([0 1 2 3 4], 'meter')>
>>> x / x
<Quantity([ nan 1. 1. 1. 1.], 'dimensionless')>

>>> x / (1 + x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 669, in __add__
    return self._add_sub(other, operator.add)
  File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 580, in _add_sub
    raise DimensionalityError(self._units, 'dimensionless')
pint.errors.DimensionalityError: Cannot convert from 'meter' to 'dimensionless'
I'm not sure why you have a problem with those results. You tried to take a number in meters and add a dimensionless value to that--that's not a defined operation. That's like saying: "I have a distance of 12 meters and added 1 to it." 1 what? 1 meter? Great. 1 centimeter? I need to convert, but I can do that operation. 1 second? That makes no sense. If you add units to the 1 then it's a defined operation:
>>> ureg = pint.UnitRegistry()
>>> x / (1 * ureg.meters + x)
<Quantity([ 0. 0.5 0.66666667 0.75 0.8 ], 'dimensionless')>
np.exp(x) raises pint.errors.DimensionalityError: Cannot convert from 'meter' ([length]) to 'dimensionless' (dimensionless)
Well, the Taylor series for exp (around a=0) is:

exp(x) = 1 + x + x**2 / 2 + x**3 / 6 + ...

so for that to properly add up, x needs to be dimensionless. It should be noted, though, that I've *never* seen a formula, theoretically derived or empirically fit, require directly taking exp(x) where x is a physical quantity with units. Instead, you have:

f = a * exp(kx)

Properly calculated values for a, k will have appropriate units attached to them that allow the calculation to proceed without error.

Ryan

-- Ryan May
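With pint that plays out as described (a small sketch; the values and names are made up, and `ureg` is a fresh registry):
```
import numpy as np
import pint

ureg = pint.UnitRegistry()
x = np.arange(5) * ureg.meter

k = 0.5 / ureg.meter     # the fitted rate constant carries inverse-length units
a = 2.0 * ureg.pascal    # the prefactor carries the output units

f = a * np.exp(k * x)    # k * x is dimensionless, so exp() is happy
print(f.units)           # pascal
```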
On Thu, Nov 2, 2017 at 12:46 PM, Ryan May <rmay31@gmail.com> wrote:
On Thu, Nov 2, 2017 at 6:56 AM, <josef.pktd@gmail.com> wrote:
On Thu, Nov 2, 2017 at 8:46 AM, <josef.pktd@gmail.com> wrote:
On Wed, Nov 1, 2017 at 6:55 PM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
I think the biggest issues could be resolved if __array_concatenate__ were finished. Unfortunately I don't feel like I can take that on right now.
See Ryan May's talk at scipy about using an ndarray subclass for units and the issues he's run into:
Interesting talk, but I don't see how general library code should know what units the output has. for example if units are some flows per unit of time and we average, sum or integrate over time, then what are the new units? (e.g. pandas time aggregation) What are units of covariance or correlation between two variables with the same units, and what are they between variables with different units?
How do you concatenate and operate arrays with different units?
interpolation or prediction would work with using the existing units.
partially related: statsmodels uses a wrapper for pandas Series and DataFrames and tries to preserve the index when possible and make up a new DataFrame or Series if the existing index doesn't apply. E.g. predicted values and residuals are in terms of the original provided index, and could also get original units assigned. That would also be possible with prediction confidence intervals. But for the rest, see above.
using pint
x <Quantity([0 1 2 3 4], 'meter')> x / x <Quantity([ nan 1. 1. 1. 1.], 'dimensionless')>
x / (1 + x) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 669, in __add__ raise DimensionalityError(self._units, 'dimensionless') return self._add_sub(other, operator.add) File "C:\...\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 580, in _add_sub pint.errors.DimensionalityError: Cannot convert from 'meter' to 'dimensionless'
I'm not sure why you have a problem with that results. You tried to take a number in meters and add a dimensionless value to that--that's not a defined operation. That's like saying: "I have a distance of 12 meters and added 1 to it." 1 what? 1 meter? Great. 1 centimeter? I need to convert, but I can do that operation. 1 second? That makes no sense.
If you add units to the 1 then it's a defined operation:
reg = pint.UnitRegistry() x / (1 * ureg.meters + x) <Quantity([ 0. 0.5 0.66666667 0.75 0.8 ], 'dimensionless')>
np.exp(x) raises pint.errors.DimensionalityError: Cannot convert from 'meter' ([length]) to 'dimensionless' (dimensionless)
Well, the Taylor series for exp (around a=0) is:
exp(x) = 1 + x + x**2 / 2 + x**3 / 6 + ...
so for that to properly add up, x needs to be dimensionless. It should be noted, though, that I've *never* seen a formula, theoretically derived or empirically fit, require directly taking exp(x) where x is a physical quantity with units. Instead, you have:
f = a * exp(kx)
Properly calculated values for a, k will have appropriate units attached to them that allow the calculation to proceed without error.
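A minimal sketch of that pattern with pint (the numbers and the choice of units are made up for illustration):

import numpy as np
import pint

ureg = pint.UnitRegistry()

x = np.array([0.0, 1.0, 2.0]) * ureg.meter
k = -0.5 / ureg.meter   # rate constant carries 1/length
a = 3.0 * ureg.volt     # amplitude carries the output units

f = a * np.exp(k * x)   # k * x is dimensionless, so exp() is well defined
print(f)                # values come out in volts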
I was thinking of a simple logit model to predict whether it rains tomorrow. The logit transformation for the probability is exp(k x) / (1 + exp(k x)), where k is a parameter to search for in the optimization and x is a matrix with all predictors or explanatory variables, which could all have different units. So it sounds to me that if we drop asarray, then we just get exceptions or possibly strange results, or we have to introduce a unit that matches everything (like a joker card) for any constants that we are using.

Josef
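One way to reconcile the two views, sketched with pint (the predictors, coefficients and units are entirely made up): if every predictor column keeps its own units, each coefficient has to carry the reciprocal units so that the linear predictor is dimensionless.

import numpy as np
import pint

ureg = pint.UnitRegistry()

# hypothetical predictors with different units
pressure = np.array([1000.0, 1010.0, 995.0]) * ureg.hectopascal
wind = np.array([3.0, 7.0, 1.0]) * ureg.meter / ureg.second

# coefficients carry the reciprocal units, so k * x is dimensionless
k_pressure = 0.01 / ureg.hectopascal
k_wind = 0.2 * ureg.second / ureg.meter

linpred = k_pressure * pressure + k_wind * wind   # dimensionless
prob = np.exp(linpred) / (1 + np.exp(linpred))    # inverse logit, dimensionless
print(prob)

# With bare-float coefficients, as an optimizer would naturally produce, k * x
# keeps units and np.exp(k * x) raises pint.errors.DimensionalityError --
# which is the difficulty described above.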
Ryan
-- Ryan May
Hi Josef,

Indeed, for some applications one would like to have different units for different parts of an array. And that means that, at present, the quantity implementations that we have are no good at storing, say, a covariance matrix involving parameters with different units, where thus each element of the covariance matrix has a different unit. I fear at present it would have to be an object array instead; other cases may be a bit easier to solve, by, e.g., allowing structured arrays with similarly structured units. I do note that actually doing it would clarify, e.g., what the axes in Vandermonde (spelling?) matrices mean.

That said, there is truly an enormous benefit for checking units on "regular" operations. Spacecraft have missed Mars because people didn't do it properly...

All the best,

Marten

p.s. The scipy functions should indeed be included in the ufuncs covered; there is a fairly long-standing issue for that in astropy...
On Thu, Nov 2, 2017 at 2:39 PM, Marten van Kerkwijk < m.h.vankerkwijk@gmail.com> wrote:
Hi Josef,
Indeed, for some applications one would like to have different units for different parts of an array. And that means that, at present, the quantity implementations that we have are no good at storing, say, a covariance matrix involving parameters with different units, where thus each element of the covariance matrix has a different unit. I fear at present it would have to be an object array instead; other cases may be a bit easier to solve, by, e.g., allowing structured arrays with similarly structured units. I do note that actually doing it would clarify, e.g., what the axes in Vandermonde (spelling?) matrices mean.
(I have problems remembering the spelling of proper names.) np.vander and the various polyvander functions/methods.

One point I wanted to make is that the units are overhead and irrelevant in the computation; it's the outcome that might have units. E.g. polyfit could use various underlying polynomials, e.g. numpy.polynomial.chebyshev.chebvander(...), and various linear algebra and projection versions, and the output would still have the same units.

Aside: I just found an interesting http://docs.astropy.org/en/latest/api/astropy.stats.biweight.biweight_midcov... which is pairwise, but uses asanyarray, e.g. compared with using asarray (for robust scatter) https://github.com/statsmodels/statsmodels/pull/3230/files#diff-8fd46d3044db... I guess I would have problems replacing asarray by asanyarray.

One last related point: what's the inverse of a covariance matrix in terms of units? It's just sums, multiplications and divisions (which I wouldn't remember how to track), but the computation is just np.linalg.inv or np.linalg.pinv, which is a simple shortcut.

Josef
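As a small illustration of the Vandermonde point (a sketch with made-up values): the columns of np.vander carry different powers of the unit, so the matrix itself cannot be a single homogeneous quantity and is typically built from the bare magnitudes.

import numpy as np
import pint

ureg = pint.UnitRegistry()
x = np.array([1.0, 2.0, 3.0]) * ureg.meter

# the columns would carry m**2, m**1 and m**0 respectively, so strip the units first
V = np.vander(x.magnitude, 3)
print(V)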
That said, there is truly an enormous benefit for checking units on "regular" operations. Spacecraft have missed Mars because people didn't do it properly...
https://twitter.com/search?q=2%20unit%20tests.%200%20integration%20tests
All the best,
Marten
p.s. The scipy functions should indeed be included in the ufuncs covered; there is a fairly long-standing issue for that in astropy...
On Thu, Nov 2, 2017 at 6:46 AM, <josef.pktd@gmail.com> wrote:
On Wed, Nov 1, 2017 at 6:55 PM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
I think the biggest issues could be resolved if __array_concatenate__ were finished. Unfortunately I don't feel like I can take that on right now.
See Ryan May's talk at scipy about using an ndarray subclass for units and the issues he's run into:
Interesting talk, but I don't see how general library code should know what units the output has. For example, if units are some flows per unit of time and we average, sum or integrate over time, then what are the new units? (e.g. pandas time aggregation)
A general library doesn't have to do anything--just not do annoying things like isinstance() checks and calling np.asarray() everywhere. Honestly, one of those is the core of most of the problems I run into. It's currently more complicated when doing things in compiled land, but that's implementation details, not any kind of fundamental incompatibility.

For basic mathematical operations, units have perfectly well defined semantics that many of us encountered in an introductory physics or chemistry class:
- Want to add or subtract two things? They need to have the same units; a units library can handle conversion provided they have the same dimensionality (e.g. length, time)
- Multiplication/Division: combine and cancel units (m/s * s -> m)

Everything else we do on a computer with data in some way boils down to: add, subtract, multiply, divide.

Average keeps the same units -- it's just a sum and division by a unit-less constant.

Integration (in 1-D) involves *two* variables, your data as well as the time/space coordinates (or dx or dt); fundamentally it's a multiplication by dx and a summation. The resulting units are then e.g. data.units * dx.units. This works just like it does in Physics 101, where you integrate velocity (i.e. m/s) over time (e.g. s) and get displacement (e.g. m).

What are units of covariance or correlation between two variables with the same units, and what are they between variables with different units?
Well, covariance is subtracting the mean from each variable and multiplying the residuals; therefore the units for cov(x, y):

(x.units - x.units) * (y.units - y.units) -> x.units * y.units

Correlation takes covariance and divides by the product of the standard deviations, so that's:

(x.units * y.units) / (x.units * y.units) -> dimensionless

Which is what I'd expect for a correlation.
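A quick check of that reasoning with pint (a sketch; the data values are made up):

import numpy as np
import pint

ureg = pint.UnitRegistry()

x = np.array([1.0, 2.0, 3.0, 4.0]) * ureg.meter
y = np.array([2.0, 1.0, 4.0, 3.0]) * ureg.second
n = len(x.magnitude)

# covariance "by hand": the residual product carries x.units * y.units
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
print(cov_xy)   # comes out in meter * second

# standard deviations carry the original units, so the correlation is dimensionless
std_x = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))
std_y = np.sqrt(((y - y.mean()) ** 2).sum() / (n - 1))
print(cov_xy / (std_x * std_y))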
How do you concatenate and operate on arrays with different units?
If all arrays have compatible dimensionality (say meters, inches, miles), you convert to one (say the first) and concatenate like normal. If they're not compatible, you error out.
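A sketch of that recipe with pint (the unit choices are arbitrary):

import numpy as np
import pint

ureg = pint.UnitRegistry()

a = np.array([1.0, 2.0]) * ureg.meter
b = np.array([12.0, 36.0]) * ureg.inch

# convert to the units of the first array, concatenate the bare magnitudes,
# and re-attach the common unit
combined = np.concatenate([a.magnitude, b.to(a.units).magnitude]) * a.units
print(combined)   # [1.0 2.0 0.3048 0.9144] meter

# c = np.array([1.0]) * ureg.second
# c.to(a.units) raises pint.errors.DimensionalityError: seconds are not lengths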
Interpolation or prediction would work using the existing units.
I'm sure you wrote that thinking units didn't play a role, but the math behind those operations works perfectly fine with units, with things cancelling out properly to give the same units out as in.

Ryan

-- Ryan May
On Thu, Nov 2, 2017 at 12:23 PM, Ryan May <rmay31@gmail.com> wrote:
On Thu, Nov 2, 2017 at 6:46 AM, <josef.pktd@gmail.com> wrote:
On Wed, Nov 1, 2017 at 6:55 PM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
I think the biggest issues could be resolved if __array_concatenate__ were finished. Unfortunately I don't feel like I can take that on right now.
See Ryan May's talk at scipy about using an ndarray subclass for units and the issues he's run into:
Interesting talk, but I don't see how general library code should know what units the output has. For example, if units are some flows per unit of time and we average, sum or integrate over time, then what are the new units? (e.g. pandas time aggregation)
A general library doesn't have to do anything--just not do annoying things like isinstance() checks and calling np.asarray() everywhere. Honestly one of those is the core of most of the problems I run into. It's currently more complicated when doing things in compiled land, but that's implementation details, not any kind of fundamental incompatibility.
For basic mathematical operations, units have perfectly well defined semantics that many of us encountered in an introductory physics or chemistry class:
- Want to add or subtract two things? They need to have the same units; a units library can handle conversion provided they have the same dimensionality (e.g. length, time)
- Multiplication/Division: combine and cancel units (m/s * s -> m)
Everything else we do on a computer with data in some way boils down to: add, subtract, multiply, divide.
Average keeps the same units -- it's just a sum and division by a unit-less constant.

Integration (in 1-D) involves *two* variables, your data as well as the time/space coordinates (or dx or dt); fundamentally it's a multiplication by dx and a summation. The resulting units are then e.g. data.units * dx.units. This works just like it does in Physics 101, where you integrate velocity (i.e. m/s) over time (e.g. s) and get displacement (e.g. m).
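A sketch of that integration example with pint (made-up velocity series; the trapezoid rule is written out step by step so the unit bookkeeping is visible):

import numpy as np
import pint

ureg = pint.UnitRegistry()

t = np.array([0.0, 1.0, 2.0, 3.0]) * ureg.second
v = np.array([0.0, 2.0, 4.0, 6.0]) * ureg.meter / ureg.second

# multiply by dt and sum, so the result picks up v.units * t.units -> meter
dt = np.diff(t.magnitude) * t.units
displacement = (0.5 * (v[:-1] + v[1:]) * dt).sum()
print(displacement)   # 9.0 meter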
What are units of covariance or correlation between two variables with the
same units, and what are they between variables with different units?
Well, covariance is subtracting the mean from each variable and multiplying the residuals; therefore the units for cov(x, y):
(x.units - x.units) * (y.units - y.units) -> x.units * y.units
Correlation takes covariance and divides by the product of the standard deviations, so that's:
(x.units * y.units) / (x.units * y.units) -> dimensionless
Which is what I'd expect for a correlation.
How do you concatenate and operate on arrays with different units?
If all arrays have compatible dimensionality (say meters, inches, miles), you convert to one (say the first) and concatenate like normal. If they're not compatible, you error out.
Interpolation or prediction would work using the existing units.
I'm sure you wrote that thinking units didn't play a role, but the math behind those operations works perfectly fine with units, with things cancelling out properly to give the same units out as in.
Some of it is in my reply to Marten. Regression and polyfit require an X matrix with columns in different units and then some linear algebra like solve, pinv or svd. So, while the predicted values have well defined units, the computation involves some messier operations, unless you want to forgo linear algebra in all intermediate steps and reduce it to sums, divisions and inverses.

Josef
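A sketch of what that looks like in practice with pint and np.polyfit (made-up data): the intermediate linear algebra runs on bare floats, with units stripped going in and re-attached by hand coming out.

import numpy as np
import pint

ureg = pint.UnitRegistry()

t = np.array([0.0, 1.0, 2.0, 3.0]) * ureg.second
d = np.array([0.1, 2.1, 3.9, 6.2]) * ureg.meter

slope_mag, intercept_mag = np.polyfit(t.magnitude, d.magnitude, 1)
slope = slope_mag * d.units / t.units   # meter / second
intercept = intercept_mag * d.units     # meter
print(slope, intercept)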
Ryan
-- Ryan May
participants (11)
- Ben Rowland
- Benjamin Root
- Chris Barker
- josef.pktd@gmail.com
- Marten van Kerkwijk
- Matthew Harrigan
- Nathan Goldbaum
- Nathaniel Smith
- Ryan May
- Stephan Hoyer
- William Sheffler