Hello all,
We are making a decision (again) about what to do about the
behavior of multiple-field indexing of structured arrays: Should
it return a view or a copy, and on what release schedule?
As a reminder, this refers to operations like (1.13 behavior):
>>> a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])
>>> a[['a', 'c']]
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype=[('a', '<i4'), ('c', '<f4')])
Hi Allan,

I think the consistency argument is perhaps the most important: views are very powerful and in many ways one *counts* on them happening, especially in working with large arrays. They really should be used everywhere it is possible. In this respect, I think one has to weigh breakage of some code against time spent solving unexpected bugs because a view is *not* taken (the change in MaskedArray to ensure the mask is always viewed instead of copied is another example of trying to move as much as possible in that direction).

Anyway, in favour of views.

All the best,

Marten
On Mon, 22 Jan 2018 10:11:08 -0500, Marten van Kerkwijk wrote:
> I think the consistency argument is perhaps the most important: views are very powerful and in many ways one *counts* on them happening, especially in working with large arrays.
I had the same gut feeling, but the fancy indexing example made me pause:

In [9]: x = np.arange(12, dtype=float).reshape((3, 4))
In [10]: p = x[[0, 1]] # copy of data

Then:

In [11]: x = np.array([(0, 1), (2, 3)], dtype=[('a', int), ('b', int)])
In [12]: p = x[['a', 'b']] # copy of data, but proposal will change that

We're not doing the same kind of indexing here exactly (in one case we grab elements, in the other parts of elements), but the view behavior may still break the "mental expectation".

Fortunately, there's already other proof that this operation is not exactly fancy indexing:

In [15]: x[['a', 'a']]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-3629f1d5c01d> in <module>()
----> 1 x[['a', 'a']]

KeyError: 'duplicate field of name a'

Not copying wherever possible feels like an important principle to uphold, so I am +1.

Stéfan
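The comparison above can be checked directly with `np.shares_memory`. This is only a rough sketch: the variable names `rows`, `s` and `fields` are made up for illustration, and the second result is deliberately not stated because it depends on which NumPy version is installed (copy in 1.13, view under the proposal).

```python
import numpy as np

# Integer-list ("fancy") indexing grabs whole elements and returns a copy.
x = np.arange(12, dtype=float).reshape((3, 4))
rows = x[[0, 1]]
print("fancy index shares memory:", np.shares_memory(x, rows))

# Multi-field indexing grabs *parts* of every element.
s = np.array([(0, 1), (2, 3)], dtype=[('a', int), ('b', int)])
fields = s[['a', 'b']]

# Version-dependent: False where a copy is returned, True where a view is.
print(np.__version__, "multi-field index shares memory:",
      np.shares_memory(s, fields))
```

Running this on the NumPy version in question shows which of the two behaviours that version actually implements.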
On Thu, Jan 25, 2018 at 1:16 PM, Stefan van der Walt wrote:
> We're not doing the same kind of indexing here exactly (in one case we grab elements, in the other parts of elements), but the view behavior may still break the "mental expectation".
A bit off-topic, but maybe this is another argument to just allow `x['a', 'b']` -- I never understood why a tuple was not the appropriate iterable for getting multiple items from a record.

-- Marten
On Thu, Jan 25, 2018 at 1:49 PM, Marten van Kerkwijk <m.h.vankerkwijk@gmail.com> wrote:
> In [11]: x = np.array([(0, 1), (2, 3)], dtype=[('a', int), ('b', int)])
> In [12]: p = x[['a', 'b']] # copy of data, but proposal will change that
What does this do?

p = x[['a', 'b']].copy()

My impression is that the problems with the view are because the padded view doesn't behave like a "standard" dtype or array, i.e. the follow-up behavior is the problematic part.

Josef
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On 01/25/2018 03:56 PM, josef.pktd@gmail.com wrote:
> What does this do?
>
> p = x[['a', 'b']].copy()
In 1.14.0 this creates an exact copy of what was returned by `x[['a', 'b']]`, including any padding bytes.
> My impression is that the problems with the view are because the padded view doesn't behave like a "standard" dtype or array, i.e. the follow-up behavior is the problematic part.
I think the padded view is a "standard array" in the sense that you can
easily create structured arrays with padding bytes, for example by using
the `align=True` option.
>>> np.zeros(3, dtype=np.dtype('u1,f4', align=True))
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype={'names':['f0','f1'], 'formats':['u1','<f4'], 'offsets':[0,4], 'itemsize':8, 'aligned':True})
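To make the padding in that example visible, one can compare the packed and aligned versions of the same dtype. A small sketch, assuming a typical platform where `f4` is 4-byte aligned; the names `packed`, `aligned` and `offsets` are illustrative only.

```python
import numpy as np

# Same two fields, with and without C-struct-style alignment.
packed = np.dtype('u1,f4')
aligned = np.dtype('u1,f4', align=True)

# Byte offset of each field inside one aligned record.
offsets = {name: aligned.fields[name][1] for name in aligned.names}

print("packed itemsize :", packed.itemsize)
print("aligned itemsize:", aligned.itemsize)
print("aligned offsets :", offsets)
```

The difference between the two item sizes is the padding inserted after the `u1` field, the same kind of gap a multi-field view carries when it skips a field.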
On Sun, Jan 21, 2018 at 9:48 PM, Allan Haldane wrote:
> In numpy 1.14.0 we made this return a view instead of a copy, but downstream test failures suggest we reconsider. In our current implementation for 1.14.1, we have reverted this change, but still plan to go through with it in 1.15.
>
> See here for our discussion of the problem and solutions: https://github.com/numpy/numpy/pull/10411
>
> The two main options we have discussed are either to try to make the change in 1.15, or never make the change at all and always return a copy.
>
> Here are some pros and cons:
>
> Pros (change to view in 1.15)
> =============================
>
> * Views are useful and convenient. Other forms of indexing also often return views so this is more consistent.
> * This change has been planned since numpy 1.7 in 2009, and there have been visible FutureWarnings about it since then. Anyone whose code will break should have seen the warnings. It has been extensively warned about in recent release notes.
> * Past discussions have supported the change. See my comment in the PR with many links to them and to other history.
> * Users have requested the change on the list.
> * Possibly a majority of the reported code failures were not actually caused by the change, but by another bug (#8100) involving np.load/np.save which this change exposed. If we push it off to 1.15, we will have time to fix this other bug. (There were no FutureWarnings for this breakage, of course).
> * The code that really will break is of the form a[['a', 'c']].view('i8') because the returned itemsize is different. This has raised FutureWarnings since numpy 1.7, and no users reported failures due to this change. In the PR we still try to mitigate this breakage by introducing a new method `pack_fields`, which converts the result into the 1.13 form, so that np.pack_fields(a[['a', 'c']]).view('i8') will work.
>
> Cons (keep returning a copy)
> ============================
>
> * The extra convenience is not really that much, and fancy indexing also returns a copy instead of a view, so there is a precedent there.
> * We want to minimize compatibility breaks with old behavior. We've had a fair amount of discussion and complaints about how we break things in general.
> * We have lived with a "copy" for 8 years now. At some point the behavior gets set in stone for compatibility reasons.
> * Users have written to the list and github about their code breaking in 1.14.0. As far as I am aware, they all refer to the #8100 problem.
> * If a new function `pack_fields` is needed to guard against mishaps with the view behavior, that seems like a sign that keeping the copy behavior is the best option from an API perspective.
>
> My initial vote is to go with the change in 1.15: The "view" code that will ultimately break (not the code related to #8100) has been sending FutureWarnings for many years, and I am not aware of any user complaints involving it: All the complaints so far would be fixed with #8100 in 1.15.
(Note: based on a linked mailing list thread, 2012 might be the last time I looked more closely at structured dtypes. So some of what I understand might be outdated.)

Views on structured dtypes are very important, but viewing them as standard arrays with standard dtypes is the main part that I had used. Essentially structured dtypes are useless for any computation, e.g. just some simple reduce operation. To work with them we need a standard view.

I think the use case that fails in statsmodels (except there is no test failure anymore because we switched to using pandas in the unit test) is

    cls.confint_res = cls.results[['acvar_lb','acvar_ub']].view((float, 2))
E   ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged

This is similar to the above example a[['a', 'c']].view('i8') but it doesn't try to combine fields.

In many examples where I used structured dtypes a long time ago, I switched between consistent views as either a standard array of subsets or as structured dtypes. For this use case it wouldn't matter whether a[['a', 'c']] returns a view or copy, as long as we can get the second view that is consistent with the selected part of the memory. This would also be independent of whether numpy pads internally and adjusts the strides if possible or not.
>>> np.__version__
'1.11.2'
>>> a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])
>>> a
array([(1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0), (1, 1.0, 1.0)],
      dtype=[('a', '<i8'), ('b', '<f8'), ('c', '<f8')])
>>> a.mean(0)
Traceback (most recent call last):
  File "", line 1, in <module>
    a.mean(0)
  File "C:\...\python-3.4.4.amd64\lib\site-packages\numpy\core\_methods.py", line 65, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
>>> a[['b', 'c']].mean(0)
Traceback (most recent call last):
  File "", line 1, in <module>
    a[['b', 'c']].mean(0)
  File "C:\...\python-3.4.4.amd64\lib\site-packages\numpy\core\_methods.py", line 65, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
>>> a[['b', 'c']].view(('f8', 2)).mean(0)
array([ 1., 1.])
>>> a[['b', 'c']].view(('f8', 2)).dtype
dtype('float64')
Aside: The plan is that statsmodels will drop all usage and support for recarrays/structured dtypes in the following release (0.10). Then structured dtypes are free (from our perspective) to provide low level struct support instead of pretending to be dataframe_like.

Josef
> Feel free to also discuss the related proposed change, to make np.diag return a view instead of a copy. That change has not been implemented yet, only proposed.
>
> Cheers,
> Allan
On Mon, Jan 22, 2018 at 10:53 AM, josef.pktd@gmail.com wrote:
> I think the usecase that fails in statsmodels (except there is no test failure anymore because we switched to using pandas in the unit test)
To add a detail here: `results` is a recarray created from a csv file with

results = genfromtxt(open(filename, "rb"), delimiter=",", names=True, dtype=float)

['acvar_lb','acvar_ub'] are the last two columns, so this corresponds to my `a[['b', 'c']]` example, where AFAIU no padding is necessary to get a view.
> cls.confint_res = cls.results[['acvar_lb','acvar_ub']].view((float, 2))
> E ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged
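To illustrate the "no padding is necessary" point: when the selected fields are adjacent and share a dtype, a plain 2-D float view over them can be described with strides alone. The sketch below uses `numpy.lib.stride_tricks.as_strided` purely as a demonstration of the memory layout; it is not something proposed in this thread, and `as_strided` needs care because it does no bounds checking. The name `bc` is made up.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

# Single-field indexing is a view; 'c' sits 8 bytes after 'b' in each record,
# and consecutive records are a.itemsize bytes apart.
b = a['b']
bc = as_strided(b, shape=(a.shape[0], 2), strides=(a.itemsize, 8))

# A write through the structured array is visible in the 2-D float view.
a['c'][0] = 5.0
print(bc[0])
print(bc.mean(0))
```

This only works because 'b' and 'c' are contiguous `f8` fields; selecting 'a' and 'c' would leave a gap that no regular stride can skip within a row.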
On 01/22/2018 10:53 AM, josef.pktd@gmail.com wrote:
> a[['b', 'c']].view(('f8', 2)).mean(0)
> array([ 1., 1.])
> a[['b', 'c']].view(('f8', 2)).dtype
> dtype('float64')
Hmm, this did not raise a FutureWarning in 1.11.2, so I was not quite right in my message. It looks like this particular line only started raising FutureWarnings in 1.12.0.
> Aside The plan is that statsmodels will drop all usage and support for rec_arays/structured dtypes in the following release (0.10). Then structured dtypes are free (from our perspective) to provide low level struct support instead of pretending to be dataframe_like.
Your use of structured arrays is "pandas-like", i.e. you are using them for tabular data manipulation. In numpy 1.13 we updated the structured docs to discourage this. Of course users can do what they want, but here is what the new docs say:

    Structured arrays are designed for low-level manipulation of structured data, for example, for interpreting binary blobs. Structured datatypes are designed to mimic 'structs' in the C language, making them also useful for interfacing with C code. For these purposes, numpy supports specialized features such as subarrays and nested datatypes, and allows manual control over the memory layout of the structure.

    For simple manipulation of tabular data other pydata projects, such as pandas, xarray, or DataArray, provide higher-level interfaces that may be more suitable. These projects may also give better performance for tabular data analysis because the C-struct-like memory layout of structured arrays can lead to poor cache behavior.

Allan
On Mon, Jan 22, 2018 at 11:13 AM, Allan Haldane wrote:
Once upon a time ....

The test code was written in June 2010.
In Oct/Nov 2017 we switched to pandas for loading the data but not for the reference `results` to avoid numpy recarray warnings.
In Jan 2018 we switched to pandas also for the reference results.

statsmodels has a lot of "legacy" code, especially in the datasets and unit tests, from when recarrays were still the appropriate precursor to pandas. recarrays are built on structured dtypes, and were not just supposed to be low level C-structs.

Josef
On 01/22/2018 10:53 AM, josef.pktd@gmail.com wrote:
Thanks for a real example to think about.

I just want to note that I thought of another way to "fix" this for 1.15 which does not involve "pack_fields", which is

a[['b', 'c']].astype('f8,f8').view(('f8', 2))

which is back-compatible with numpy back to 1.7, I think. So that's another option to ease the transition.

Allan
On 01/22/2018 11:23 AM, Allan Haldane wrote:
> I just want to note that I thought of another way to "fix" this for 1.15 which does not involve "pack_fields", which is
>
> a[['b', 'c']].astype('f8,f8').view(('f8', 2))
>
> Which is back-compatible with numpy back to 1.7, I think.
Apologies, this is not back-compatible, do not use it. I forgot that past versions of numpy had a weird quirk where this replaces all the structured data with 0s.

Allan
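For readers on recent NumPy releases, a hedged sketch of how the `.view(('f8', 2))` use case can be written today. It relies on `repack_fields` and `structured_to_unstructured` from `numpy.lib.recfunctions`, which exist in current NumPy but are not the `pack_fields` name floated in this thread; the variable names are illustrative, and none of this applies to the old versions discussed above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.ones(5, dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])
sel = a[['b', 'c']]          # a view with padding on NumPy versions that return views

# Option 1: copy into a packed layout first, then reinterpret the bytes.
packed = rfn.repack_fields(sel)
as_float = packed.view(('f8', 2))

# Option 2: convert straight to a plain 2-D float array.
plain = rfn.structured_to_unstructured(sel)

print(sel.dtype.itemsize, packed.dtype.itemsize)
print(as_float.mean(0), plain.mean(0))
```

Whether `sel` already has a packed item size depends on the NumPy version; after `repack_fields` the item size covers only the two selected fields either way.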
participants (4)
- Allan Haldane
- josef.pktd@gmail.com
- Marten van Kerkwijk
- Stefan van der Walt