Setting custom dtypes and 1.14
Hi all,

I'm pretty sure this is the same thing as recently discussed on this list about 1.14, but to confirm:

I had failures in my code with an upgrade to 1.14 -- turns out it was a single line in a single test fixture, so no big deal, but a regression just the same, with no deprecation warning. I was essentially doing this:

In [48]: dt
Out[48]: dtype([('time', '<i8'), ('value', [('u', '<f8'), ('v', '<f8')])], align=True)

In [49]: uv
Out[49]:
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [50]: time
Out[50]: array([1, 1, 1, 1])

In [51]: full = np.array(zip(time, uv), dtype=dt)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-ed726f71dd4a> in <module>()
----> 1 full = np.array(zip(time, uv), dtype=dt)

ValueError: setting an array element with a sequence.

It took some poking, but the solution was to do:

full = np.array(zip(time, (tuple(w) for w in uv)), dtype=dt)

That is, convert the values to nested tuples, rather than an array in a tuple, or a list in a tuple.

As I said, my problem is solved, but to confirm:

1) This is a known change with good reason?

2) My solution was the best (only) one -- the only way to set a nested dtype like that is with tuples?

If so, then I think we should:

A) improve the error message. "ValueError: setting an array element with a sequence." is not really clear -- I spent a while trying to figure out how I could set a nested dtype like that without a sequence, and I was actually using an ndarray, so it wasn't even a generic sequence. And a tuple is a sequence, too... I had a vague recollection that in some circumstances, numpy treats tuples and lists (and arrays) differently (fancy indexing?), so I tried the tuple thing and that worked. But I've been around numpy a long time -- that could have been very, very confusing to many people. So could the message be changed to something like:

"ValueError: setting an array element with a generic sequence. Only the tuple type can be used in this context."

or something like that -- I'm not sure where else this same error message might pop up, so that could be totally inappropriate.

B) maybe add a .totuple() method to ndarray, much like the .tolist() method? That would have been handy here.

Chris

-- 
Christopher Barker, Ph.D. Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
On 01/25/2018 06:06 PM, Chris Barker wrote:
Hi all,
I'm pretty sure this is the same thing as recently discussed on this list about 1.14, but to confirm:
I had failures in my code with an upgrade to 1.14 -- turns out it was a single line in a single test fixture, so no big deal, but a regression just the same, with no deprecation warning.
I was essentially doing this:
In [48]: dt
Out[48]: dtype([('time', '<i8'), ('value', [('u', '<f8'), ('v', '<f8')])], align=True)

In [49]: uv
Out[49]:
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

In [50]: time
Out[50]: array([1, 1, 1, 1])

In [51]: full = np.array(zip(time, uv), dtype=dt)

ValueError                                Traceback (most recent call last)
<ipython-input-51-ed726f71dd4a> in <module>()
----> 1 full = np.array(zip(time, uv), dtype=dt)
ValueError: setting an array element with a sequence.
It took some poking, but the solution was to do:
full = np.array(zip(time, (tuple(w) for w in uv)), dtype=dt)
That is, convert the values to nested tuples, rather than an array in a tuple, or a list in a tuple.
As I said, my problem is solved, but to confirm:
1) This is a known change with good reason?
This change is a little different from what we discussed before. The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.

If you modify your dtype above, changing both 'f8' fields to 'f4', you will see you get very strange results: your array gets filled in with the values (1, (0., 1.875)).

Here's what happened: previously, numpy was *not* iterating your data as a sequence. Instead, if numpy did not find a tuple it would interpret the data as a raw buffer and copy the value byte-by-byte, ignoring endianness, casting, stride, etc. You can get even weirder results if you do `uv = uv.astype('i4')`, for example.

It happened to work for you because ndarrays expose a buffer interface, and you were assigning using exactly the same type and endianness.

In 1.14 the fix was to disallow this 'buffer' assignment for structured arrays; it was causing quite confusing bugs. Unstructured "void" arrays still do this, though.
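For concreteness, here is a sketch of the working construction from this thread (written for Python 3, so the `zip` result is wrapped in `list()`; the names `dt`, `time`, and `uv` mirror the session above):

```python
import numpy as np

# The dtype and data from the session above.
dt = np.dtype([('time', '<i8'), ('value', [('u', '<f8'), ('v', '<f8')])],
              align=True)
time = np.array([1, 1, 1, 1])
uv = np.ones((4, 2))

# Under 1.14 each structured element must be a tuple, so each row of uv
# is converted to a tuple before assembly.
full = np.array(list(zip(time, (tuple(w) for w in uv))), dtype=dt)
print(full['value']['u'])  # [1. 1. 1. 1.]
```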
2) My solution was the best (only) one -- the only way to set a nested dtype like that is with tuples?
Right, our solution was to only allow assignment from tuples. We might be able to relax that for structured scalars, but for arrays I remember one consideration was to avoid confusion with array broadcasting. If you do

>>> x = np.zeros(2, dtype='i4,i4')
>>> x[:] = np.array([3, 4])
>>> x
array([(3, 3), (4, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])

it might be the opposite of what you expect. Compare to

>>> x[:] = (3, 4)
>>> x
array([(3, 4), (3, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])
If so, then I think we should:
A) improve the error message.
"ValueError: setting an array element with a sequence."
is not really clear -- I spent a while trying to figure out how I could set a nested dtype like that without a sequence, and I was actually using an ndarray, so it wasn't even a generic sequence. And a tuple is a sequence, too...
I had a vague recollection that in some circumstances, numpy treats tuples and lists (and arrays) differently (fancy indexing?), so I tried the tuple thing and that worked. But I've been around numpy a long time -- that could have been very, very confusing to many people.
So could the message be changed to something like:
"ValueError: setting an array element with a generic sequence. Only the tuple type can be used in this context."
or something like that -- I'm not sure where else this same error message might pop up, so that could be totally inappropriate.
Good idea. I'll see if we can do it for 1.14.1.
B) maybe add a .totuple() method to ndarray, much like the .tolist() method? That would have been handy here.
Chris

Christopher Barker, Ph.D. Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
... The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method -- even before this I've wanted that. But in this case, it seems particularly useful.

CHB
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
... The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
CHB
Two thoughts:

1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.

2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples? That would raise an exception if read back in by `np.array` with the same dtype.

These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions. If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?

    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(1)

Then you could do:

    >>> pack_last_axis(uv).tolist()

to get a list of tuples.

Allan
Why is the list of tuples a useful thing to have in the first place? If the goal is to convert an array into a structured array, you can do that far more efficiently with:

    def make_tup_dtype(arr):
        """
        Attempt to make a type capable of viewing the last axis of an array,
        even if it is non-contiguous. Unfortunately `.view` doesn't allow us
        to use this dtype in that case, which needs a patch...
        """
        n_fields = arr.shape[1]
        step = arr.strides[1]
        descr = dict(names=[], formats=[], offsets=[], itemsize=step * n_fields)
        for i in range(n_fields):
            descr['names'].append('f{}'.format(i))
            descr['offsets'].append(step * i)
            descr['formats'].append(arr.dtype)
        return np.dtype(descr)

Used as:
    >>> arr = np.arange(6).reshape(3, 2)
    >>> arr.view(make_tup_dtype(arr)).squeeze(axis=1)
    array([(0, 1), (2, 3), (4, 5)], dtype=[('f0', '<i4'), ('f1', '<i4')])
Perhaps this should be provided by recfunctions (or maybe it already is, in a less rigid form?)

Eric

On Fri, 26 Jan 2018 at 10:48 Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
... The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
CHB
Two thoughts:
1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.
2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples? That would raise an exception if read back in by `np.array` with the same dtype.
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions. If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?
    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(1)
Then you could do:
>>> pack_last_axis(uv).tolist()
to get a list of tuples.
Allan
Apologies, it seems that I skipped to the end of @ahaldane's remark -- we're on the same page.

On Fri, 26 Jan 2018 at 11:17 Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
Why is the list of tuples a useful thing to have in the first place? If the goal is to convert an array into a structured array, you can do that far more efficiently with:
    def make_tup_dtype(arr):
        """
        Attempt to make a type capable of viewing the last axis of an array,
        even if it is non-contiguous. Unfortunately `.view` doesn't allow us
        to use this dtype in that case, which needs a patch...
        """
        n_fields = arr.shape[1]
        step = arr.strides[1]
        descr = dict(names=[], formats=[], offsets=[], itemsize=step * n_fields)
        for i in range(n_fields):
            descr['names'].append('f{}'.format(i))
            descr['offsets'].append(step * i)
            descr['formats'].append(arr.dtype)
        return np.dtype(descr)
Used as:
    >>> arr = np.arange(6).reshape(3, 2)
    >>> arr.view(make_tup_dtype(arr)).squeeze(axis=1)
    array([(0, 1), (2, 3), (4, 5)], dtype=[('f0', '<i4'), ('f1', '<i4')])
Perhaps this should be provided by recfunctions (or maybe it already is, in a less rigid form?)
Eric
On Fri, 26 Jan 2018 at 10:48 Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
... The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
CHB
Two thoughts:
1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.
2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples? That would raise an exception if read back in by `np.array` with the same dtype.
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions. If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?
    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(1)
Then you could do:
>>> pack_last_axis(uv).tolist()
to get a list of tuples.
Allan
On Fri, Jan 26, 2018 at 10:48 AM, Allan Haldane <allanhaldane@gmail.com> wrote:
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
Two thoughts:
1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.
I was thinking it would be exactly like .tolist() but with tuples -- so you'd get tuples all the way down (or is that turtles?)

In this use case, it would have saved me the generator expression:

(tuple(r) for r in arr)

not a huge deal, but it would be nice to not have to write that, and to have the looping be in C with no intermediate array generation.

2. structured array's .tolist() already returns a list of tuples. If we
have a 2d structured array, would it add one more layer of tuples?
no -- why? It would return a tuple of tuples instead.
That would raise an exception if read back in by `np.array` with the same dtype.
Hmm -- indeed, if the top-level structure is a tuple, the array constructor gets confused.

This works fine -- as it should:

In [84]: new_full = np.array(full.tolist(), full.dtype)

But this does not:

In [85]: new_full = np.array(tuple(full.tolist()), full.dtype)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-85-c305063184ff> in <module>()
----> 1 new_full = np.array(tuple(full.tolist()), full.dtype)

ValueError: could not assign tuple of length 4 to structure with 2 fields.

I was hoping it would dig down to the inner structures looking for a match to the dtype, rather than looking at the type of the top level. Oh well.

So yeah, not sure where you would go from tuple to list -- probably at the bottom level, but that may not always be unambiguous.

These points make me think that instead of a `.totuple` method, this
might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?
    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(1)
Then you could do:
>>> pack_last_axis(uv).tolist()
to get a list of tuples.
not sure what the idea is here -- in my example, I had a regular 2d array, so no names:

In [90]: pack_last_axis(uv)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-90-a75ee44c8401> in <module>()
----> 1 pack_last_axis(uv)

<ipython-input-89-cfbc76779d1f> in pack_last_axis(arr, names)
      1 def pack_last_axis(arr, names=None):
----> 2     if arr.names:
      3         return arr
      4     names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
      5     return arr.view([(n, arr.dtype) for n in names]).squeeze(1)

AttributeError: 'numpy.ndarray' object has no attribute 'names'

So maybe you meant something like:

In [95]: def pack_last_axis(arr, names=None):
    ...:     try:
    ...:         arr.names
    ...:         return arr
    ...:     except AttributeError:
    ...:         names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
    ...:         return arr.view([(n, arr.dtype) for n in names]).squeeze(1)

which does work, but seems like a convoluted way to get tuples!

However, I didn't actually need tuples, I needed something I could pack into a structured array, and this does work, without the tolist:

full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)

So maybe that is the way to go. I'm not sure I'd have thought to look for this function, but what can you do?

Thanks for your attention to this,

CHB

-- 
Christopher Barker, Ph.D. Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
arr.names should have been arr.dtype.names in that pack_last_axis function

Eric

On Fri, 26 Jan 2018 at 12:45 Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jan 26, 2018 at 10:48 AM, Allan Haldane <allanhaldane@gmail.com> wrote:
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
Two thoughts:
1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.
I was thinking it would be exactly like .tolist() but with tuples -- so you'd get tuples all the way down (or is that turtles?)
In this use case, it would have saved me the generator expression:
(tuple(r) for r in arr)
not a huge deal, but it would be nice to not have to write that, and to have the looping be in C with no intermediate array generation.
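Since no such method exists, the behavior being asked for can be sketched in pure Python (`totuple` here is hypothetical, not a numpy API; it recurses like `.tolist()` but builds tuples):

```python
import numpy as np

def totuple(arr):
    """Hypothetical totuple(): like ndarray.tolist(), but tuples all the way down."""
    def _tup(x):
        # tolist() produces nested lists (and tuples for structured scalars);
        # convert every list level to a tuple, leave scalars/tuples alone.
        return tuple(_tup(i) for i in x) if isinstance(x, list) else x
    return _tup(arr.tolist())

print(totuple(np.ones((2, 2))))  # ((1.0, 1.0), (1.0, 1.0))
```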
2. structured array's .tolist() already returns a list of tuples. If we
have a 2d structured array, would it add one more layer of tuples?
no -- why? It would return a tuple of tuples instead.
That would raise an exception if read back in by `np.array` with the same dtype.
Hmm -- indeed, if the top-level structure is a tuple, the array constructor gets confused:
This works fine -- as it should:
In [84]: new_full = np.array(full.tolist(), full.dtype)
But this does not:
In [85]: new_full = np.array(tuple(full.tolist()), full.dtype)

ValueError                                Traceback (most recent call last)
<ipython-input-85-c305063184ff> in <module>()
----> 1 new_full = np.array(tuple(full.tolist()), full.dtype)
ValueError: could not assign tuple of length 4 to structure with 2 fields.
I was hoping it would dig down to the inner structures looking for a match to the dtype, rather than looking at the type of the top level. Oh well.
So yeah, not sure where you would go from tuple to list -- probably at the bottom level, but that may not always be unambiguous.
These points make me think that instead of a `.totuple` method, this
might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?
    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(1)
Then you could do:
>>> pack_last_axis(uv).tolist()
to get a list of tuples.
not sure what the idea is here -- in my example, I had a regular 2d array, so no names:
In [90]: pack_last_axis(uv)

AttributeError                            Traceback (most recent call last)
<ipython-input-90-a75ee44c8401> in <module>()
----> 1 pack_last_axis(uv)

<ipython-input-89-cfbc76779d1f> in pack_last_axis(arr, names)
      1 def pack_last_axis(arr, names=None):
----> 2     if arr.names:
      3         return arr
      4     names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
      5     return arr.view([(n, arr.dtype) for n in names]).squeeze(1)

AttributeError: 'numpy.ndarray' object has no attribute 'names'
So maybe you meant something like:
In [95]: def pack_last_axis(arr, names=None):
    ...:     try:
    ...:         arr.names
    ...:         return arr
    ...:     except AttributeError:
    ...:         names = names or ['f{}'.format(i) for i in range(arr.shape[1])]
    ...:         return arr.view([(n, arr.dtype) for n in names]).squeeze(1)
which does work, but seems like a convoluted way to get tuples!
However, I didn't actually need tuples, I needed something I could pack into a structured array, and this does work, without the tolist:
full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)
So maybe that is the way to go.
I'm not sure I'd have thought to look for this function, but what can you do?
Thanks for your attention to this,
CHB

Christopher Barker, Ph.D. Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
On 01/26/2018 03:38 PM, Chris Barker wrote:
I was hoping it would dig down to the inner structures looking for a match to the dtype, rather than looking at the type of the top level. Oh well.
So yeah, not sure where you would go from tuple to list  probably at the bottom level, but that may not always be unambiguous.
As I remember, numpy has some fairly convoluted code for array creation which tries to make sense of various nested lists/tuples/ndarray combinations. It makes a difference for structured arrays and object arrays. I don't remember the details right now, but I know in some cases the rule is "if it's a Python list, recurse; otherwise assume it is an object array".

While numpy does try to be lenient, I think we should guide the user to assume that if they want to specify a structured element, they should only use a tuple or a structured scalar, and if they want to specify a new dimension of elements, they should use a list. I expect fewer headaches that way.
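A minimal illustration of that tuple-vs-list rule (the `f0`/`f1` field names are just numpy's defaults):

```python
import numpy as np

dt = np.dtype([('f0', '<i4'), ('f1', '<i4')])

# A tuple specifies a single structured element...
scalar = np.array((3, 4), dtype=dt)
# ...while a list specifies a new dimension of elements.
arr = np.array([(3, 4), (5, 6)], dtype=dt)

print(scalar.shape)  # ()
print(arr.shape)     # (2,)
```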
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module  and I'm running 1.14.0  is this a new idea?
Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions".

It is actually quite old, but has never been officially documented. I think that is because it has been considered "provisional" for a long time. See

https://github.com/numpy/numpy/issues/5008
https://github.com/numpy/numpy/issues/2805

I still hesitate to make it more official now, since I'm not sure that structured arrays are yet bug-free enough to encourage more complex uses.

Also, the functions in that module encourage "pandas-like" use of structured arrays, but I'm not sure they should be used that way. I've been thinking they should be primarily used for binary interfaces with/to numpy, e.g. to talk to C programs or to read complicated binary files.
However, I didn't actually need tuples, I needed something I could pack into a structured array, and this does work, without the tolist:
full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)
So maybe that is the way to go.
Right, that was my feeling: that we didn't really need `.totuple`; what we actually wanted is a special function for packing a non-structured array as a structured array.
I'm not sure I'd have thought to look for this function, but what can you do?
Thanks for your attention to this,
CHB

Christopher Barker, Ph.D. Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
On Fri, Jan 26, 2018 at 2:35 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
As I remember, numpy has some fairly convoluted code for array creation which tries to make sense of various nested lists/tuples/ndarray combinations. It makes a difference for structured arrays and object arrays. I don't remember the details right now, but I know in some cases the rule is "If it's a Python list, recurse, otherwise assume it is an object array".
that's at least explainable, and the "try to figure out what the user means" array creation is pretty much an impossible problem, so what we've got is probably about as good as it can get.
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions".
thanks -- found it.
Also, the functions in that module encourage "pandas-like" use of structured arrays, but I'm not sure they should be used that way. I've been thinking they should be primarily used for binary interfaces with/to numpy, e.g. to talk to C programs or to read complicated binary files.
that's my use case. And I agree -- if you really want to do that kind of thing, pandas is the way to go.

I thought recarrays were pretty cool back in the day, but pandas is a much better option.

So I pretty much only use structured arrays for data exchange with C code....

CHB

-- 
Christopher Barker, Ph.D. Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
On Fri, Jan 26, 2018 at 5:48 PM, Chris Barker <chris.barker@noaa.gov> wrote:
On Fri, Jan 26, 2018 at 2:35 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
As I remember, numpy has some fairly convoluted code for array creation which tries to make sense of various nested lists/tuples/ndarray combinations. It makes a difference for structured arrays and object arrays. I don't remember the details right now, but I know in some cases the rule is "If it's a Python list, recurse, otherwise assume it is an object array".
that's at least explainable, and the "try to figure out what the user means" array creation is pretty much an impossible problem, so what we've got is probably about as good as it can get.
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions".
thanks -- found it.
Also, the functions in that module encourage "pandas-like" use of structured arrays, but I'm not sure they should be used that way. I've been thinking they should be primarily used for binary interfaces with/to numpy, e.g. to talk to C programs or to read complicated binary files.
that's my use case. And I agree -- if you really want to do that kind of thing, pandas is the way to go.
I thought recarrays were pretty cool back in the day, but pandas is a much better option.
So I pretty much only use structured arrays for data exchange with C code....
My impression is that this turns into a "deprecate recarrays and support recfunctions" issue.

recfunctions and the associated functions from matplotlib.mlab were explicitly designed for using structured dtypes as dataframe-like.

(old question: does numpy have a sort_rows function now without detouring to structured dtype views?)

Josef
<all code needs to be rewritten every 5 to 10 years.>
CHB

Christopher Barker, Ph.D. Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
Chris.Barker@noaa.gov
On 01/26/2018 06:01 PM, josef.pktd@gmail.com wrote:
I thought recarrays were pretty cool back in the day, but pandas is a much better option.
So I pretty much only use structured arrays for data exchange with C code....
My impression is that this turns into a "deprecate recarrays and support recfunctions" issue.
recfunctions and the associated functions from matplotlib.mlab were explicitly designed for using structured dtypes as dataframe-like.

(old question: does numpy have a sort_rows function now without detouring to structured dtype views?)
No, that's still the way to do it.

*Should* we have any dataframe-like functionality in numpy?

We get requests every once in a while about how to sort rows, or about adding a "groupby" function. I myself have used recarrays in a dataframe-like way, when I wanted a quick multiple-array object that supported numpy indexing. So there is some demand to have minimal "dataframe-like" behavior in numpy itself.

recarrays play part of this role currently, though imperfectly due to padding and cache issues. I think I'm comfortable with supporting some minor use of structured/recarrays as dataframe-like, with a warning in docs that the user should really look at pandas/xarray, and that structured arrays are primarily for data exchange.

(If we want to dream, maybe one day we should make a minimal multiple-array container class. I imagine it would look pretty similar to recarray, but stored as a set of arrays instead of a structured array. But maybe recarrays are good enough, and let's not reimplement pandas either.)

Allan
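For reference, the "detour through a structured dtype view" mentioned above looks roughly like this (a sketch, not an official recipe; `np.lexsort` on the columns is the view-free alternative):

```python
import numpy as np

a = np.array([[3, 1],
              [1, 2],
              [3, 0]])

# View each row as one structured scalar, sort on the fields, view back.
row_dt = np.dtype([('f0', a.dtype), ('f1', a.dtype)])
rows = np.ascontiguousarray(a).view(row_dt).ravel()
srt = np.sort(rows, order=['f0', 'f1']).view(a.dtype).reshape(a.shape)
print(srt)  # rows in lexicographic order: (1, 2), (3, 0), (3, 1)
```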
Josef <all code needs to be rewritten every 5 to 10 years.>
CHB 
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
On Sat, Jan 27, 2018 at 8:50 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/26/2018 06:01 PM, josef.pktd@gmail.com wrote:
I thought recarrays were pretty cool back in the day, but pandas is a much better option.
So I pretty much only use structured arrays for data exchange with C code....
My impression is that this turns into a "deprecate recarrays and support recfunctions" issue.
*Should* we have any dataframe-like functionality in numpy?

We get requests every once in a while about how to sort rows, or about adding a "groupby" function. I myself have used recarrays in a dataframe-like way, when I wanted a quick multiple-array object that supported numpy indexing. So there is some demand to have minimal "dataframe-like" behavior in numpy itself.

recarrays play part of this role currently, though imperfectly due to padding and cache issues. I think I'm comfortable with supporting some minor use of structured/recarrays as dataframe-like, with a warning in docs that the user should really look at pandas/xarray, and that structured arrays are primarily for data exchange.
Well, I think we should either:

deprecate recarrays, i.e. explicitly not support DataFrame-like functionality in numpy, keeping only the data-exchange functionality as maintained,

or

properly support it, which doesn't mean reimplementing Pandas or xarray, but it would mean addressing any bug-like issues, such as not dealing properly with padding.

Personally, I don't need/want it enough to contribute, but if someone does, great. This reminds me a bit of the old numpy.Matrix issue: it was ALMOST there, but not quite, with issues, and there was essentially no overlap between the people who wanted it and the people who had the time and skills to really make it work.

> (If we want to dream, maybe one day we should make a minimal multiple-array container class. I imagine it would look pretty similar to recarray, but stored as a set of arrays instead of a structured array. But maybe recarrays are good enough, and let's not reimplement pandas either.)

Exactly, we really don't need to reimplement Pandas.... (except its CSV reading capability :) )

CHB
I think that there's a lot of confusion going around about recarrays vs structured arrays.

[`recarray`](https://github.com/numpy/numpy/blob/v1.13.0/numpy/core/records.py) is a wrapper around structured arrays that provides:
* Attribute access to fields, as `arr.field`, in addition to the normal `arr['field']`
* Automatic datatype-guessing for nested lists of tuples (which needs a little work, but seems like a justifiable feature)
* An undocumented `field` method that behaves like the 1.14 indexing behavior (!)

Meanwhile, `recfunctions` is a collection of functions that work on normal structured arrays, so it is misleadingly named. The only link to recarrays is that most of the functions have an `asrecarray` parameter, which applies `.view(recarray)` to the result.
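The first point, attribute access, is easy to demonstrate; a small sketch (the field names here are made up for illustration):

```python
import numpy as np

# A plain structured array: fields are reachable only by string key.
a = np.array([(1, 2.0), (3, 4.0)], dtype=[('x', 'i8'), ('y', 'f8')])
print(a['x'])

# The same buffer viewed as a recarray adds attribute-style access.
r = a.view(np.recarray)
print(r.x)        # equivalent to r['x']
```

No data is copied by the view; `r.x` and `a['x']` refer to the same memory.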
deprecate recarrays
Given how thin an abstraction they are over structured arrays, I don't think you mean this. Are you advocating for deprecating structured arrays entirely, or just deprecating recfunctions?

Eric
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
On Mon, Jan 29, 2018 at 1:22 PM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
Given how thin an abstraction they are over structured arrays, I don't think you mean this. Are you advocating for deprecating structured arrays entirely, or just deprecating recfunctions?
First, statsmodels is in the pandas camp for dataframes, so I don't have any vested interest in recarrays/structured dtypes anymore.

What I meant was that structured dtypes with implicit (hidden?) padding become unintuitive for the recarray/dataframe use case. (At least I won't try to update my intuition to allow for extra things in there that are not specified by the main structured dtype.) Also, the dataframe-like usage of structured dtypes doesn't seem to be much under consideration anymore.

So, my **impression** is that the recent changes make the recarray/dataframe use case for structured dtypes more difficult.

Given that there are pandas, xarray, dask and more, numpy could just as well drop any pretense of supporting dataframe-likes. Or, adjust the recfunctions so we can still work dataframe-like with structured dtypes/recarrays/recfunctions.

Josef
On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd@gmail.com wrote:
Given that there is pandas, xarray, dask and more, numpy could as well drop any pretense of supporting dataframe_likes. Or, adjust the recfunctions so we can still work dataframe_like with structured dtypes/recarrays/recfunctions.
I haven't been following the duck-array discussion carefully, but could this be an opportunity for a dataframe protocol, so that we can have libraries ingest structured arrays, record arrays, pandas dataframes, etc. without too much specialized code?

Stéfan
On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
AFAIU, while not being in the data handling area, pandas defines the interface, and other libraries provide pandas-compatible interfaces or implementations.

statsmodels currently still has recarray support and usage. In some interfaces we support pandas, recarrays and plain arrays, or anything where asarray works correctly.

But recarrays became messy to support: one rewrite of some functions last year converts recarrays to pandas, does the manipulation, and then converts back to recarrays. Also, we need to adjust our recarray usage with new numpy versions, but there is no real benefit, because I doubt that statsmodels still has any recarray/structured dtype users. So, we only have to remove our own uses in the datasets and unit tests.

Josef
I <3 structured arrays. I love the fact that I can access data by row and then by fieldname, or vice versa. There are times when I need to pass just a column into a function, and there are times when I need to process things row by row. Yes, pandas is nice if you want the specialized indexing features, but it becomes a bear to deal with if all you want is normal indexing, or even the ability to easily loop over the dataset.

Cheers! Ben Root
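The row/field symmetry described here, as a quick sketch (the data and field names are invented for illustration):

```python
import numpy as np

data = np.array([(1, 2.5), (2, 3.5), (3, 4.5)],
                dtype=[('id', 'i8'), ('val', 'f8')])

print(data['val'])       # a whole column
print(data[0])           # a whole row (a record)
print(data[0]['val'])    # row first, then field...
print(data['val'][0])    # ...or field first, then row: same element
```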
On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
I don't think there is a doubt that structured arrays, i.e. arrays with structured dtypes, are a useful container. The question is whether they should be more, or the foundation for more.

For example, computing a mean, or a reduce operation, over numeric elements ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as a standard array. With padded views that breaks, and AFAICS there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or to find a workaround, like maybe looping over all relevant columns.)

Josef
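A workaround along the "loop over relevant columns" line mentioned above, which avoids views entirely and so should work across numpy versions (the array contents and field names here are made up):

```python
import numpy as np

a = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
             dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

# Copy the wanted fields into a plain 2-D array, then reduce as usual.
cols = np.column_stack([a[name] for name in ('b', 'c')])
print(cols.mean(axis=0))   # mean of the 'b' and 'c' columns
```

This copies the data, unlike the old view trick, but it is immune to any padding in the structured dtype.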
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
I don't think there is a doubt that structured arrays, arrays with structured dtypes, are a useful container. The question is whether they should be more or the foundation for more.
For example, computing a mean, or reduce operation, over numeric element ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as standard array. With padded views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or find a workaround, like maybe looping over all relevant columns.)
Josef
Just to clarify: structured types have always had padding bytes; that isn't new.

What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case.

I think the recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior.

In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically to before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.

Allan
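The single-field point is easy to check: even with an aligned dtype that carries explicit padding, indexing one field yields a plain array (the example dtype below is invented for illustration):

```python
import numpy as np

# An aligned dtype: 4 padding bytes sit between the i4 and f8 fields.
dt = np.dtype([('x', 'i4'), ('y', 'f8')], align=True)
a = np.zeros(3, dtype=dt)
a['y'] = [1.0, 2.0, 3.0]

# Single-field indexing returns a plain float array; the padding
# never gets in the way of a reduction over that column.
print(a['y'].dtype)
print(a['y'].mean())
```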
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
In other words, padding is no obstacle at all to computing a mean over a column, and singlefield indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multifield indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.
From the example in the other thread:

a[['b', 'c']].view(('f8', 2)).mean(0)

(From the statsmodels use case: read a csv with genfromtxt to get a recarray or structured array, select/index the numeric columns, view them as a standard array, and do whatever we can do with standard numpy arrays.)

Josef
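On numpy versions where the multi-field result keeps the parent itemsize (the padded-view behavior discussed in this thread), this one-liner needs a repack first. A hedged sketch using `numpy.lib.recfunctions.repack_fields`, which exists in newer numpy releases (the array contents below are invented):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.array([(1, 2.0, 3.0), (2, 4.0, 6.0)],
             dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

# repack_fields copies the selected fields into a packed (padding-free)
# layout, so the homogeneous view is legal again.
packed = rfn.repack_fields(a[['b', 'c']])
plain = packed.view('f8').reshape(len(a), 2)
print(plain.mean(0))
```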
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd@gmail.com> wrote:
Or, to phrase it as a question: how do we get a standard array with homogeneous dtype from the corresponding elements of a structured dtype in numpy 1.14.0?

Josef
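For what it's worth, later numpy grew a dedicated helper for exactly this conversion, `numpy.lib.recfunctions.structured_to_unstructured` (added in 1.16, so not available in 1.14 itself, where per-field copying remains the fallback). A sketch with invented data:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.array([(1, 2.0, 3.0), (2, 4.0, 6.0)],
             dtype=[('a', 'i8'), ('b', 'f8'), ('c', 'f8')])

# Copies the selected fields into a plain homogeneous array,
# regardless of any padding in the multi-field view.
plain = rfn.structured_to_unstructured(a[['b', 'c']])
print(plain)
```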
On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
> > I <3 structured arrays. I love the fact that I can access data by row and then by fieldname, or vice versa. There are times when I need to pass just a column into a function, and there are times when I need to process things row by row. Yes, pandas is nice if you want the specialized indexing features, but it becomes a bear to deal with if all you want is normal indexing, or even the ability to easily loop over the dataset.
>
> I don't think there is a doubt that structured arrays, arrays with structured dtypes, are a useful container. The question is whether they should be more, or the foundation for more.
>
> For example, computing a mean, or reduce operation, over numeric elements ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as a standard array. With padded views that breaks, and AFAICS there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or find a workaround, like maybe looping over all relevant columns.)
>
> Josef
Just to clarify, structured types have always had padding bytes; that isn't new.
What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case.
I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior.
In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.
From the example in the other thread:
a[['b', 'c']].view(('f8', 2)).mean(0)
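For reference, the old view trick only works on a packed, homogeneous dtype; a minimal sketch (field names and data are made up):

```python
import numpy as np

# Packed dtype: two f8 fields, no padding bytes, so a view is safe here.
a = np.zeros(3, dtype=[('b', 'f8'), ('c', 'f8')])
a['b'] = [1.0, 2.0, 3.0]
a['c'] = [4.0, 5.0, 6.0]

# Reinterpret each record as plain floats and reduce over rows.
plain = a.view('f8').reshape(len(a), -1)
print(plain.mean(axis=0))  # [2. 5.]
```

This only works because both fields are f8 and unpadded, which is exactly the fragility discussed below: with align=True or mixed field types the view fails.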
(From the statsmodels use case: read a csv with genfromtxt to get a recarray or structured array; select/index the numeric columns; view them as a standard array; do whatever we can do with standard numpy arrays.)
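That pipeline can be sketched roughly like this (the CSV contents and column names are invented; np.column_stack copies the fields instead of viewing them, so it is insensitive to padding):

```python
import io
import numpy as np

csv = io.StringIO("time,u,v\n1,1.0,2.0\n2,3.0,4.0\n")

# genfromtxt with names=True yields a structured array with one field per column.
data = np.genfromtxt(csv, delimiter=',', names=True)

# Select the numeric columns and stack them into a plain 2-D array (a copy).
uv = np.column_stack([data['u'], data['v']])
print(uv.mean(axis=0))  # [2. 3.]
```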
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
Or, to phrase it as a question:
How do we get a standard array with homogeneous dtype from the corresponding elements of a structured dtype in numpy 1.14.0?
Josef
The answer may be that "numpy has never had a way to do that", even if in a few special cases you might hack a workaround using views.

That's what your example seems like to me. It uses an explicit view, which is an "expert" feature, since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.

Pandas can do row means without these problems:

>>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)

Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.

I suggest that if we want to allow either means over fields, or conversion of an nD structured array to an n+1D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.

Allan
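Such a dedicated function did land in numpy.lib.recfunctions in numpy 1.16, as structured_to_unstructured; a minimal sketch (requires numpy >= 1.16, field names invented):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

a = np.ones(4, dtype=[('x', 'i8'), ('y', 'f4'), ('z', 'f8')])

# Convert the selected fields to a plain 2-D array, casting to a common
# dtype and coping with any padding bytes along the way.
plain = rfn.structured_to_unstructured(a[['y', 'z']])
print(plain.mean(axis=0))  # [1. 1.]
```

Unlike the raw .view() trick, this does not depend on the fields being homogeneous or contiguous in memory.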
On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.
Once upon a time (*) this wasn't fragile but the only and recommended way. Because dtypes were low level with clear memory layout and stayed that way, it was easy to check item size or whatever and get different views on it. e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
(*) pre-pandas, pre-stackoverflow on the mailing lists, which was for me roughly 2008 to 2012; but a late thread: https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html "What is now the recommended way of converting structured dtypes/recarrays to ndarrays?"
I suggest that if we want to allow either means over fields, or conversion of a nD structured array to an n+1D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.
I don't really want to defend an obsolete (?) use case of structured dtypes.
However, I think there should be a decision about the future plans for whether dataframe-like usages of structured dtypes, whether direct or through higher-level classes or functions, are still supported. Instead of slowly and silently (*) removing the foundation for this use case, either support this usage or say you will be dropping it.
(*) I didn't read the details of the release notes.
And another footnote about obsolete: given that I'm the only one arguing for the dataframe-like use case of recarrays and structured dtypes, I think they are dead for this specific use case, and only my inertia and conservativeness kept them alive in statsmodels.
Josef
> Because dtypes were low level with clear memory layout and stayed that way

Dtypes have supported padded and out-of-order fields since at least 2005 (v0.8.4, https://github.com/numpy/numpy/blob/4772f10191f87a3446f4862de6d4b953e0dd95ff/scipy/base/src/multiarraymodule.c#L2750-L2766), and I would guess that the memory layout has not changed since.
The house has always been made out of glass, it just didn’t look fragile until we showed people where the stones were.
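The padding Eric mentions is easy to see by comparing itemsizes; a small illustration (field names are arbitrary):

```python
import numpy as np

packed = np.dtype([('a', 'i1'), ('b', 'f8')])
aligned = np.dtype([('a', 'i1'), ('b', 'f8')], align=True)

print(packed.itemsize)   # 9: one i1 plus one f8, no padding
print(aligned.itemsize)  # 16: seven padding bytes inserted before 'b'
```

With align=True the f8 field is moved to offset 8, so any code that assumes the fields are packed back-to-back silently breaks.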
On Tue, Jan 30, 2018 at 3:24 AM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
> Because dtypes were low level with clear memory layout and stayed that way

Dtypes have supported padded and out-of-order fields since at least 2005 (v0.8.4, https://github.com/numpy/numpy/blob/4772f10191f87a3446f4862de6d4b953e0dd95ff/scipy/base/src/multiarraymodule.c#L2750-L2766), and I would guess that the memory layout has not changed since.
The house has always been made out of glass, it just didn’t look fragile until we showed people where the stones were.
Even so, I don't remember any problems with it. There might have been stones on the side streets and alleys, but 1.14.0 puts a big padded stone right in front of the driveway. (Maybe only the solarium was made out of glass; now it's also the billiard room.)
(I never had to learn about padding, and I don't remember having any related problems getting statsmodels through Debian testing on various machine types.)
Josef
<mailto:josef.pktd@gmail.com>> wrote: > > Given that there is pandas, xarray, dask and more, numpy > could as well drop > any pretense of supporting dataframe_likes. Or, adjust > the recfunctions so > we can still work dataframe_like with structured > dtypes/recarrays/recfunctions. > > > I haven't been following the duckarray discussion carefully, > but could > this be an opportunity for a dataframe protocol, so that we > can have > libraries ingest structured arrays, record arrays, pandas > dataframes, > etc. without too much specialized code? > > > AFAIU while not being in the data handling area, pandas defines > the interface and other libraries provide pandas compatible > interfaces or implementations. > > statsmodels currently still has recarray support and usage. In > some interfaces we support pandas, recarrays and plain arrays, > or anything where asarray works correctly. > > But recarrays became messy to support, one rewrite of some > functions last year converts recarrays to pandas, does the > manipulation and then converts back to recarrays. > Also we need to adjust our recarray usage with new numpy > versions. But there is no real benefit because I doubt that > statsmodels still has any recarray/structured dtype users. So, > we only have to remove our own uses in the datasets and unit tests. 
> > Josef > > > > > Stéfan > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org> <mailto:NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org>> > https://mail.python.org/mailman/listinfo/numpydiscussion <https://mail.python.org/mailman/listinfo/numpydiscussion> > <https://mail.python.org/mailman/listinfo/numpy discussion <https://mail.python.org/mailman/listinfo/numpydiscussion>> > > > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org> <mailto:NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org>> > https://mail.python.org/mailman/listinfo/numpydiscussion <https://mail.python.org/mailman/listinfo/numpydiscussion> > <https://mail.python.org/mailman/listinfo/numpy discussion <https://mail.python.org/mailman/listinfo/numpydiscussion>> > > > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org> <mailto:NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org>> > https://mail.python.org/mailman/listinfo/numpydiscussion <https://mail.python.org/mailman/listinfo/numpydiscussion> > <https://mail.python.org/mailman/listinfo/numpy discussion <https://mail.python.org/mailman/listinfo/numpydiscussion>> > > > > > _______________________________________________ > NumPyDiscussion mailing list > NumPyDiscussion@python.org <mailto:NumPyDiscussion@ python.org> > https://mail.python.org/mailman/listinfo/numpydiscussion <https://mail.python.org/mailman/listinfo/numpydiscussion> >
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@python.org <mailto:NumPyDiscussion@python.org
https://mail.python.org/mailman/listinfo/numpydiscussion <https://mail.python.org/mailman/listinfo/numpydiscussion>
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@python.org https://mail.python.org/mailman/listinfo/numpydiscussion
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@python.org https://mail.python.org/mailman/listinfo/numpydiscussion
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@python.org https://mail.python.org/mailman/listinfo/numpydiscussion
_______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@python.org https://mail.python.org/mailman/listinfo/numpydiscussion
On Mon, Jan 29, 2018 at 7:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
I suggest that if we want to allow either means over fields, or conversion of an n-D structured array to an n+1-D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.
IIUC, the core use case of structured dtypes is binary compatibility with external systems (arrays of C structs, mostly), at least that's how I use them :) In which case, "conversion of an n-D structured array to an n+1-D regular ndarray" is an important feature, actually even more important if you don't use recarrays. So yes, let's have a utility to make that easy.

As for recarrays: are we that far from having them be robust and useful? In which case, why not keep them around, fix the few issues, but explicitly not try to extend them into more dataframe-like domains?

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov
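The utility Chris asks for could be quite small. Here is a hypothetical sketch (the helper name `fields_to_array` is made up; this was not an existing numpy function at the time of this thread): it copies fields one at a time, so it never depends on offsets, padding, or alignment.

```python
import numpy as np

def fields_to_array(arr, fields, dtype='f8'):
    # Hypothetical helper: copy the named fields of a structured array
    # into a plain (..., len(fields)) array, one field at a time.
    # Copying field-by-field never touches padding bytes, so it works
    # for any memory layout, aligned or packed.
    out = np.empty(arr.shape + (len(fields),), dtype=dtype)
    for i, name in enumerate(fields):
        out[..., i] = arr[name]
    return out

# Works even on an aligned dtype, where view-based tricks can break.
a = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
             dtype=np.dtype([('a', 'i8'), ('b', 'f8'), ('c', 'f8')],
                            align=True))
print(fields_to_array(a, ['b', 'c']))          # [[2. 3.] [5. 6.]]
print(fields_to_array(a, ['b', 'c']).mean(0))  # [3.5 4.5]
```

The price of layout independence is a copy; writing to the result does not write back into the structured array, which matches Allan's point that a non-view conversion is the only robust general answer.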
On 01/29/2018 11:50 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
>> I <3 structured arrays. I love the fact that I can access data by row and then by fieldname, or vice versa. There are times when I need to pass just a column into a function, and there are times when I need to process things row by row. Yes, pandas is nice if you want the specialized indexing features, but it becomes a bear to deal with if all you want is normal indexing, or even the ability to easily loop over the dataset.
>
> I don't think there is a doubt that structured arrays, arrays with structured dtypes, are a useful container. The question is whether they should be more, or the foundation for more.
>
> For example, computing a mean, or reduce operation, over numeric elements ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as a standard array. With padded views that breaks and, AFAICS, there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or find a workaround, like maybe looping over all relevant columns.)
>
> Josef
Just to clarify, structured types have always had padding bytes, that isn't new.
What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case.
I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior.
In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.
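For readers who have not run into padding bytes before, a short sketch of where they come from (the field names here are illustrative; the offsets assume the usual 8-byte alignment of float64 on mainstream platforms):

```python
import numpy as np

# An aligned struct: a 4-byte int followed by an 8-byte float.
# With align=True, 'y' must sit on an 8-byte boundary, so numpy
# inserts 4 padding bytes after 'x', just like a C compiler would.
dt = np.dtype([('x', 'i4'), ('y', 'f8')], align=True)
print(dt.itemsize)        # 16  (4 data + 4 padding + 8 data)
print(dt.fields['y'][1])  # 8   (offset of 'y', after the padding)

# The packed equivalent has no padding:
packed = np.dtype([('x', 'i4'), ('y', 'f8')])
print(packed.itemsize)    # 12
```

A view that reinterprets the aligned 16-byte items as two plain floats cannot work, because 4 of those bytes are padding rather than data; this is exactly why the view-based workaround below is layout-sensitive.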
From the example in the other thread:

    a[['b', 'c']].view(('f8', 2)).mean(0)
(From the statsmodels use case: read a csv with genfromtxt to get a recarray or structured array; select/index the numeric columns; view them as a standard array; do whatever we can do with standard numpy arrays.)
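The same column mean can be computed without the layout-sensitive view, at the cost of a copy. A minimal sketch of the workflow (the structured array here is a made-up stand-in for genfromtxt output):

```python
import numpy as np

# Structured array standing in for the result of np.genfromtxt.
a = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
             dtype=[('time', 'i8'), ('b', 'f8'), ('c', 'f8')])

# Instead of a[['b', 'c']].view(('f8', 2)), which depends on the exact
# memory layout and breaks if padding appears, stack copies of the
# selected fields into a regular 2-D array, then reduce as usual.
cols = np.stack([a['b'], a['c']], axis=-1)
print(cols.mean(axis=0))  # [3.5 4.5]
```

Looping over the relevant fields, as Josef guesses above, is essentially what this does; it trades the zero-copy view for robustness against padding.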
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
Or, to phrase it as a question:
How do we get a standard array with homogeneous dtype from the corresponding elements of a structured dtype in numpy 1.14.0?
Josef
The answer may be that "numpy has never had a way to do that", even if in a few special cases you might hack a workaround using views.
That's what your example seems like to me. It uses an explicit view, which is an "expert" feature since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.
Pandas can do row means without these problems:
>>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.
Once upon a time (*) this wasn't fragile but the only and recommended way. Because dtypes were low-level with a clear memory layout and stayed that way, it was easy to check item size or whatever and get different views on it. E.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
(*) pre-pandas, pre-stackoverflow on the mailing lists, which was for me roughly 2008 to 2012; but a late thread: https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html "What is now the recommended way of converting structured dtypes/recarrays to ndarrays?"
I suggest that if we want to allow either means over fields, or conversion of an n-D structured array to an n+1-D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.
I don't really want to defend an obsolete (?) use case of structured dtypes.
However, I think there should be a decision about future plans: are dataframe-like usages of structured dtypes, whether directly or through higher-level classes or functions, still supported? Instead of slowly and silently (*) removing the foundation for this use case, either support this usage or say you will be dropping it.
(*) I didn't read the details of the release notes
And another footnote about obsolete: given that I'm the only one arguing for the dataframe_like use case of recarrays and structured dtypes, I think they are dead for this specific use case, and only my inertia and conservatism kept them alive in statsmodels.
Josef
It's a bit of a stretch to say that we are "silently" dropping support for dataframe-like use of structured arrays.

First, we still allow pretty much all dataframe-like use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multi-field indexing, which didn't exist then (it was only introduced in 2009), and we have wanted to make multi-field indexes views (not copies) since their inception.

Second, I don't think we are doing so silently: we have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets a mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12, so you should have seen FutureWarnings since then.

I don't feel the need to officially declare that we are dropping support for dataframe-like use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves.

There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like: "The submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience."

Allan
> Cheers!
> Ben Root

On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd@gmail.com wrote:

Given that there is pandas, xarray, dask and more, numpy could as well drop any pretense of supporting dataframe_likes. Or, adjust the recfunctions so we can still work dataframe_like with structured dtypes/recarrays/recfunctions.

I haven't been following the duck-array discussion carefully, but could this be an opportunity for a dataframe protocol, so that we can have libraries ingest structured arrays, record arrays, pandas dataframes, etc. without too much specialized code?

Stéfan

AFAIU, while not being in the data handling area, pandas defines the interface and other libraries provide pandas-compatible interfaces or implementations.

statsmodels currently still has recarray support and usage. In some interfaces we support pandas, recarrays and plain arrays, or anything where asarray works correctly.

But recarrays became messy to support: one rewrite of some functions last year converts recarrays to pandas, does the manipulation and then converts back to recarrays. Also we need to adjust our recarray usage with new numpy versions. But there is no real benefit because I doubt that statsmodels still has any recarray/structured dtype users. So, we only have to remove our own uses in the datasets and unit tests.

Josef
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
It's a bit of a stretch to say that we are "silently" dropping support for dataframe-like use of structured arrays.
First, we still allow pretty much all dataframe-like use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multi-field indexing, which didn't exist then (it was only introduced in 2009), and we have wanted to make multi-field indexes views (not copies) since their inception.
The 2008 mailing list thread introduced me to working with views on structured arrays as the ONLY way to switch between structured and homogeneous dtypes (if the underlying item size was homogeneous). The new stats.models started in 2009.
Second, I don't think we are doing so silently: We have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12 so you should have seen FutureWarnings since then.
If I see warnings in the test suite about getting a view instead of a copy from numpy, then the only/main consequence I think about is whether I need to watch out for in-place modification. I didn't expect that the follow-up computation would change, and that it's a padded view and not a view on the selected memory. However, I just checked, and padding is mentioned in the 1.12 release notes (which I never read before).

AFAICS, one problem is that the padded view didn't come with the matching downstream usage support: the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding, and so on.

E.g. another mailing list thread I just found with the same problem: http://numpydiscussion.10968.n7.nabble.com/viewofrecarrayissuetd32001.h...

Quoting Ralf: "Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py."

This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.

Josef
I don't feel the need to officially declare that we are dropping support for dataframe-like use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves.
There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like
"the submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience."
Allan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On Tue, Jan 30, 2018 at 1:33 PM, <josef.pktd@gmail.com> wrote:
On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 11:50 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
On Mon, Jan 29, 2018 at 5:50 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
> On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
> > I <3 structured arrays. I love the fact that I can access data by
> > row and then by fieldname, or vice versa. There are times when I
> > need to pass just a column into a function, and there are times when
> > I need to process things row by row. Yes, pandas is nice if you want
> > the specialized indexing features, but it becomes a bear to deal
> > with if all you want is normal indexing, or even the ability to
> > easily loop over the dataset.
>
> I don't think there is a doubt that structured arrays, arrays with
> structured dtypes, are a useful container. The question is whether they
> should be more, or the foundation for more.
>
> For example, computing a mean, or reduce operation, over numeric elements
> ("columns"). Before padded views it was possible to index by selecting
> the relevant "columns" and view them as a standard array. With padded
> views that breaks and, AFAICS, there is no way in numpy 1.14.0 to compute
> a mean of some "columns". (I don't have numpy 1.14 to try or find a
> workaround, like maybe looping over all relevant columns.)
>
> Josef
Just to clarify, structured types have always had padding bytes, that isn't new.
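To make the point concrete, a minimal sketch (field names made up) showing that aligned structured dtypes have always carried padding bytes:

```python
import numpy as np

# Padding bytes in structured dtypes predate 1.14. With align=True, an
# i2 field followed by an i8 field gets 6 padding bytes so that 'b'
# lands on an 8-byte boundary: itemsize is 16, not 10.
aligned = np.dtype([('a', 'i2'), ('b', 'i8')], align=True)
print(aligned.itemsize)   # 16

# Without align=True the fields are packed back to back.
packed = np.dtype([('a', 'i2'), ('b', 'i8')])
print(packed.itemsize)    # 10
```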
What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case.
I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior.
In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields.
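A small sketch of the distinction (field names invented): single-field indexing yields a plain homogeneous view that supports reductions, while a multi-field index keeps the structured dtype, on which no arithmetic is defined:

```python
import numpy as np

a = np.zeros(4, dtype=[('b', 'f8'), ('c', 'f8')])
a['b'] = [1.0, 2.0, 3.0, 4.0]
a['c'] = [10.0, 20.0, 30.0, 40.0]

# Single-field indexing: a plain f8 view, reductions work as always.
print(a['b'].mean())        # 2.5

# Multi-field indexing: still a structured array (a padded view in
# later releases), so mean() and other binary/reduce ops don't apply.
sub = a[['b', 'c']]
print(sub.dtype.names)      # ('b', 'c')
```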
from the example in the other thread:

    a[['b', 'c']].view(('f8', 2)).mean(0)
(from the statsmodels use case:
read csv with genfromtxt to get a recarray or structured array
select/index the numeric columns
view them as a standard array
do whatever we can do with standard numpy arrays)
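A sketch of that pipeline (the CSV contents and column names here are invented), using a per-field copy instead of the view, so it does not depend on memory layout:

```python
import io
import numpy as np

# Stand-in for the CSV file.
csv = io.StringIO("time,u,v\n1,1.0,1.0\n2,2.0,2.0\n3,3.0,3.0\n")

# 1) genfromtxt with names=True returns a structured array.
rec = np.genfromtxt(csv, delimiter=',', names=True)

# 2-3) select the numeric fields and assemble a standard 2-D array.
# Stacking field by field copies the data, sidestepping the
# layout-sensitive .view() step that 1.14 broke.
plain = np.column_stack([rec[name] for name in ('u', 'v')])

# 4) ordinary ndarray operations now apply.
print(plain.mean(axis=0))   # [2. 2.]
```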
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
Or, to phrase it as a question:
How do we get a standard array with homogeneous dtype from the corresponding elements of a structured dtype in numpy 1.14.0?
Josef
The answer may be that "numpy has never had a way to do that", even if in a few special cases you might hack a workaround using views.
That's what your example seems like to me. It uses an explicit view, which is an "expert" feature since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.
Pandas can do row means without these problems:
>>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)
Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.
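One layout-independent alternative is to reduce each field separately, since single-field access always yields a plain homogeneous array (this uses the same 'i8,f8' dtype as the pandas example above; it is a sketch, not the thread's recommendation):

```python
import numpy as np

a = np.ones(10, dtype='i8,f8')   # auto-named fields 'f0' and 'f1'

# Per-field means without any .view() tricks: each a[name] is an
# ordinary homogeneous array, regardless of padding or field order.
means = {name: a[name].mean() for name in a.dtype.names}
print(means)    # {'f0': 1.0, 'f1': 1.0}
```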
Once upon a time (*) this wasn't fragile but the only and recommended way. Because dtypes were low level with clear memory layout and stayed that way, it was easy to check item size or whatever and get different views on it. e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html
(*) pre-pandas, pre-stackoverflow on the mailing lists, which was for me roughly 2008 to 2012, but a late thread: https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html "What is now the recommended way of converting structured dtypes/recarrays to ndarrays?"
One final historical note (once upon a time users relied on cookbooks): http://scipy-cookbook.readthedocs.io/items/Recarray.html#Converting-to-regular-arrays-and-reshaping 2010-03-09 (last modified), 2008-06-27 (created), which I assume is broken in numpy 1.14.0
I suggest that if we want to allow either means over fields, or conversion of a nD structured array to an n+1D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.
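For the record, a function along these lines did later land in numpy.lib.recfunctions as structured_to_unstructured (NumPy 1.16); a sketch of the suggested usage, with made-up field names:

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

a = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
             dtype=[('id', 'i8'), ('x', 'f8'), ('y', 'f8')])

# Convert selected fields to a plain homogeneous ndarray by value,
# independent of padding or binary layout (copies when necessary).
plain = structured_to_unstructured(a[['x', 'y']])
print(plain.mean(axis=0))   # [3.5 4.5]
```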
I don't really want to defend an obsolete (?) usecase of structured dtypes.
However, I think there should be a decision about the future plans for whether dataframe-like usage of structured dtypes, directly or through higher level classes or functions, is still supported. Instead of slowly and silently (*) removing the foundation for this use case, either support this usage or say you will be dropping it.
(*) I didn't read the details of the release notes
And another footnote about obsolete: Given that I'm the only one arguing for the dataframe-like use case of recarrays and structured dtypes, I think they are dead for this specific use case, and only my inertia and conservativeness kept them alive in statsmodels.
Josef
It's a bit of a stretch to say that we are "silently" dropping support for dataframe-like use of structured arrays.
First, we still allow pretty much all dataframe-like use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multi-field indexing, which didn't exist then (only introduced in 2009), and we have wanted to make them views (not copies) since their inception.
The 2008 mailing list thread introduced me to working with views on structured arrays as the ONLY way to switch between structured and homogeneous dtypes (if the underlying item size was homogeneous). The new stats.models started in 2009.
Second, I don't think we are doing so silently: We have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12 so you should have seen FutureWarnings since then.
If I see warnings in the test suite about getting a view instead of a copy from numpy, then the only/main consequence I think about is whether I need to watch out for in-place modification. I didn't expect that the follow-up computation would change, and that it's a padded view and not a view on the selected memory. However, I just checked and padding is mentioned in the 1.12 release notes (which I never read before).
AFAICS, one problem is that the padded view didn't come with matching downstream usage support: the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding, and so on.
e.g. another mailing list thread I just found with the same problem: http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.html
quoting Ralf: Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py.
This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.
Josef
I don't feel the need to officially declare that we are dropping support for dataframe-like use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves.
There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like
"the submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience."
Allan
On Tue, Jan 30, 2018 at 2:42 PM, <josef.pktd@gmail.com> wrote:
And a final grumpy note: https://docs.scipy.org/doc/numpy-1.14.0/release.html#multiple-field-indexing... "which will affect code such as" = "which will break your code without offering an alternative". Josef <back to regularly scheduled topics>
I suggest that if we want to allow either means over fields, or conversion of a nD structured array to an n+1D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.
I don't really want to defend an obsolete (?) usecase of structured dtypes.
However, I think there should be a decision about the future plans for whether dataframe like usages of structure dtypes or through higher level classes or functions are still supported, instead of removing slowly and silently (*) the foundation for this use case, either support this usage or say you will be dropping it.
(*) I didn't read the details of the release notes
And another footnote about obsolete: Given that I'm the only one arguing about the dataframe_like usecase of recarrays and structured dtypes, I think they are dead for this specific usecase and only my inertia and conservativeness kept them alive in statsmodels.
Josef
It's a bit of a stretch to say that we are "silently" dropping support for dataframelike use of structured arrays.
First, we still allow pretty much all dataframelike use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multifield indexing, which didn't exist then (only introduced in 2009), and we have wanted to make them views (not copies) since their inception.
The 2008 mailing list thread introduced me to the working with views on structured arrays as the ONLY way to switch between structured and homogenous dtypes (if the underlying item size was homogeneous). The new stats.models started in 2009.
Second, I don't think we are doing so silently: We have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12 so you should have seen FutureWarnings since then.
If I see warnings in the test suite about getting a view instead copy from numpy, then the only/main consequence I think about is whether I need to watch out for inline modification. I didn't expect that the followup computation would change, and that it's a padded view and not a view on the selected memory. However, I just checked and padding is mentioned in the 1.12 release notes (which I never read before, ).
AFAICS, one problem is that the padded view didn't come with the matching down stream usage support, the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding and so on.
eg. another mailing list thread I just found with the same problem http://numpydiscussion.10968.n7.nabble.com/viewofrecarray issuetd32001.html
quoting Ralf: Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py.
This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.
Josef
I don't feel the need to officially declare that we are dropping support for dataframelike use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves.
There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like
"the submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience."
Allan
Allan
Josef
Allan
> > Cheers! > Ben Root > > On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd@gmail.com <mailto:josef.pktd@gmail.com> <mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>> > <mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com> <mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>>>> wrote: > > > > On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt > <stefanv@berkeley.edu <mailto:stefanv@berkeley.edu> <mailto:stefanv@berkeley.edu <mailto:stefanv@berkeley.edu>> <mailto:stefanv@berkeley.edu <mailto:stefanv@berkeley.edu> <mailto:stefanv@berkeley.edu <mailto:stefanv@berkeley.edu>>>> wrote: > > On Mon, 29 Jan 2018 14:10:56 0500, josef.pktd@gmail.com <mailto:josef.pktd@gmail.com> <mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>> > <mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>
<mailto:josef.pktd@gmail.com <mailto:josef.pktd@gmail.com>>> wrote: > > Given that there is pandas, xarray, dask and more, numpy > could as well drop > any pretense of supporting dataframe_likes. Or, adjust > the recfunctions so > we can still work dataframe_like with structured > dtypes/recarrays/recfunctions. > > > I haven't been following the duckarray discussion carefully, > but could > this be an opportunity for a dataframe protocol, so that we > can have > libraries ingest structured arrays, record arrays, pandas > dataframes, > etc. without too much specialized code? > > > AFAIU while not being in the data handling area, pandas defines > the interface and other libraries provide pandas compatible > interfaces or implementations. > > statsmodels currently still has recarray support and usage. In > some interfaces we support pandas, recarrays and plain arrays, > or anything where asarray works correctly. > > But recarrays became messy to support, one rewrite of some > functions last year converts recarrays to pandas, does the > manipulation and then converts back to recarrays. > Also we need to adjust our recarray usage with new numpy > versions. But there is no real benefit because I doubt that > statsmodels still has any recarray/structured dtype users. So, > we only have to remove our own uses in the datasets and unit tests. 
Josef

Stéfan

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
On 01/30/2018 01:33 PM, josef.pktd@gmail.com wrote:
AFAICS, one problem is that the padded view didn't come with the matching downstream usage support: the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding, and so on.
E.g. another mailing list thread I just found with the same problem: http://numpydiscussion.10968.n7.nabble.com/viewofrecarrayissuetd32001.h...
quoting Ralf: Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py.
This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.
Josef
Since we are at least pushing off this change to a later release (1.15?), we have some time to prepare/catch up.

What can we add to numpy.lib.recfunctions to make the multifield copy->view change smoother? We have discussed at least two functions:

* repack_fields - rearrange the memory layout of a structured array to add/remove padding between fields

* structured_to_unstructured - turns an n-D structured array into an (n+1)-D unstructured ndarray, whose dtype is the highest common type of all the fields. May want the inverse function too.

We might also consider

* apply_along_fields(arr, method) - applies the method along the "field" axis, equivalent to something like method(struct_to_unstructured(arr), axis=1)

I think these are pretty minimal and shouldn't be too hard to implement.

Allan
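[For readers following the archive: the proposed repack_fields behavior can be illustrated with an aligned dtype. This is a minimal sketch assuming the function is available as numpy.lib.recfunctions.repack_fields, as it eventually shipped from numpy 1.16.]

```python
import numpy as np
import numpy.lib.recfunctions as rf

# An aligned structured dtype carries padding between fields:
# the 'u1' field is followed by 7 pad bytes so 'f8' starts on an
# 8-byte boundary, and the total itemsize is rounded up to 16.
aligned = np.dtype([('a', 'u1'), ('b', 'f8')], align=True)
print(aligned.itemsize)    # 16 (1 + 7 pad + 8)

# repack_fields removes the padding, producing the packed layout.
packed = rf.repack_fields(aligned)
print(packed.itemsize)     # 9 (1 + 8)
```

repack_fields accepts either a dtype (as here) or a structured array, in which case it returns a copy with the packed memory layout.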
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Since we are at least pushing off this change to a later release (1.15?), we have some time to prepare/catch up.
What can we add to numpy.lib.recfunctions to make the multifield copy>view change smoother? We have discussed at least two functions:
* repack_fields  rearrange the memory layout of a structured array to add/remove padding between fields
* structured_to_unstructured  turns a nD structured array into an (n+1)D unstructured ndarray, whose dtype is the highest common type of all the fields. May want the inverse function too.
The only sticky point with statsmodels is to have an equivalent of a[['b', 'c']].view(('f8', 2)).

Highest common dtype might be object; the main use case for this is to select some elements of a specific dtype and then use them as a standard, homogeneous ndarray. In our case and other cases that I have seen, it is mainly to select a subset of the floating point numbers. Another case of this might be to combine two strings into one, a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used this in serious code.

For the inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.

Browsing through old mailing list threads, I saw that adding multiple fields, or concatenating two arrays with structured dtypes into an array with a single combined dtype, was missing and I guess still is. (IIRC this is the use case where we now take the pandas detour in statsmodels.)
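[The inverse direction josef mentions does indeed still work: a homogeneous array can be viewed with a structured dtype when the itemsize along the last axis matches. A small sketch, with made-up field names:]

```python
import numpy as np

uv = np.ones((4, 2))    # homogeneous float64: 16 bytes per row

# View each 16-byte row as one structured element with two 'f8'
# fields; the last axis collapses because the itemsizes match
# (2 * 8 == 16), so the view has shape (4, 1).
dt = np.dtype([('u', 'f8'), ('v', 'f8')])
sv = uv.view(dt).reshape(uv.shape[0])

print(sv['u'])          # [1. 1. 1. 1.]

# It is a view: writing through it is visible in the original array.
sv['v'] = 7.0
print(uv[0, 1])         # 7.0
```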
We might also consider
* apply_along_fields(arr, method)  applies the method along the "field" axis, equivalent to something like method(struct_to_unstructured(arr), axis=1)
If this works on a padded view of an existing array, then this would be an improvement over the current version of having to extract and copy the relevant fields of an existing structured dtype, or loop over different numeric dtypes (ints, floats).

In general there will need to be a way to apply `method` only to selected columns, or to columns of a matching dtype. (E.g. we don't want the sum or mean of a string; e.g. we use ptp() on numeric fields to check if there is already a constant column in the array or dataframe.)
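[josef's ptp() use case can be sketched as follows. The field selection by dtype kind is my own illustration of "columns of a matching dtype", not statsmodels code:]

```python
import numpy as np

a = np.array([(1, 0.5, 1.0), (2, 0.7, 1.0), (3, 0.1, 1.0)],
             dtype=[('id', 'i4'), ('x', 'f8'), ('const', 'f8')])

# Select only the floating-point fields; we don't want ptp()
# of the integer id (or of a string field).
float_fields = [name for name in a.dtype.names
                if a.dtype.fields[name][0].kind == 'f']

# A field is a constant column when its peak-to-peak range is zero.
constant = [name for name in float_fields if np.ptp(a[name]) == 0]
print(constant)    # ['const']
```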
I think these are pretty minimal and shouldn't be too hard to implement.
AFAICS, it would cover the statsmodels usage. Josef
Allan
On 01/30/2018 04:54 PM, josef.pktd@gmail.com wrote:
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:
Since we are at least pushing off this change to a later release (1.15?), we have some time to prepare/catch up.
What can we add to numpy.lib.recfunctions to make the multifield copy>view change smoother? We have discussed at least two functions:
* repack_fields  rearrange the memory layout of a structured array to add/remove padding between fields
* structured_to_unstructured  turns a nD structured array into an (n+1)D unstructured ndarray, whose dtype is the highest common type of all the fields. May want the inverse function too.
The only sticky point with statsmodels is to have an equivalent of a[['b', 'c']].view(('f8', 2)).
Highest common dtype might be object; the main use case for this is to select some elements of a specific dtype and then use them as a standard, homogeneous ndarray. In our case and other cases that I have seen, it is mainly to select a subset of the floating point numbers. Another case of this might be to combine two strings into one, a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used this in serious code.
I implemented and put up a draft of these functions in https://github.com/numpy/numpy/pull/10411

I think they satisfy all your cases: code like

>>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>>> a[['b', 'c']].view(('f8', 2))

becomes:

>>> import numpy.lib.recfunctions as rf
>>> rf.structured_to_unstructured(a[['b', 'c']])
array([[1., 1.],
       [1., 1.],
       [1., 1.]])

The highest common dtype is usually not "object", since I use `np.result_type` to determine the output type. So two fields of 'S5' and 'S3' result in an 'S5' array.
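[Allan's promotion claim can be checked directly with plain numpy, independently of the PR:]

```python
import numpy as np

# result_type promotes two fixed-width byte-string dtypes to the
# wider one rather than falling back to object.
out = np.result_type(np.dtype('S5'), np.dtype('S3'))
print(out)    # |S5
```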
For the inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.
The inverse is implemented too. And it even supports varied field dtypes, nested fields, and subarrays, as you can see in the docstring examples.
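[The inverse from the PR landed as numpy.lib.recfunctions.unstructured_to_structured (available since numpy 1.16); a minimal sketch of its basic use:]

```python
import numpy as np
import numpy.lib.recfunctions as rf

uv = np.ones((3, 2))

# Fold the last axis of an unstructured array back into named fields.
dt = np.dtype([('u', 'f8'), ('v', 'f8')])
s = rf.unstructured_to_structured(uv, dt)

print(s.shape)    # (3,)
print(s['v'])     # [1. 1. 1.]
```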
Browsing through old mailing list threads, I saw that adding multiple fields or concatenating two arrays with structured dtypes into an array with a single combined dtype was missing and I guess still is. (IIRC this is the usecase where we go now the pandas detour in statsmodels.)
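[For what it's worth, numpy.lib.recfunctions already has merge_arrays, which covers at least part of the "combine fields from two arrays" case described above; whether it fits the statsmodels use case is another question. A sketch:]

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.array([(1,), (2,)], dtype=[('x', 'i8')])
b = np.array([(1.5,), (2.5,)], dtype=[('y', 'f8')])

# Zip two structured arrays into one array with a combined dtype.
merged = rf.merge_arrays((a, b), flatten=True)
print(merged.dtype.names)    # ('x', 'y')
```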
We might also consider
* apply_along_fields(arr, method)  applies the method along the "field" axis, equivalent to something like method(struct_to_unstructured(arr), axis=1)
If this works on a padded view of an existing array, then this would be an improvement over the current version of having to extract and copy the relevant fields of an existing structured dtype or loop over different numeric dtypes, ints, floats.
In general there will need to be a way to apply `method` only to selected columns, or columns of a matching dtype. (e.g. We don't want the sum or mean of a string.) (e.g. we use ptp() on numeric fields to check if there is already a constant column in the array or dataframe)
Means over selected columns are accounted for using multifield indexing. For example:

>>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

>>> rf.apply_along_fields(np.mean, b)
array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])

>>> rf.apply_along_fields(np.mean, b[['x', 'z']])
array([ 3. ,  5.5,  9. , 11. ])

This is unaffected by the 1.14 to 1.15 changes.

Allan
On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
On 01/30/2018 04:54 PM, josef.pktd@gmail.com wrote:
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com <mailto:allanhaldane@gmail.com>> wrote:
I implemented and put up a draft of these functions in https://github.com/numpy/numpy/pull/10411
Comments based on reading the last commit:
I think they satisfy all your cases: code like
>>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
>>> a[['b', 'c']].view(('f8', 2))
becomes:
>>> import numpy.lib.recfunctions as rf >>> rf.structured_to_unstructured(a[['b', 'c']]) array([[1., 1.], [1., 1.], [1., 1.]])
The highest common dtype is usually not "Object", since I use `np.result_type` to determine the output type. So two fields of 'S5' and 'S3' result in an 'S5' array.
structured_to_unstructured looks good to me
For the inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.
The inverse is implemented too. And it even supports varied field dtypes, nested fields, and subarrays, as you can see in the docstring examples.
Browsing through old mailing list threads, I saw that adding multiple fields or concatenating two arrays with structured dtypes into an array with a single combined dtype was missing and I guess still is. (IIRC this is the usecase where we go now the pandas detour in statsmodels.)
We might also consider
* apply_along_fields(arr, method)  applies the method along the "field" axis, equivalent to something like method(struct_to_unstructured(arr), axis=1)
If this works on a padded view of an existing array, then this would be an improvement over the current version of having to extract and copy the relevant fields of an existing structured dtype or loop over different numeric dtypes, ints, floats.
In general there will need to be a way to apply `method` only to selected columns, or columns of a matching dtype. (e.g. We don't want the sum or mean of a string.) (e.g. we use ptp() on numeric fields to check if there is already a constant column in the array or dataframe)
Means over selected columns are accounted for using multifield indexing. For example:
>>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8 ,11), (10, 11, 12)], ... dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
>>> rf.apply_along_fields(np.mean, b) array([ 2.66666667, 5.33333333, 8.66666667, 11. ])
>>> rf.apply_along_fields(np.mean, b[['x', 'z']]) array([ 3. , 5.5, 9. , 11. ])
Actually, I would have expected apply_along_columns, i.e. reducing over all observations in each field. This might need an axis argument.

However, in the current form it is less practical than doing it ourselves with structured_to_unstructured, because it makes a copy of all elements each time. E.g.

rf.apply_along_fields(np.mean, b[['x', 'z']])
rf.apply_along_fields(np.std, b[['x', 'z']])

would do the same structured_to_unstructured copy of all array elements twice.

Josef
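[josef's point about repeated copies suggests the obvious workaround: convert once with structured_to_unstructured, then apply as many reductions as needed to the result. A sketch, using the function from the PR as it shipped in numpy >= 1.16 and Allan's example data:]

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# One copy into a homogeneous float array, then several reductions
# without re-copying the selected fields each time.
u = rf.structured_to_unstructured(b[['x', 'z']])
means = u.mean(axis=1)
stds = u.std(axis=1)

print(means.tolist())    # [3.0, 5.5, 9.0, 11.0]
```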
participants (7)

- Allan Haldane
- Benjamin Root
- Chris Barker
- Chris Barker - NOAA Federal
- Eric Wieser
- josef.pktd@gmail.com
- Stefan van der Walt