Setting custom dtypes and 1.14
Hi all,

I'm pretty sure this is the same thing as recently discussed on this list about 1.14, but to confirm:

I had failures in my code after upgrading to 1.14. It turned out to be a single line in a single test fixture, so no big deal, but it is a regression just the same, with no deprecation warning.

I was essentially doing this:

    In [48]: dt
    Out[48]: dtype([('time', '<i8'), ('value', [('u', '<f8'), ('v', '<f8')])], align=True)

    In [49]: uv
    Out[49]:
    array([[1., 1.],
           [1., 1.],
           [1., 1.],
           [1., 1.]])

    In [50]: time
    Out[50]: array([1, 1, 1, 1])

    In [51]: full = np.array(zip(time, uv), dtype=dt)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-51-ed726f71dd4a> in <module>()
    ----> 1 full = np.array(zip(time, uv), dtype=dt)

    ValueError: setting an array element with a sequence.

It took some poking, but the solution was to do:

    full = np.array(zip(time, (tuple(w) for w in uv)), dtype=dt)

That is, convert the values to nested tuples, rather than an array in a tuple, or a list in a tuple.

As I said, my problem is solved, but to confirm:

1) This is a known change with good reason?

2) My solution was the best (only) one -- the only way to set a nested dtype like that is with tuples?

If so, then I think we should:

A) improve the error message.

"ValueError: setting an array element with a sequence." is not really clear -- I spent a while trying to figure out how I could set a nested dtype like that without a sequence. And I was actually using an ndarray, so it wasn't even a generic sequence. And a tuple is a sequence, too...

I had a vague recollection that in some circumstances, numpy treats tuples and lists (and arrays) differently (fancy indexing??), so I tried the tuple thing and that worked. But I've been around numpy a long time -- that could have been very very confusing to many people.

So could the message be changed to something like:

"ValueError: setting an array element with a generic sequence. Only the tuple type can be used in this context."

or something like that -- I'm not sure where else this same error message might pop up, so that could be totally inappropriate.

B) maybe add a .totuple() method to ndarray, much like the .tolist() method? That would have been handy here.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
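For readers on Python 3, where `zip` returns an iterator rather than a list, the tuple-based fix can be sketched roughly as below. This is only an illustration that assumes the `dt`, `uv` and `time` values shown in the session; the field-by-field variant (`full_by_field`) is an alternative that was not part of the original message.

```python
import numpy as np

dt = np.dtype([('time', '<i8'), ('value', [('u', '<f8'), ('v', '<f8')])],
              align=True)
uv = np.ones((4, 2))
time = np.ones(4, dtype='i8')

# Each record is a tuple, and the nested 'value' field is itself a tuple.
# A list comprehension is used so that this also runs on Python 3.
full = np.array([(t, tuple(w)) for t, w in zip(time, uv)], dtype=dt)

# Alternative (an assumption, not from the thread): allocate first, then
# fill field by field, which avoids building Python tuples at all.
full_by_field = np.zeros(4, dtype=dt)
full_by_field['time'] = time
full_by_field['value']['u'] = uv[:, 0]
full_by_field['value']['v'] = uv[:, 1]
```

Both arrays end up holding the same four records; the second form keeps the copying inside numpy.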
On 01/25/2018 06:06 PM, Chris Barker wrote:
This change is a little different from what we discussed before. The change occurred because the old assignment behavior was dangerous, and was not doing what you thought.

If you modify your dtype above, changing both 'f8' fields to 'f4', you will see you get very strange results: your array gets filled in with the values (1, ( 0., 1.875)).

Here's what happened: previously, numpy was *not* iterating your data as a sequence. Instead, if numpy did not find a tuple it would interpret the data as a raw buffer and copy the value byte-by-byte, ignoring endianness, casting, stride, etc. You can get even weirder results if you do `uv = uv.astype('i4')`, for example.

It happened to work for you because ndarrays expose a buffer interface, and you were assigning using exactly the same type and endianness.

In 1.14 the fix was to disallow this 'buffer' assignment for structured arrays, since it was causing quite confusing bugs. Unstructured "void" arrays still do this though.
2) My solution was the best (only) one -- the only way to set a nested dtype like that is with tuples?
Right, our solution was to only allow assignment from tuples. We might be able to relax that for structured scalars, but for arrays I remember one consideration was to avoid confusion with array broadcasting. If you do

    >>> x = np.zeros(2, dtype='i4,i4')
    >>> x[:] = np.array([3, 4])
    >>> x
    array([(3, 3), (4, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])

it might be the opposite of what you expect. Compare to

    >>> x[:] = (3, 4)
    >>> x
    array([(3, 4), (3, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])
Good idea. I'll see if we can do it for 1.14.1.
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful. -CHB
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
Two thoughts:

1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.

2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples? That would raise an exception if read back in by `np.array` with the same dtype.

These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions. If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?

    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

Then you could do:

    >>> pack_last_axis(uv).tolist()

to get a list of tuples.

Allan
Why is the list of tuples a useful thing to have in the first place? If the goal is to convert an array into a structured array, you can do that far more efficiently with:

    def make_tup_dtype(arr):
        """
        Attempt to make a type capable of viewing the last axis of an array,
        even if it is non-contiguous. Unfortunately `.view` doesn't allow us
        to use this dtype in that case, which needs a patch...
        """
        n_fields = arr.shape[-1]
        step = arr.strides[-1]
        descr = dict(names=[], formats=[], offsets=[], itemsize=step * n_fields)
        for i in range(n_fields):
            descr['names'].append('f{}'.format(i))
            descr['offsets'].append(step * i)
            descr['formats'].append(arr.dtype)
        return np.dtype(descr)

Used as:

Perhaps this should be provided by recfunctions (or maybe it already is, in a less rigid form?)

Eric

On Fri, 26 Jan 2018 at 10:48 Allan Haldane <allanhaldane@gmail.com> wrote:
On Fri, Jan 26, 2018 at 10:48 AM, Allan Haldane <allanhaldane@gmail.com> wrote:
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
I was thinking it would be exactly like .tolist() but with tuples -- so you'd get tuples all the way down (or is that turtles?)

In this use case, it would have saved me the generator expression:

    (tuple(r) for r in arr)

not a huge deal, but it would be nice to not have to write that, and to have the looping be in C with no intermediate array generation.

2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples?
no -- why? it would return a tuple of tuples instead.
That would raise an exception if read back in by `np.array` with the same dtype.
Hmm -- indeed, if the top-level structure is a tuple, the array constructor gets confused.

This works fine -- as it should:

    In [84]: new_full = np.array(full.tolist(), full.dtype)

But this does not:

    In [85]: new_full = np.array(tuple(full.tolist()), full.dtype)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-85-c305063184ff> in <module>()
    ----> 1 new_full = np.array(tuple(full.tolist()), full.dtype)

    ValueError: could not assign tuple of length 4 to structure with 2 fields.

I was hoping it would dig down to the inner structures looking for a match to the dtype, rather than looking at the type of the top level. Oh well.

So yeah, not sure where you would go from tuple to list -- probably at the bottom level, but that may not always be unambiguous.

These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
not sure what the idea is here -- in my example, I had a regular 2-d array, so no names:

    In [90]: pack_last_axis(uv)
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-90-a75ee44c8401> in <module>()
    ----> 1 pack_last_axis(uv)

    <ipython-input-89-cfbc76779d1f> in pack_last_axis(arr, names)
          1 def pack_last_axis(arr, names=None):
    ----> 2     if arr.names:
          3         return arr
          4     names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
          5     return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

    AttributeError: 'numpy.ndarray' object has no attribute 'names'

So maybe you meant something like:

    In [95]: def pack_last_axis(arr, names=None):
        ...:     try:
        ...:         arr.names
        ...:         return arr
        ...:     except AttributeError:
        ...:         names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
        ...:         return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

which does work, but seems like a convoluted way to get tuples!

However, I didn't actually need tuples, I needed something I could pack into a structured array, and this does work, without the tolist:

    full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)

So maybe that is the way to go.

I'm not sure I'd have thought to look for this function, but what can you do?

Thanks for your attention to this,

-CHB
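One possible tidier variant of `pack_last_axis` is sketched below. It checks `arr.dtype.names` (which is `None` for a plain ndarray) instead of catching `AttributeError`, and it makes the input contiguous first, since the view trick needs the last axis to be contiguous. The contiguity step is an added assumption; it was not part of the function proposed in the thread.

```python
import numpy as np

def pack_last_axis(arr, names=None):
    """View the last axis of a plain ndarray as the fields of a structured array."""
    # Already structured: nothing to pack.
    if arr.dtype.names is not None:
        return arr
    # The view below reinterprets memory, so the last axis must be contiguous.
    arr = np.ascontiguousarray(arr)
    names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
    # One field per element of the last axis, then drop the length-1 axis.
    return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

uv = np.ones((4, 2))
packed = pack_last_axis(uv, names=['u', 'v'])
as_tuples = packed.tolist()   # a list of 2-tuples, one per row of uv
```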
On 01/26/2018 03:38 PM, Chris Barker wrote:
As I remember, numpy has some fairly convoluted code for array creation which tries to make sense of various nested lists/tuples/ndarray combinations. It makes a difference for structured arrays and object arrays. I don't remember the details right now, but I know in some cases the rule is "If it's a Python list, recurse, otherwise assume it is an object array".

While numpy does try to be lenient, I think we should guide the user to assume that if they want to specify a structured element, they should only use a tuple or a structured scalar, and if they want to specify a new dimension of elements, they should use a list. I expect fewer headaches that way.
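The "tuple for an element, list for a dimension" guideline can be illustrated with a small sketch. The dtype and values here are made up for the example.

```python
import numpy as np

pair = np.dtype([('a', 'i4'), ('b', 'i4')])

# Tuples are read as structured elements; the enclosing list is a dimension.
one_d = np.array([(1, 2), (3, 4)], dtype=pair)      # two records

# Wrapping in one more list adds a leading dimension of length 1.
two_d = np.array([[(1, 2), (3, 4)]], dtype=pair)    # 1 x 2 records
```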
Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions". It is actually quite old, but has never been officially documented. I think that is because it has been considered "provisional" for a long time. See

https://github.com/numpy/numpy/issues/5008
https://github.com/numpy/numpy/issues/2805

I still hesitate to make it more official now, since I'm not sure that structured arrays are yet bug-free enough to encourage more complex uses. Also, the functions in that module encourage "pandas-like" use of structured arrays, but I'm not sure they should be used that way. I've been thinking they should be primarily used for binary interfaces with/to numpy, e.g. to talk to C programs or to read complicated binary files.
Right, that was my feeling: That we didn't really need `.totuple`, what we actually wanted is a special function for packing a nonstructured-array as a structured-array.
On Fri, Jan 26, 2018 at 2:35 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
that's at least explainable, and the "try to figure out what the user means" array creation is pretty much an impossible problem, so what we've got is probably about as good as it can get.
thanks -- found it.
that's my use-case. And I agree -- if you really want to do that kind of thing, pandas is the way to go. I thought recarrays were pretty cool back in the day, but pandas is a much better option. So I pretty much only use structured arrays for data exchange with C code....

-CHB
On Fri, Jan 26, 2018 at 5:48 PM, Chris Barker <chris.barker@noaa.gov> wrote:
My impression is that this turns into a question of deprecating recarrays and the supporting recfunctions. recfunctions and the associated functions from matplotlib.mlab were explicitly designed for using structured dtypes as dataframe_like.

(old question: does numpy have a sort_rows function now without detouring to structured dtype views?)

Josef
<all code needs to be rewritten every 5 to 10 years.>
On 01/26/2018 06:01 PM, josef.pktd@gmail.com wrote:
No, that's still the way to do it. *should* we have any dataframe-like functionality in numpy? We get requests every once in a while about how to sort rows, or about adding a "groupby" function. I myself have used recarrays in a dataframe-like way, when I wanted a quick multiple-array object that supported numpy indexing. So there is some demand to have minimal "dataframe-like" behavior in numpy itself. recarrays play part of this role currently, though imperfectly due to padding and cache issues. I think I'm comfortable with supporting some minor use of structured/recarrays as dataframe-like, with a warning in docs that the user should really look at pandas/xarray, and that structured arrays are primarily for data exchange. (If we want to dream, maybe one day we should make a minimal multiple-array container class. I imagine it would look pretty similar to recarray, but stored as a set of arrays instead of a structured array. But maybe recarrays are good enough, and let's not reimplement pandas either.) Allan
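As a rough illustration of the "detour through a structured dtype view" for sorting rows, here is a minimal sketch. `sort_rows` is a hypothetical helper name, not a numpy function, and the `np.lexsort` line is shown only as an alternative that avoids structured dtypes.

```python
import numpy as np

def sort_rows(a):
    """Return the rows of a 2-D array in lexicographic order (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    # View each row as one structured element with a field per column.
    row_dtype = [('f{}'.format(i), a.dtype) for i in range(a.shape[1])]
    rows = a.view(row_dtype).ravel()
    # Sorting structured elements compares the fields in order.
    return a[np.argsort(rows)]

a = np.array([[3, 1], [1, 2], [1, 0]])
by_view = sort_rows(a)

# Alternative without structured dtypes: lexsort treats the LAST key as the
# primary one, so the columns are passed in reverse order.
by_lexsort = a[np.lexsort(a.T[::-1])]
```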
On Sat, Jan 27, 2018 at 8:50 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Well, I think we should either:

deprecate recarrays -- i.e. explicitly not support DataFrame-like functionality in numpy, keeping only the data-exchange functionality as maintained.

or

Properly support it -- which doesn't mean re-implementing Pandas or xarray, but it would mean addressing any bug-like issues like not dealing properly with padding.

Personally, I don't need/want it enough to contribute, but if someone does, great.

This reminds me a bit of the old numpy.Matrix issue -- it was ALMOST there, but not quite, with issues, and there was essentially no overlap between the people that wanted it and the people that had the time and skills to really make it work.

(If we want to dream, maybe one day we should make a minimal multiple-array

Exactly -- we really don't need to re-implement Pandas.... (except its CSV reading capability :-) )

-CHB
I think that there's a lot of confusion going around about recarrays vs structured arrays.

A [`recarray`](https://github.com/numpy/numpy/blob/v1.13.0/numpy/core/records.py) is a wrapper around a structured array that provides:

* Attribute access to fields as `arr.field` in addition to the normal `arr['field']`
* Automatic datatype-guessing for nested lists of tuples (which needs a little work, but seems like a justifiable feature)
* An undocumented `field` method that behaves like the 1.14 indexing behavior (!)

Meanwhile, `recfunctions` is a collection of functions that work on normal structured arrays - so is misleadingly named. The only link to recarrays is that most of the functions have an `asrecarray` parameter which applies `.view(recarray)` to the result.
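A short sketch of how thin the wrapper is: viewing a structured array as a `recarray` only adds attribute access and shares the same memory. The field names and values are made up for the example.

```python
import numpy as np

arr = np.array([(1, 2.0), (3, 4.0)], dtype=[('x', 'i4'), ('y', 'f8')])

# No copy is made; rec and arr share the same buffer.
rec = arr.view(np.recarray)

x_by_attr = rec.x      # attribute access, recarray only
x_by_key = arr['x']    # normal field access, works on both

# Writing through the recarray is visible in the structured array.
rec.y[0] = 10.0
```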
deprecate recarrays
Given how thin an abstraction they are over structured arrays, I don't think you mean this. Are you advocating for deprecating structured arrays entirely, or just deprecating recfunctions? Eric On Mon, 29 Jan 2018 at 09:39 Chris Barker <chris.barker@noaa.gov> wrote:
On Mon, Jan 29, 2018 at 1:22 PM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
First, statsmodels is in the pandas camp for dataframes, so I don't have any vested interest in recarrays/structured dtypes anymore.

What I meant was that structured dtypes with implicit (hidden?) padding become unintuitive for the recarray/dataframe usecase. (At least I won't try to update my intuition about having extra things in there that are not specified by the main structured dtype.) Also the dataframe_like usage of structured dtypes doesn't seem to be much under consideration anymore.

So, my **impression** is that the recent changes make the recarray/dataframe usecase for structured dtypes more difficult.

Given that there is pandas, xarray, dask and more, numpy could as well drop any pretense of supporting dataframe_likes. Or, adjust the recfunctions so we can still work dataframe_like with structured dtypes/recarrays/recfunctions.

Josef
On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd@gmail.com wrote:
I haven't been following the duckarray discussion carefully, but could this be an opportunity for a dataframe protocol, so that we can have libraries ingest structured arrays, record arrays, pandas dataframes, etc. without too much specialized code? Stéfan
On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
AFAIU while not being in the data handling area, pandas defines the interface and other libraries provide pandas compatible interfaces or implementations. statsmodels currently still has recarray support and usage. In some interfaces we support pandas, recarrays and plain arrays, or anything where asarray works correctly. But recarrays became messy to support, one rewrite of some functions last year converts recarrays to pandas, does the manipulation and then converts back to recarrays. Also we need to adjust our recarray usage with new numpy versions. But there is no real benefit because I doubt that statsmodels still has any recarray/structured dtype users. So, we only have to remove our own uses in the datasets and unit tests. Josef
I <3 structured arrays. I love the fact that I can access data by row and then by fieldname, or vice versa. There are times when I need to pass just a column into a function, and there are times when I need to process things row by row. Yes, pandas is nice if you want the specialized indexing features, but it becomes a bear to deal with if all you want is normal indexing, or even the ability to easily loop over the dataset. Cheers! Ben Root On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd@gmail.com> wrote:
On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
I don't think there is a doubt that structured arrays, arrays with structured dtypes, are a useful container. The question is whether they should be more or the foundation for more. For example, computing a mean, or reduce operation, over numeric element ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as standard array. With padded views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or find a workaround, like maybe looping over all relevant columns.) Josef
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
Just to clarify, structured types have always had padding bytes, that isn't new. What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case. I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding-bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior. In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields. Allan
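The point that padding does not get in the way of single-field access can be sketched as follows. The dtype is made up for the example, and the exact amount of padding depends on the platform's alignment rules, so the code only checks that some padding is present.

```python
import numpy as np

# align=True inserts padding after the 1-byte field so that 'value' is aligned.
padded = np.dtype([('flag', 'i1'), ('value', 'f8')], align=True)
a = np.zeros(4, dtype=padded)
a['value'] = [1.0, 2.0, 3.0, 4.0]

# The item size exceeds the 9 bytes of actual data because of the padding.
has_padding = padded.itemsize > 9

# Single-field indexing gives an ordinary strided float64 view; the padding
# only shows up as a larger stride, so reductions work as usual.
col = a['value']
col_mean = col.mean()
```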
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
from the example in the other thread

    a[['b', 'c']].view(('f8', 2)).mean(0)

(from the statsmodels usecase:
 read csv with genfromtxt to get recarray or structured array
 select/index the numeric columns
 view them as standard array
 do whatever we can do with standard numpy arrays
)

Josef
On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
The answer may be that "numpy has never had a way to do that", even if in a few special cases you might hack a workaround using views.

That's what your example seems like to me. It uses an explicit view, which is an "expert" feature since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.

Pandas can do row means without these problems:

    >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)

Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.

I suggest that if we want to allow either means over fields, or conversion of an n-D structured array to an n+1-D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.

Allan
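One layout-independent way to get an (n+1)-D array out of a structured array is to copy the fields out one by one and stack them. The sketch below uses a hypothetical helper name, `fields_to_ndarray`, which is not a numpy function; it always copies, and it relies on numpy's normal type promotion when the field dtypes differ.

```python
import numpy as np

def fields_to_ndarray(arr, names=None):
    """Stack the selected fields along a new last axis (hypothetical helper)."""
    names = arr.dtype.names if names is None else names
    # arr[n] is a single-field view, so padding and field order do not matter.
    return np.stack([arr[n] for n in names], axis=-1)

a = np.ones(10, dtype='i8,f8')
dense = fields_to_ndarray(a)          # shape (10, 2), promoted to float64

field_means = dense.mean(axis=0)      # one mean per field ("column")
record_means = dense.mean(axis=-1)    # one mean per record ("row")
```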
On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Once upon a time (*) this wasn't fragile but the only and recommended way. Because dtypes were low level with clear memory layout and stayed that way, it was easy to check item size or whatever and get different views on it. e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html (*) pre-pandas, pre-stackoverflow on the mailing lists which was for me roughly 2008 to 2012 but a late thread https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html "What is now the recommended way of converting structured dtypes/recarrays to ndarrays?"
I don't really want to defend an obsolete (?) usecase of structured dtypes.

However, I think there should be a decision about the future plans for whether dataframe-like usages of structured dtypes, directly or through higher level classes or functions, are still supported, instead of removing slowly and silently (*) the foundation for this use case: either support this usage or say you will be dropping it.

(*) I didn't read the details of the release notes

And another footnote about obsolete: Given that I'm the only one arguing about the dataframe_like usecase of recarrays and structured dtypes, I think they are dead for this specific usecase and only my inertia and conservativeness kept them alive in statsmodels.

Josef
Because dtypes were low level with clear memory layout and stayed that way

Dtypes have supported padded and out-of-order-fields since at least 2005 (v0.8.4) <https://github.com/numpy/numpy/blob/4772f10191f87a3446f4862de6d4b953e0dd95ff...>, and I would guess that the memory layout has not changed since.

The house has always been made out of glass, it just didn’t look fragile until we showed people where the stones were.

On Mon, 29 Jan 2018 at 20:51 <josef.pktd@gmail.com> wrote:
On Tue, Jan 30, 2018 at 3:24 AM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
Even so, I don't remember any problems with it. There might have been stones on the side streets and alleys, but 1.14.0 puts a big padded stone right in the front of the drive way. (Maybe only the solarium was made out of glass, now it's also the billiard room.) (I never had to learn about padding and I don't remember having any related problems getting statsmodels through Debian testing on various machine types.) Josef
On 01/29/2018 11:50 PM, josef.pktd@gmail.com wrote:
It's a bit of a stretch to say that we are "silently" dropping support for dataframe-like use of structured arrays. First, we still allow pretty much all dataframe-like use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multi-field indexing, which didn't exist then (only introduced in 2009), and we have wanted to make them views (not copies) since their inception. Second, I don't think we are doing so silently: We have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12 so you should have seen FutureWarnings since then. I don't feel the need to officially declare that we are dropping support for dataframe-like use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves. There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like "the submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience." Allan
On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
The 2008 mailing list thread introduced me to working with views on structured arrays as the ONLY way to switch between structured and homogeneous dtypes (if the underlying item size was homogeneous). The new statsmodels started in 2009.
If I see warnings in the test suite about getting a view instead of a copy from numpy, then the only/main consequence I think about is whether I need to watch out for inplace modification. I didn't expect that the followup computation would change, and that it's a padded view and not a view on the selected memory. However, I just checked and padding is mentioned in the 1.12 release notes (which I never read before).

AFAICS, one problem is that the padded view didn't come with the matching downstream usage support: the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding, and so on.

e.g. another mailing list thread I just found with the same problem
http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.h...

quoting Ralf:

Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py.

This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.

Josef
On Tue, Jan 30, 2018 at 1:33 PM, <josef.pktd@gmail.com> wrote:
one final historical note (once upon a time users relied on cookbooks)

http://scipy-cookbook.readthedocs.io/items/Recarray.html#Converting-to-regular-arrays-and-reshaping

2010-03-09 (last modified), 2008-06-27 (created)

which I assume is broken in numpy 1.14.0
On Tue, Jan 30, 2018 at 2:42 PM, <josef.pktd@gmail.com> wrote:
and a final grumpy note https://docs.scipy.org/doc/numpy-1.14.0/release.html#multiple-field-indexing... " which will affect code such as" = "which will break your code without offering an alternative" Josef <back to regular scheduled topics>
On 01/30/2018 01:33 PM, josef.pktd@gmail.com wrote:
Since we are at least pushing off this change to a later release (1.15?), we have some time to prepare/catch up.

What can we add to numpy.lib.recfunctions to make the multi-field copy->view change smoother? We have discussed at least two functions:

* repack_fields - rearrange the memory layout of a structured array to add/remove padding between fields
* structured_to_unstructured - turns an n-D structured array into an (n+1)-D unstructured ndarray, whose dtype is the highest common type of all the fields. May want the inverse function too.

We might also consider

* apply_along_fields(arr, method) - applies the method along the "field" axis, equivalent to something like method(structured_to_unstructured(arr), axis=-1)

I think these are pretty minimal and shouldn't be too hard to implement.

Allan
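To make the `repack_fields` idea concrete, here is a small sketch. It assumes a NumPy release recent enough to ship `numpy.lib.recfunctions.repack_fields`; the function is not available in 1.14, where this discussion took place. Whether the multi-field index itself returns a padded view or a packed copy depends on the NumPy version, so the sketch only checks the repacked result.

```python
import numpy as np
from numpy.lib import recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Multi-field index: in newer NumPy this is a view that keeps the original
# 24-byte item size, with the bytes of field 'a' left as padding.
sub = a[['b', 'c']]

# repack_fields copies into a layout with no padding between fields.
packed = rf.repack_fields(sub)
```

After repacking, the item size is just the 16 bytes of the two remaining `f8` fields, which is the layout the old `.view(('f8', 2))` idiom relied on.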
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
The only sticky point with statsmodels is to have an equivalent of a[['b', 'c']].view(('f8', 2)).

Highest common dtype might be object; the main usecase for this is to select some elements of a specific dtype and then use them as a standard, homogeneous ndarray. In our case and other cases that I have seen it is mainly to select a subset of the floating point numbers. Another case of this might be to combine two strings into one, a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used this in serious code.

for inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.

Browsing through old mailing list threads, I saw that adding multiple fields or concatenating two arrays with structured dtypes into an array with a single combined dtype was missing and I guess still is. (IIRC this is the usecase where we go now the pandas detour in statsmodels.)
If this works on a padded view of an existing array, then this would be an improvement over the current version of having to extract and copy the relevant fields of an existing structured dtype or loop over different numeric dtypes, ints, floats. In general there will need to be a way to apply `method` only to selected columns, or columns of a matching dtype. (e.g. We don't want the sum or mean of a string.) (e.g. we use ptp() on numeric fields to check if there is already a constant column in the array or dataframe)
I think these are pretty minimal and shouldn't be too hard to implement.
AFAICS, it would cover the statsmodels usage. Josef
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/30/2018 04:54 PM, josef.pktd@gmail.com wrote:
I implemented and put up a draft of these functions in https://github.com/numpy/numpy/pull/10411

I think they satisfy all your cases: code like

    >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
    >>> a[['b', 'c']].view(('f8', 2))

becomes:

    >>> import numpy.lib.recfunctions as rf
    >>> rf.structured_to_unstructured(a[['b', 'c']])
    array([[1., 1.],
           [1., 1.],
           [1., 1.]])

The highest common dtype is usually not "Object", since I use `np.result_type` to determine the output type. So two fields of 'S5' and 'S3' result in an 'S5' array.
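As a rough illustration of the dtype rule described above, the sketch below shows how `np.result_type` resolves a common type for mixed fields. It assumes a NumPy release that ships `structured_to_unstructured` in `numpy.lib.recfunctions` (1.16 or later, as far as I know); the array `a` and its field dtypes are made up for the example.

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Mixed numeric fields promote to the widest numeric type, not to object.
common_numeric = np.result_type(np.dtype('i4'), np.dtype('f4'), np.dtype('f8'))

# Two byte-string fields promote to the longer string type.
common_bytes = np.result_type(np.dtype('S5'), np.dtype('S3'))

# Illustrative array (an assumption for this sketch, not from the thread).
a = np.ones(3, dtype=[('a', 'i4'), ('b', 'f4'), ('c', 'f8')])

# Select two fields and convert them to a plain (3, 2) float array.
plain = rf.structured_to_unstructured(a[['b', 'c']])
```

The conversion copies the selected fields, so `plain` can be used like any other homogeneous ndarray without touching `a`.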
for inverse function: I guess it is still possible to view any standard homogenous ndarray with a structured dtype as long as the itemsize matches.
The inverse is implemented too. And it even supports varied field dtypes, nested fields, and subarrays, as you can see in the docstring examples.
Means over selected columns are accounted for using multi-field indexing. For example:

    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
    >>> rf.apply_along_fields(np.mean, b)
    array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])
    >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
    array([ 3. ,  5.5,  9. , 11. ])

This is unaffected by the 1.14 to 1.15 changes.

Allan
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Comments based on reading the last commit
structured_to_unstructured looks good to me
actually, I would have expected apply_along_columns, i.e. reduce over all observations for each field. This might need an axis argument. However, in the current form it is less practical than doing it ourselves with structured_to_unstructured because it makes a copy of all elements each time. e.g.

    rf.apply_along_fields(np.mean, b[['x', 'z']])
    rf.apply_along_fields(np.std, b[['x', 'z']])

would do the same structured_to_unstructured copy of all array elements twice.

Josef
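The "do it ourselves" alternative mentioned above could look roughly like this: convert the selected fields once, then run as many reductions as needed on the plain array. This is only a sketch, assuming a NumPy version where `numpy.lib.recfunctions.structured_to_unstructured` is available; the variable names are illustrative.

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# One conversion (one copy) of the selected fields to a plain (4, 2) array.
xz = rf.structured_to_unstructured(b[['x', 'z']])

# Reduce over observations (axis=0): one statistic per field.
col_mean = xz.mean(axis=0)
col_std = xz.std(axis=0)

# Reduce over fields (axis=-1): one statistic per record.
row_mean = xz.mean(axis=-1)
```

Because `xz` is an ordinary ndarray, the usual `axis` argument covers both the per-field and the per-record case.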
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 7:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
IIUC, the core use-case of structured dtypes is binary compatibility with external systems (arrays of C structs, mostly) -- at least that's how I use them :-)

In which case, "conversion of a n-D structured array to an n+1-D regular ndarray" is an important feature -- actually even more important if you don't use recarrays. So yes, let's have a utility to make that easy.

As for recarrays -- are we that far from having them be robust and useful? In which case, why not keep them around, fix the few issues, but explicitly not try to extend them into more dataframe-like domains.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/25/2018 06:06 PM, Chris Barker wrote:
This change is a little different from what we discussed before. The change occurred because the old assignment behavior was dangerous, and was not doing what you thought. If you modify your dtype above, changing both 'f8' fields to 'f4', you will see you get very strange results: your array gets filled in with the values (1, ( 0., 1.875)).

Here's what happened: previously, numpy was *not* iterating your data as a sequence. Instead, if numpy did not find a tuple it would interpret the data as a raw buffer and copy the value byte-by-byte, ignoring endianness, casting, stride, etc. You can get even weirder results if you do `uv = uv.astype('i4')`, for example. It happened to work for you because ndarrays expose a buffer interface, and you were assigning using exactly the same type and endianness.

In 1.14 the fix was to disallow this 'buffer' assignment for structured arrays, since it was causing quite confusing bugs. Unstructured "void" arrays still do this though.
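For readers on Python 3, here is a minimal sketch of two ways to fill the nested dtype from the original question without relying on the old buffer behavior. The dtype and the `time` and `uv` arrays mirror the first message of the thread; `rows`, `full` and `full2` are names chosen for this example.

```python
import numpy as np

dt = np.dtype([('time', '<i8'),
               ('value', [('u', '<f8'), ('v', '<f8')])], align=True)
time = np.ones(4, dtype='i8')
uv = np.ones((4, 2))

# Option 1: one tuple per record, with a nested tuple for the nested field.
# A list is built explicitly because zip() is a lazy iterator on Python 3.
rows = [(t, tuple(w)) for t, w in zip(time, uv)]
full = np.array(rows, dtype=dt)

# Option 2: allocate first, then fill field by field; no tuples required.
full2 = np.zeros(len(time), dtype=dt)
full2['time'] = time
full2['value']['u'] = uv[:, 0]
full2['value']['v'] = uv[:, 1]
```

The second form goes through ordinary per-field assignment, so normal casting rules apply if `uv` has a different dtype than the fields.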
2) My solution was the best (only) one -- the only way to set a nested dtype like that is with tuples?
Right, our solution was to only allow assignment from tuples. We might be able to relax that for structured scalars, but for arrays I remember one consideration was to avoid confusion with array broadcasting: If you do

    >>> x = np.zeros(2, dtype='i4,i4')
    >>> x[:] = np.array([3, 4])
    >>> x
    array([(3, 3), (4, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])

it might be the opposite of what you expect. Compare to

    >>> x[:] = (3, 4)
    >>> x
    array([(3, 4), (3, 4)], dtype=[('f0', '<i4'), ('f1', '<i4')])
Good idea. I'll see if we can do it for 1.14.1.
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
On Jan 25, 2018, at 4:06 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
1) This is a known change with good reason?
OK, that’s a good reason!
A) improve the error message.
Good idea. I'll see if we can do it for 1.14.1.
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful. -CHB
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/25/2018 08:53 PM, Chris Barker - NOAA Federal wrote:
Two thoughts:

1. `totuple` makes most sense for 2d arrays. But what should it do for 1d or 3+d arrays? I suppose it could make the last dimension a tuple, so 1d arrays would give a list of tuples of size 1.
2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples? That would raise an exception if read back in by `np.array` with the same dtype.

These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions. If the goal is to help manipulate structured arrays, that submodule is appropriate since it already has other functions to manipulate fields in similar ways. What about calling it `pack_last_axis`?

    def pack_last_axis(arr, names=None):
        if arr.names:
            return arr
        names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
        return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

Then you could do:

    >>> pack_last_axis(uv).tolist()

to get a list of tuples.

Allan
![](https://secure.gravatar.com/avatar/209654202cde8ec709dee0a4d23c717d.jpg?s=120&d=mm&r=g)
Why is the list of tuples a useful thing to have in the first place? If the goal is to convert an array into a structured array, you can do that far more efficiently with:

    def make_tup_dtype(arr):
        """
        Attempt to make a type capable of viewing the last axis of an array,
        even if it is non-contiguous. Unfortunately `.view` doesn't allow us
        to use this dtype in that case, which needs a patch...
        """
        n_fields = arr.shape[-1]
        step = arr.strides[-1]
        descr = dict(names=[], formats=[], offsets=[], itemsize=step * n_fields)
        for i in range(n_fields):
            descr['names'].append('f{}'.format(i))
            descr['offsets'].append(step * i)
            descr['formats'].append(arr.dtype)
        return np.dtype(descr)

Used as:
Perhaps this should be provided by recfunctions (or maybe it already is, in a less rigid form?) Eric On Fri, 26 Jan 2018 at 10:48 Allan Haldane <allanhaldane@gmail.com> wrote:
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
On Fri, Jan 26, 2018 at 10:48 AM, Allan Haldane <allanhaldane@gmail.com> wrote:
What do folks think about a totuple() method — even before this I’ve wanted that. But in this case, it seems particularly useful.
I was thinking it would be exactly like .tolist() but with tuples -- so you'd get tuples all the way down (or is that turtles?) In this use case, it would have saved me the generator expression: (tuple(r) for r in arr). Not a huge deal, but it would be nice to not have to write that, and to have the looping be in C with no intermediate array generation.
2. structured array's .tolist() already returns a list of tuples. If we have a 2d structured array, would it add one more layer of tuples?
no -- why? it would return a tuple of tuples instead.
That would raise an exception if read back in by `np.array` with the same dtype.
Hmm -- indeed, if the top-level structure is a tuple, the array constructor gets confused. This works fine -- as it should:

    In [84]: new_full = np.array(full.tolist(), full.dtype)

But this does not:

    In [85]: new_full = np.array(tuple(full.tolist()), full.dtype)
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-85-c305063184ff> in <module>()
    ----> 1 new_full = np.array(tuple(full.tolist()), full.dtype)

    ValueError: could not assign tuple of length 4 to structure with 2 fields.

I was hoping it would dig down to the inner structures looking for a match to the dtype, rather than looking at the type of the top level. Oh well. So yeah, not sure where you would go from tuple to list -- probably at the bottom level, but that may not always be unambiguous.
These points make me think that instead of a `.totuple` method, this might be more suitable as a new function in np.lib.recfunctions.
I don't seem to have that module -- and I'm running 1.14.0 -- is this a new idea?
not sure what the idea is here -- in my example, I had a regular 2-d array, so no names:

    In [90]: pack_last_axis(uv)
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-90-a75ee44c8401> in <module>()
    ----> 1 pack_last_axis(uv)

    <ipython-input-89-cfbc76779d1f> in pack_last_axis(arr, names)
          1 def pack_last_axis(arr, names=None):
    ----> 2     if arr.names:
          3         return arr
          4     names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
          5     return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

    AttributeError: 'numpy.ndarray' object has no attribute 'names'

So maybe you meant something like:

    In [95]: def pack_last_axis(arr, names=None):
        ...:     try:
        ...:         arr.names
        ...:         return arr
        ...:     except AttributeError:
        ...:         names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
        ...:         return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

which does work, but seems like a convoluted way to get tuples! However, I didn't actually need tuples, I needed something I could pack into a structured array, and this does work, without the tolist:

    full = np.array(zip(time, pack_last_axis(uv)), dtype=dt)

So maybe that is the way to go. I'm not sure I'd have thought to look for this function, but what can you do?

Thanks for your attention to this,

-CHB
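A variant of `pack_last_axis` that checks `arr.dtype.names` instead of the non-existent `arr.names` might look like the sketch below. It is not the function as posted: the `np.ascontiguousarray` call and the field-wise assignment into `full['value']` are additions made for this example, and `dt` repeats the dtype from the first message.

```python
import numpy as np

def pack_last_axis(arr, names=None):
    """View the last axis of a plain array as the fields of a structured array."""
    if arr.dtype.names is not None:
        # Already structured: nothing to pack.
        return arr
    # The view needs a contiguous last axis; this copies only if necessary.
    arr = np.ascontiguousarray(arr)
    names = names or ['f{}'.format(i) for i in range(arr.shape[-1])]
    return arr.view([(n, arr.dtype) for n in names]).squeeze(-1)

uv = np.ones((4, 2))
packed = pack_last_axis(uv, names=['u', 'v'])

# Fill the nested field of the target array directly from the packed view.
dt = np.dtype([('time', '<i8'),
               ('value', [('u', '<f8'), ('v', '<f8')])], align=True)
full = np.zeros(4, dtype=dt)
full['time'] = 1
full['value'] = packed
```

Assigning one structured array to another matches fields by position, so the names passed to `pack_last_axis` matter for readability rather than correctness here.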
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/26/2018 03:38 PM, Chris Barker wrote:
As I remember, numpy has some fairly convoluted code for array creation which tries to make sense of various nested lists/tuples/ndarray combinations. It makes a difference for structured arrays and object arrays. I don't remember the details right now, but I know in some cases the rule is "If it's a Python list, recurse, otherwise assume it is an object array". While numpy does try to be lenient, I think we should guide the user to assume that if they want to specify a structured element, they should only use a tuple or a structured scalar, and if they want to specify a new dimension of elements, they should use a list. I expect fewer headaches that way.
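The "tuple for an element, list for a dimension" guideline can be sketched as follows. The two-field dtype `pair` and the variable names are made up for the illustration.

```python
import numpy as np

pair = np.dtype([('a', 'i4'), ('b', 'i4')])

# A list of tuples: the list is a dimension, each tuple is one record.
one_d = np.array([(1, 2), (3, 4)], dtype=pair)

# Nested lists add dimensions; the innermost tuples are still the records.
two_d = np.array([[(1, 2)], [(3, 4)]], dtype=pair)

# A tuple of records is turned into a dimension by converting it to a list.
records = ((1, 2), (3, 4))
from_tuple = np.array(list(records), dtype=pair)
```

Here `one_d` has shape `(2,)` and `two_d` has shape `(2, 1)`; in both cases every tuple fills exactly one structured element.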
Sorry, I didn't specify it correctly. It is "numpy.lib.recfunctions". It is actually quite old, but has never been officially documented. I think that is because it has been considered "provisional" for a long time. See https://github.com/numpy/numpy/issues/5008 https://github.com/numpy/numpy/issues/2805 I still hesitate to make it more official now, since I'm not sure that structured arrays are yet bug-free enough to encourage more complex uses. Also, the functions in that module encourage "pandas-like" use of structured arrays, but I'm not sure they should be used that way. I've been thinking they should be primarily used for binary interfaces with/to numpy, eg to talk to C programs or to read complicated binary files.
Right, that was my feeling: That we didn't really need `.totuple`, what we actually wanted is a special function for packing a nonstructured-array as a structured-array.
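Later NumPy releases (1.16 onward, to my knowledge) provide such a packing function as `numpy.lib.recfunctions.unstructured_to_structured`. A brief sketch, assuming one of those releases; the field names `u` and `v` follow the example from the start of the thread.

```python
import numpy as np
import numpy.lib.recfunctions as rf

uv = np.ones((4, 2))
uv_dtype = np.dtype([('u', 'f8'), ('v', 'f8')])

# Pack the last axis of a plain array into the fields of a structured array.
packed = rf.unstructured_to_structured(uv, dtype=uv_dtype)

# And back again to a plain (4, 2) float array.
unpacked = rf.structured_to_unstructured(packed)
```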
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
On Fri, Jan 26, 2018 at 2:35 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
that's at least explainable, and the "try to figure out what the user means" array creation is pretty much an impossible problem, so what we've got is probably about as good as it can get.
thanks -- found it.
that's my use-case. And I agree -- if you really want to do that kind of thing, pandas is the way to go. I thought recarrays were pretty cool back in the day, but pandas is a much better option. So I pretty much only use structured arrays for data exchange with C code.... -CHB
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Fri, Jan 26, 2018 at 5:48 PM, Chris Barker <chris.barker@noaa.gov> wrote:
My impression is that this turns into a "deprecate recarrays and supporting recfunctions" issue. recfunctions and the associated functions from matplotlib.mlab were explicitly designed for using structured dtypes as dataframe_like. (old question: does numpy have a sort_rows function now without detouring to structured dtype views?) Josef <all code needs to be rewritten every 5 to 10 years.>
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/26/2018 06:01 PM, josef.pktd@gmail.com wrote:
No, that's still the way to do it. *should* we have any dataframe-like functionality in numpy? We get requests every once in a while about how to sort rows, or about adding a "groupby" function. I myself have used recarrays in a dataframe-like way, when I wanted a quick multiple-array object that supported numpy indexing. So there is some demand to have minimal "dataframe-like" behavior in numpy itself. recarrays play part of this role currently, though imperfectly due to padding and cache issues. I think I'm comfortable with supporting some minor use of structured/recarrays as dataframe-like, with a warning in docs that the user should really look at pandas/xarray, and that structured arrays are primarily for data exchange. (If we want to dream, maybe one day we should make a minimal multiple-array container class. I imagine it would look pretty similar to recarray, but stored as a set of arrays instead of a structured array. But maybe recarrays are good enough, and let's not reimplement pandas either.) Allan
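For reference, the structured-view detour for sorting rows that the question above alludes to can be sketched like this. `sort_rows` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def sort_rows(arr):
    """Sort the rows of a 2-D array lexicographically (hypothetical helper)."""
    arr = np.ascontiguousarray(arr)
    # View each row as one structured record; unnamed fields become f0, f1, ...
    row_dtype = [('', arr.dtype)] * arr.shape[1]
    as_records = arr.view(row_dtype).ravel()
    # np.sort compares structured records field by field, left to right.
    return np.sort(as_records).view(arr.dtype).reshape(arr.shape)

a = np.array([[3, 1], [1, 2], [1, 1]])
sorted_a = sort_rows(a)
```

The helper returns a sorted copy and leaves `a` untouched; it relies on all columns sharing one dtype, which is what makes the view valid.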
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
On Sat, Jan 27, 2018 at 8:50 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Well, I think we should either:

deprecate recarrays -- i.e. explicitly not support DataFrame-like functionality in numpy, keeping only the data-exchange functionality as maintained.

or

Properly support it -- which doesn't mean re-implementing Pandas or xarray, but it would mean addressing any bug-like issues like not dealing properly with padding.

Personally, I don't need/want it enough to contribute, but if someone does, great. This reminds me a bit of the old numpy.Matrix issue -- it was ALMOST there, but not quite, with issues, and there was essentially no overlap between the people that wanted it and the people that had the time and skills to really make it work.

(If we want to dream, maybe one day we should make a minimal multiple-array
Exactly -- we really don't need to re-implement Pandas.... (except its CSV reading capability :-) ) -CHB
![](https://secure.gravatar.com/avatar/209654202cde8ec709dee0a4d23c717d.jpg?s=120&d=mm&r=g)
I think that there's a lot of confusion going around about recarrays vs structured arrays.

[`recarray`](https://github.com/numpy/numpy/blob/v1.13.0/numpy/core/records.py)s are a wrapper around structured arrays that provide:

* Attribute access to fields as `arr.field` in addition to the normal `arr['field']`
* Automatic datatype-guessing for nested lists of tuples (which needs a little work, but seems like a justifiable feature)
* An undocumented `field` method that behaves like the 1.14 indexing behavior (!)

Meanwhile, `recfunctions` is a collection of functions that work on normal structured arrays - so is misleadingly named. The only link to recarrays is that most of the functions have an `asrecarray` parameter which applies `.view(recarray)` to the result.
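A short sketch of how thin that wrapper is: viewing a structured array as a `recarray` only changes the Python type, and both objects share the same memory. The array contents and field names are made up for the example.

```python
import numpy as np

arr = np.array([(1, 2.0), (3, 4.0)], dtype=[('x', 'i4'), ('y', 'f8')])

# Viewing as a recarray copies nothing; it only changes the Python type.
rec = arr.view(np.recarray)

# Attribute access and item access return the same field data.
same = np.array_equal(rec.x, arr['x'])

# Writes through the recarray are visible in the original array.
rec.y[0] = 10.0
```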
deprecate recarrays
Given how thin an abstraction they are over structured arrays, I don't think you mean this. Are you advocating for deprecating structured arrays entirely, or just deprecating recfunctions? Eric On Mon, 29 Jan 2018 at 09:39 Chris Barker <chris.barker@noaa.gov> wrote:
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 1:22 PM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
First, statsmodels is in the pandas camp for dataframes, so I don't have any vested interest in recarrays/structured dtypes anymore.

What I meant was that structured dtypes with implicit (hidden?) padding become unintuitive for the recarray/dataframe usecase. (At least I won't try to update my intuition about having extra things in there that are not specified by the main structured dtype.) Also the dataframe_like usage of structured dtypes doesn't seem to be much under consideration anymore.

So, my **impression** is that the recent changes make the recarray/dataframe usecase for structured dtypes more difficult. Given that there is pandas, xarray, dask and more, numpy could as well drop any pretense of supporting dataframe_likes. Or, adjust the recfunctions so we can still work dataframe_like with structured dtypes/recarrays/recfunctions.

Josef
![](https://secure.gravatar.com/avatar/d9ac9213ada4a807322f99081296784b.jpg?s=120&d=mm&r=g)
On Mon, 29 Jan 2018 14:10:56 -0500, josef.pktd@gmail.com wrote:
I haven't been following the duckarray discussion carefully, but could this be an opportunity for a dataframe protocol, so that we can have libraries ingest structured arrays, record arrays, pandas dataframes, etc. without too much specialized code? Stéfan
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 2:55 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
AFAIU while not being in the data handling area, pandas defines the interface and other libraries provide pandas compatible interfaces or implementations. statsmodels currently still has recarray support and usage. In some interfaces we support pandas, recarrays and plain arrays, or anything where asarray works correctly. But recarrays became messy to support, one rewrite of some functions last year converts recarrays to pandas, does the manipulation and then converts back to recarrays. Also we need to adjust our recarray usage with new numpy versions. But there is no real benefit because I doubt that statsmodels still has any recarray/structured dtype users. So, we only have to remove our own uses in the datasets and unit tests. Josef
![](https://secure.gravatar.com/avatar/697900d3a29858ea20cc109a2aee0af6.jpg?s=120&d=mm&r=g)
I <3 structured arrays. I love the fact that I can access data by row and then by fieldname, or vice versa. There are times when I need to pass just a column into a function, and there are times when I need to process things row by row. Yes, pandas is nice if you want the specialized indexing features, but it becomes a bear to deal with if all you want is normal indexing, or even the ability to easily loop over the dataset. Cheers! Ben Root On Mon, Jan 29, 2018 at 3:24 PM, <josef.pktd@gmail.com> wrote:
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 3:44 PM, Benjamin Root <ben.v.root@gmail.com> wrote:
I don't think there is a doubt that structured arrays, arrays with structured dtypes, are a useful container. The question is whether they should be more or the foundation for more. For example, computing a mean, or reduce operation, over numeric element ("columns"). Before padded views it was possible to index by selecting the relevant "columns" and view them as standard array. With padded views that breaks and AFAICS, there is no way in numpy 1.14.0 to compute a mean of some "columns". (I don't have numpy 1.14 to try or find a workaround, like maybe looping over all relevant columns.) Josef
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/29/2018 04:02 PM, josef.pktd@gmail.com wrote:
Just to clarify, structured types have always had padding bytes, that isn't new. What *is* new (which we are pushing to 1.15, I think) is that it may be somewhat more common to end up with padding than before, and only if you are specifically using multi-field indexing, which is a fairly specialized case. I think recfunctions already account properly for padding bytes. Except for the bug in #8100, which we will fix, padding-bytes in recarrays are more or less invisible to a non-expert who only cares about dataframe-like behavior. In other words, padding is no obstacle at all to computing a mean over a column, and single-field indexes in 1.15 behave identically as before. The only thing that will change in 1.15 is multi-field indexing, and it has never been possible to compute a mean (or any binary operation) on multiple fields. Allan
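To make the padding point concrete, the sketch below shows what multi-field indexing returns once it yields views, and how `repack_fields` removes the padding. It assumes a NumPy release with the view behavior and with `numpy.lib.recfunctions.repack_fields` (1.16 or later, as far as I know); on older releases the itemsize of the multi-field result differs.

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# Single-field indexing is unchanged: a plain float array.
b_mean = a['b'].mean()

# Multi-field indexing returns a view that keeps the original itemsize,
# so the bytes of the dropped field 'a' remain as padding.
sub = a[['b', 'c']]
padded_itemsize = sub.dtype.itemsize

# repack_fields copies the data into a layout without that padding.
packed = rf.repack_fields(sub)
packed_itemsize = packed.dtype.itemsize
```

After repacking, the two remaining `f8` fields occupy 16 bytes per record instead of the 24 bytes of the padded view.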
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 4:11 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
from the example in the other thread

    a[['b', 'c']].view(('f8', 2)).mean(0)

(from the statsmodels usecase:
read csv with genfromtxt to get recarray or structured array
select/index the numeric columns
view them as standard array
do whatever we can do with standard numpy arrays
)

Josef
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/29/2018 05:59 PM, josef.pktd@gmail.com wrote:
Oh ok, I misunderstood. I see your point: a mean over fields is more difficult than before.
The answer may be that "numpy has never had a way to do that", even if in a few special cases you might hack a workaround using views. That's what your example seems like to me. It uses an explicit view, which is an "expert" feature since views depend on the exact memory layout and binary representation of the array. Your example only works if the two fields have exactly the same dtype as each other and as the final dtype, and evidently breaks if there is byte padding for any reason.

Pandas can do row means without these problems:

    >>> pd.DataFrame(np.ones(10, dtype='i8,f8')).mean(axis=0)

Numpy is missing this functionality, so you or whoever wrote that example figured out a fragile workaround using views.

I suggest that if we want to allow either means over fields, or conversion of a n-D structured array to an n+1-D regular ndarray, we should add a dedicated function to do so in numpy.lib.recfunctions which does not depend on the binary representation of the array.

Allan
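One way to get a plain array without depending on the binary layout is to copy field by field, which works on any NumPy version. The helper below, `fields_to_2d`, is hypothetical and only meant as a sketch of that idea.

```python
import numpy as np

def fields_to_2d(arr, names):
    """Stack the named fields of a 1-D structured array as columns.

    Hypothetical helper: it copies field by field, so it does not depend on
    padding, field order, or the fields sharing one dtype.
    """
    return np.column_stack([arr[name] for name in names])

a = np.ones(10, dtype=[('a', 'i8'), ('b', 'f4'), ('c', 'f8')])
bc = fields_to_2d(a, ['b', 'c'])

# Mean of each selected field, computed on an ordinary 2-D array.
col_means = bc.mean(axis=0)
```

The price is one copy per call and a common result dtype chosen by normal promotion (here `f4` and `f8` give `f8`).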
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Mon, Jan 29, 2018 at 10:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Once upon a time (*) this wasn't fragile but the only and recommended way. Because dtypes were low level with clear memory layout and stayed that way, it was easy to check item size or whatever and get different views on it. e.g. https://mail.scipy.org/pipermail/numpy-discussion/2008-December/039340.html (*) pre-pandas, pre-stackoverflow on the mailing lists which was for me roughly 2008 to 2012 but a late thread https://mail.scipy.org/pipermail/numpy-discussion/2015-October/074014.html "What is now the recommended way of converting structured dtypes/recarrays to ndarrays?"
I don't really want to defend an obsolete (?) usecase of structured dtypes. However, I think there should be a decision about whether dataframe-like usage of structured dtypes, directly or through higher-level classes or functions, is still supported, instead of slowly and silently (*) removing the foundation for this use case: either support this usage or say you will be dropping it.

(*) I didn't read the details of the release notes

And another footnote about obsolete: Given that I'm the only one arguing about the dataframe_like usecase of recarrays and structured dtypes, I think they are dead for this specific usecase and only my inertia and conservativeness kept them alive in statsmodels.

Josef
![](https://secure.gravatar.com/avatar/209654202cde8ec709dee0a4d23c717d.jpg?s=120&d=mm&r=g)
Because dtypes were low level with clear memory layout and stayed that way

Dtypes have supported padded and out-of-order fields since at least 2005 (v0.8.4) <https://github.com/numpy/numpy/blob/4772f10191f87a3446f4862de6d4b953e0dd95ff...>, and I would guess that the memory layout has not changed since. The house has always been made out of glass, it just didn’t look fragile until we showed people where the stones were.

On Mon, 29 Jan 2018 at 20:51 <josef.pktd@gmail.com> wrote:
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Jan 30, 2018 at 3:24 AM, Eric Wieser <wieser.eric+numpy@gmail.com> wrote:
Even so, I don't remember any problems with it. There might have been stones on the side streets and alleys, but 1.14.0 puts a big padded stone right in the front of the drive way. (Maybe only the solarium was made out of glass, now it's also the billiard room.) (I never had to learn about padding and I don't remember having any related problems getting statsmodels through Debian testing on various machine types.) Josef
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/29/2018 11:50 PM, josef.pktd@gmail.com wrote:
It's a bit of a stretch to say that we are "silently" dropping support for dataframe-like use of structured arrays. First, we still allow pretty much all dataframe-like use we have supported since numpy 1.7, limited as it may be. We are really only dropping one very specialized, expert use involving an explicit view, which I still have doubts was ever more than a hack. That 2008 mailing list message didn't involve multi-field indexing, which didn't exist then (only introduced in 2009), and we have wanted to make them views (not copies) since their inception. Second, I don't think we are doing so silently: We have warned about this in release notes since numpy 1.7 in 2012/2013, and it gets mention in most releases since then. We have also raised FutureWarnings about it since 1.7. Unfortunately we missed warning in your specific case for a while, but we corrected this in 1.12 so you should have seen FutureWarnings since then. I don't feel the need to officially declare that we are dropping support for dataframe-like use of structured arrays. It's unclear where that use ends and other uses of structured arrays begin. I think updating the docs to warn that pandas/dask may be a better choice is enough, as I've been doing, and then users can decide for themselves. There is still the question about whether we should make numpy.lib.recfunctions more official. I don't have a strong opinion. I suppose it would be good to add a section to the structured array docs which lists those methods and says something like "the submodule numpy.lib.recfunctions provides minimal functionality to split, combine, and manipulate structured datatypes and arrays. In most cases, we strongly recommend users use a dedicated module such as pandas/xarray/dask instead of these methods, but they are provided for occasional convenience." Allan
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Jan 30, 2018 at 12:28 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
The 2008 mailing list thread introduced me to the working with views on structured arrays as the ONLY way to switch between structured and homogenous dtypes (if the underlying item size was homogeneous). The new stats.models started in 2009.
If I see warnings in the test suite about getting a view instead of a copy from numpy, then the only/main consequence I think about is whether I need to watch out for in-place modification. I didn't expect that the follow-up computation would change, and that it's a padded view and not a view on the selected memory. However, I just checked and padding is mentioned in the 1.12 release notes (which I had never read before).

AFAICS, one problem is that the padded view didn't come with matching downstream usage support: the pack function as mentioned, an alternative way to convert to a standard ndarray, copy doesn't get rid of the padding, and so on.

E.g. another mailing list thread I just found with the same problem: http://numpy-discussion.10968.n7.nabble.com/view-of-recarray-issue-td32001.h...

quoting Ralf:

Question: is that really the recommended way to get an (N, 2) size float array from two columns of a larger record array? If so, why isn't there a better way? If you'd want to write to that (N, 2) array you have to append a copy, making it even uglier. Also, then there really should be tests for views in test_records.py.

This "better way" never showed up, AFAIK. And it looks like we came back to this problem every few years.

Josef
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Jan 30, 2018 at 1:33 PM, <josef.pktd@gmail.com> wrote:
one final historical note (once upon a time users relied on cookbooks) http://scipy-cookbook.readthedocs.io/items/Recarray.html#Converting-to-regular-arrays-and-reshaping 2010-03-09 (last modified), 2008-06-27 (created) which I assume is broken in numpy 1.14.0
![](https://secure.gravatar.com/avatar/ad13088a623822caf74e635a68a55eae.jpg?s=120&d=mm&r=g)
On Tue, Jan 30, 2018 at 2:42 PM, <josef.pktd@gmail.com> wrote:
and a final grumpy note https://docs.scipy.org/doc/numpy-1.14.0/release.html#multiple-field-indexing... " which will affect code such as" = "which will break your code without offering an alternative" Josef <back to regular scheduled topics>
![](https://secure.gravatar.com/avatar/71832763447894e7c7f3f64bfd19c13f.jpg?s=120&d=mm&r=g)
On 01/30/2018 01:33 PM, josef.pktd@gmail.com wrote:
Since we are at least pushing off this change to a later release (1.15?), we have some time to prepare/catch up. What can we add to numpy.lib.recfunctions to make the multi-field copy->view change smoother? We have discussed at least two functions:

* repack_fields - rearranges the memory layout of a structured array to add/remove padding between fields
* structured_to_unstructured - turns an n-D structured array into an (n+1)-D unstructured ndarray, whose dtype is the highest common type of all the fields. May want the inverse function too.

We might also consider

* apply_along_fields(arr, method) - applies the method along the "field" axis, equivalent to something like method(structured_to_unstructured(arr), axis=-1)

I think these are pretty minimal and shouldn't be too hard to implement.

Allan
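For readers following along later, here is a rough sketch of how the three proposed helpers can be used. It assumes a NumPy release in which they exist in `numpy.lib.recfunctions` under these names (they were only proposals at the time of this message), and it uses the argument order `apply_along_fields(func, arr)` from the later draft rather than the `(arr, method)` order written above. The array `arr` is made up.

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Illustrative structured array with mixed numeric fields.
arr = np.array([(1, 2.0, 3.0), (4, 5.0, 6.0)],
               dtype=[('n', 'i4'), ('x', 'f8'), ('y', 'f8')])

# repack_fields: remove the padding left behind by multi-field indexing.
xy = rf.repack_fields(arr[['x', 'y']])

# structured_to_unstructured: 1-D structured -> 2-D plain ndarray whose
# dtype is the common type of the fields.
plain = rf.structured_to_unstructured(arr)

# The inverse: rebuild a structured array from the plain one.
back = rf.unstructured_to_structured(plain, dtype=arr.dtype)

# apply_along_fields: reduce across the fields of each record.
row_means = rf.apply_along_fields(np.mean, arr)

print(xy.dtype.itemsize, plain.shape, plain.dtype, row_means)
```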
On Tue, Jan 30, 2018 at 3:21 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
The only sticky point with statsmodels is to have an equivalent of a[['b', 'c']].view(('f8', 2)). Highest common dtype might be object; the main use case for this is to select some elements of a specific dtype and then use them as a standard, homogeneous ndarray. In our case and other cases that I have seen, it is mainly to select a subset of the floating point numbers. Another case of this might be to combine two strings into one, a[['b', 'c']].view(('S8')) if b is S5 and c is S3, but I don't think I used this in serious code.

For the inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.

Browsing through old mailing list threads, I saw that adding multiple fields or concatenating two arrays with structured dtypes into an array with a single combined dtype was missing, and I guess it still is. (IIRC this is the use case where we now take the pandas detour in statsmodels.)
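The "select a subset of the floating point numbers" use case could look roughly like the sketch below. `float_block` is a hypothetical helper written for this illustration, not a NumPy or statsmodels function, and it assumes `structured_to_unstructured` is available in `numpy.lib.recfunctions`.

```python
import numpy as np
import numpy.lib.recfunctions as rf

def float_block(arr):
    """Hypothetical helper: return the names of the floating-point fields
    of a structured array and those fields as a plain 2-D ndarray."""
    names = [name for name in arr.dtype.names
             if np.issubdtype(arr.dtype[name], np.floating)]
    return names, rf.structured_to_unstructured(arr[names])

# Made-up data mixing an integer id, two floats, and a byte string.
a = np.array([(1, 0.5, 1.5, b'ab'), (2, 2.5, 3.5, b'cd')],
             dtype=[('id', 'i8'), ('b', 'f8'), ('c', 'f8'), ('tag', 'S2')])

names, block = float_block(a)
print(names, block.dtype, block.shape)
```

Filtering on the field dtype first keeps the common type from degrading to something unhelpful when strings or objects sit next to the floats.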
If this works on a padded view of an existing array, then this would be an improvement over the current version of having to extract and copy the relevant fields of an existing structured dtype or loop over different numeric dtypes, ints, floats. In general there will need to be a way to apply `method` only to selected columns, or columns of a matching dtype. (e.g. We don't want the sum or mean of a string.) (e.g. we use ptp() on numeric fields to check if there is already a constant column in the array or dataframe)
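As a sketch of the "only numeric columns" point, the constant-column check mentioned above can be written per field without converting the whole array. `constant_numeric_fields` is a hypothetical name used only here, and the data is invented.

```python
import numpy as np

def constant_numeric_fields(arr):
    """Hypothetical helper: names of numeric fields whose values never vary.
    Non-numeric fields (strings, objects) are skipped entirely."""
    numeric = [name for name in arr.dtype.names
               if np.issubdtype(arr.dtype[name], np.number)]
    # ptp (max - min) is zero exactly when the field is constant.
    return [name for name in numeric if np.ptp(arr[name]) == 0]

data = np.array([(1.0, 0.3, b'a'), (1.0, 0.7, b'b'), (1.0, 0.1, b'c')],
                dtype=[('const', 'f8'), ('x', 'f8'), ('label', 'S1')])

print(constant_numeric_fields(data))
```

Single-field access such as `arr[name]` is a strided view, so this loop does not copy any data.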
> I think these are pretty minimal and shouldn't be too hard to implement.
AFAICS, it would cover the statsmodels usage. Josef
On 01/30/2018 04:54 PM, josef.pktd@gmail.com wrote:
I implemented and put up a draft of these functions in https://github.com/numpy/numpy/pull/10411

I think they satisfy all your cases: code like

    >>> a = np.ones(3, dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])
    >>> a[['b', 'c']].view(('f8', 2))

becomes:

    >>> import numpy.lib.recfunctions as rf
    >>> rf.structured_to_unstructured(a[['b', 'c']])
    array([[1., 1.],
           [1., 1.],
           [1., 1.]])

The highest common dtype is usually not "Object", since I use `np.result_type` to determine the output type. So two fields of 'S5' and 'S3' result in an 'S5' array.
> for inverse function: I guess it is still possible to view any standard homogeneous ndarray with a structured dtype as long as the itemsize matches.
The inverse is implemented too. And it even supports varied field dtypes, nested fields, and subarrays, as you can see in the docstring examples.
Means over selected columns are accounted for using multi-field indexing. For example:

    >>> b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
    ...              dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])
    >>> rf.apply_along_fields(np.mean, b)
    array([ 2.66666667,  5.33333333,  8.66666667, 11.        ])
    >>> rf.apply_along_fields(np.mean, b[['x', 'z']])
    array([ 3. ,  5.5,  9. , 11. ])

This is unaffected by the 1.14 to 1.15 changes.

Allan
On Tue, Jan 30, 2018 at 7:33 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
Comments based on reading the last commit
structured_to_unstructured looks good to me
Actually, I would have expected apply_along_columns, i.e. reduce over all observations for each field. This might need an axis argument. However, in the current form it is less practical than doing it ourselves with structured_to_unstructured, because it makes a copy of all elements each time. E.g. rf.apply_along_fields(np.mean, b[['x', 'z']]) followed by rf.apply_along_fields(np.std, b[['x', 'z']]) would do the same structured_to_unstructured copy of all array elements twice.

Josef
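The "do it ourselves" alternative can be sketched as follows: convert once, then reuse the plain array for every statistic, along either axis. This assumes `structured_to_unstructured` as in the draft PR; `b` is the example array from Allan's message.

```python
import numpy as np
import numpy.lib.recfunctions as rf

b = np.array([(1, 2, 5), (4, 5, 7), (7, 8, 11), (10, 11, 12)],
             dtype=[('x', 'i4'), ('y', 'f4'), ('z', 'f8')])

# One conversion of the selected fields, shared by all later computations.
xz = rf.structured_to_unstructured(b[['x', 'z']])

# Across fields, per record (what apply_along_fields does).
row_mean = xz.mean(axis=-1)
row_std = xz.std(axis=-1)

# Across records, per field (the "apply_along_columns" reading).
col_mean = xz.mean(axis=0)

print(row_mean, row_std, col_mean)
```

With the plain array in hand, the axis argument comes for free from the ordinary ndarray methods.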
On Mon, Jan 29, 2018 at 7:44 PM, Allan Haldane <allanhaldane@gmail.com> wrote:
IIUC, the core use-case of structured dtypes is binary compatibility with external systems (arrays of C structs, mostly) -- at least that's how I use them :-)

In which case, "conversion of an n-D structured array to an n+1-D regular ndarray" is an important feature -- actually even more important if you don't use recarrays.

So yes, let's have a utility to make that easy.

As for recarrays -- are we that far from having them be robust and useful? In which case, why not keep them around, fix the few issues, but explicitly not try to extend them into more dataframe-like domains?

-CHB

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R
(206) 526-6959 voice
7600 Sand Point Way NE
(206) 526-6329 fax
Seattle, WA 98115
(206) 526-6317 main reception
Chris.Barker@noaa.gov
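A minimal sketch of that use case, under stated assumptions: the record layout (`int64 time; double u; double v;`) is invented for illustration, the bytes are produced with the standard `struct` module to stand in for an external system, and `structured_to_unstructured` is assumed to be available in `numpy.lib.recfunctions`.

```python
import struct
import numpy as np
import numpy.lib.recfunctions as rf

# Stand-in for bytes received from a C program: two packed little-endian
# records of the hypothetical layout {int64 time; double u; double v;}.
raw = struct.pack('<qdd', 1, 0.5, 1.5) + struct.pack('<qdd', 2, 2.5, 3.5)

# A structured dtype that mirrors the struct layout.
dt = np.dtype([('time', '<i8'), ('u', '<f8'), ('v', '<f8')])
rec = np.frombuffer(raw, dtype=dt)

# 1-D structured -> 2-D regular float array holding just the velocities.
uv = rf.structured_to_unstructured(rec[['u', 'v']])

print(rec['time'], uv)
```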
participants (7)
- Allan Haldane
- Benjamin Root
- Chris Barker
- Chris Barker - NOAA Federal
- Eric Wieser
- josef.pktd@gmail.com
- Stefan van der Walt