Mailman 3 Record arrays - NumPy-Discussion

Record arrays

Stéfan van der Walt

June 26, 2008

12:13 p.m.

Hi all, I am documenting `recarray`, and have a question: Is its use still recommended, or has it been superseded by fancy data-types? Regards Stéfan


Christopher Hanley

June 2008

12:31 p.m.

Stéfan van der Walt wrote:

...

Hi all,

I am documenting `recarray`, and have a question:

Is its use still recommended, or has it been superseded by fancy data-types?

Regards Stéfan _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion

I would say that it has been superseded by fancy data-types. It is necessary that recarray remain for backward compatibility with large, legacy numarray projects. However, I would encourage all new code be written with the native object arrays.

Chris

-- Christopher Hanley Systems Software Engineer Space Telescope Science Institute 3700 San Martin Drive Baltimore MD, 21218 (410) 338-4338

Travis E. Oliphant

12:38 p.m.

Stéfan van der Walt wrote:

...

Hi all,

I am documenting `recarray`, and have a question:

Is its use still recommended, or has it been superseded by fancy data-types?

I rarely recommend its use (but some people do like attribute access to the fields). It is wrong, however, to say that recarray has been superseded by fancy data types because fancy data types have existed for as long as recarrays.

I believe pyfits uses them quite a bit, and so they deserve to be documented.

-Travis

Christopher Hanley

12:45 p.m.

Travis E. Oliphant wrote:

...

Stéfan van der Walt wrote:

...
Hi all,

I am documenting `recarray`, and have a question:

Is its use still recommended, or has it been superseded by fancy data-types?

I rarely recommend its use (but some people do like attribute access to the fields). It is wrong, however, to say that recarray has been superseded by fancy data types because fancy data types have existed for as long as recarrays.

I believe pyfits uses them quite a bit, and so they deserve to be documented.

-Travis


Travis is correct. PyFITS uses recarrays quite extensively. It was the large, legacy numarray project I was referring to. ;-)

I had forgotten about the attribute access. I know a number of people who use that feature in conjunction with matplotlib for plotting data in tables, especially during interactive use.

Chris

-- Christopher Hanley Systems Software Engineer Space Telescope Science Institute 3700 San Martin Drive Baltimore MD, 21218 (410) 338-4338

Perry Greenfield

7:51 p.m.

Hi Chris,

Didn't we remove all dependence on recarray? I could have sworn we did that.

Perry

On Jun 26, 2008, at 12:45 PM, Christopher Hanley wrote:

...

Travis E. Oliphant wrote:

...
Stéfan van der Walt wrote:

...
Hi all,

I am documenting `recarray`, and have a question:

Is its use still recommended, or has it been superseded by fancy data-types?

I rarely recommend its use (but some people do like attribute access to the fields). It is wrong, however, to say that recarray has been superseded by fancy data types because fancy data types have existed for as long as recarrays.

I believe pyfits uses them quite a bit, and so they deserve to be documented.

-Travis


Travis is correct. PyFITS uses recarrays quite extensively. It was the large, legacy numarray project I was referring to. ;-)

I had forgotten about the attribute access. I know a number of people who use that feature in conjunction with matplotlib for plotting data in tables, especially during interactive use.

Chris

-- Christopher Hanley Systems Software Engineer Space Telescope Science Institute 3700 San Martin Drive Baltimore MD, 21218 (410) 338-4338

Christopher Hanley

8:41 p.m.

Perry Greenfield wrote:

...

Hi Chris,

Didn't we remove all dependence on recarray? I could have sworn we did that.

Perry

Perry,

You are right. We no longer import the recarray module from numpy.

Chris

-- Christopher Hanley Systems Software Engineer Space Telescope Science Institute 3700 San Martin Drive Baltimore MD, 21218 (410) 338-4338

John Hunter

12:48 p.m.

On Thu, Jun 26, 2008 at 11:38 AM, Travis E. Oliphant <oliphant@enthought.com> wrote:

...

Stéfan van der Walt wrote:

...
Hi all,

I am documenting `recarray`, and have a question:

Is its use still recommended, or has it been superseded by fancy data-types?

I rarely recommend its use (but some people do like attribute access to the fields). It is wrong, however, to say that recarray has been superseded by fancy data types because fancy data types have existed for as long as recarrays.

I personally think they are the best thing since sliced bread, and everyone here who uses them becomes immediately addicted to them. I would like to see better support for them, especially making the attrs exposed to dir so tab completion would work.

People in the financial/business world work with spreadsheet data a lot, and record arrays are the natural data structure to represent tabular, heterogeneous data. If you work with this data all day, you save a lot of ugly keystrokes doing r.date rather than r['date'], and the code is prettier in my opinion.

JDH

Gael Varoquaux

3:34 p.m.

On Thu, Jun 26, 2008 at 11:48:06AM -0500, John Hunter wrote:

...

I personally think they are the best thing since sliced bread, and everyone here who uses them becomes immediately addicted to them. I would like to see better support for them, especially making the attrs exposed to dir so tab completion would work.

...

People in the financial/business world work with spreadsheet data a lot, and record arrays are the natural data structure to represent tabular, heterogeneous data. If you work with this data all day, you save a lot of ugly keystrokes doing r.date rather than r['date'], and the code is prettier in my opinion.

I am +1 on all that.

Gael

Dan Yamins

4:13 p.m.

On Thu, Jun 26, 2008 at 3:34 PM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:

...

On Thu, Jun 26, 2008 at 11:48:06AM -0500, John Hunter wrote:

...
I personally think they are the best thing since sliced bread, and everyone here who uses them becomes immediately addicted to them. I would like to see better support for them, especially making the attrs exposed to dir so tab completion would work.

...
People in the financial/business world work with spreadsheet data a lot, and record arrays are the natural data structure to represent tabular, heterogeneous data. If you work with this data all day, you save a lot of ugly keystrokes doing r.date rather than r['date'], and the code is prettier in my opinion.

I am +1 on all that.

I also completely second this. I use them all the time -- for finance data as well as biological/genomics data. It is essential for these applications to have spreadsheet-like objects that can have mixed types and from which good numpy numerical arrays can be extracted when necessary. I hope to continue having access to them or something like them. I also hope that they will be better documented, since not only do I use them all the time, I'm hoping to teach their use to many more people whom I am training in spreadsheet-like data analysis.

(If they have some flaw I don't understand, it would be great if someone could explain it to me. And if there's something out there that fixes that flaw, I'd love to hear about it. But it seems to me at least that recarrays are very useful.)

Robert Kern

4:25 p.m.

On Thu, Jun 26, 2008 at 15:13, Dan Yamins <dyamins@gmail.com> wrote:

...

On Thu, Jun 26, 2008 at 3:34 PM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:

...
On Thu, Jun 26, 2008 at 11:48:06AM -0500, John Hunter wrote:

...
I personally think they are the best thing since sliced bread, and everyone here who uses them becomes immediately addicted to them. I would like to see better support for them, especially making the attrs exposed to dir so tab completion would work.

...
People in the financial/business world work with spreadsheet data a lot, and record arrays are the natural data structure to represent tabular, heterogeneous data. If you work with this data all day, you save a lot of ugly keystrokes doing r.date rather than r['date'], and the code is prettier in my opinion.

I am +1 on all that.

I also completely second this. I use them all the time -- for finance data as well as biological/genomics data. It is essential for these applications to have spreadsheet-like objects that can have mixed types and from which good numpy numerical arrays can be extracted when necessary. I hope to continue having access to them or something like them. I also hope that they will be better documented, since not only do I use them all the time, I'm hoping to teach their use to many more people whom I am training in spreadsheet-like data analysis.

(If they have some flaw I don't understand, it would be great if someone could explain it to me. And if there's something out there that fixes that flaw, I'd love to hear about it. But it seems to me at least that recarrays are very useful.)

Let's be clear, there are two very closely related things: recarrays and record arrays. Record arrays are just ndarrays with a complicated dtype. E.g.

In [1]: from numpy import *

In [2]: ones(3, dtype=dtype([('foo', int), ('bar', float)]))
Out[2]: array([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

In [3]: r = _

In [4]: r['foo']
Out[4]: array([1, 1, 1])

recarray is a subclass of ndarray that just adds attribute access to record arrays.

In [10]: r2 = r.view(recarray)

In [11]: r2
Out[11]: recarray([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

In [12]: r2.foo
Out[12]: array([1, 1, 1])

One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

Record arrays are fundamentally a part of numpy, and no one is even suggesting that they would go away. No one is seriously suggesting that we should remove recarray, but some of us hesitate to recommend its use over plain record arrays.

Does that clarify the discussion for you?

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
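For readers following along today, the distinction drawn above can be sketched as a standalone script. This is only an illustration: the explicit `np.int64`/`np.float64` field types and the write-through check are assumptions added here, not part of the original session.

```python
import numpy as np

# A record (structured) array: a plain ndarray with a compound dtype.
dt = np.dtype([("foo", np.int64), ("bar", np.float64)])
r = np.ones(3, dtype=dt)

# Fields are reached by name through indexing.
foo = r["foo"]

# recarray is an ndarray subclass. Viewing does not copy the data;
# it only adds attribute access to the fields.
r2 = r.view(np.recarray)

# Writing through the attribute form changes the original array too,
# because both objects share the same memory.
r2.foo[0] = 7
```

Since `r2` is a view, nothing is copied, so switching between the two forms is cheap; the cost Robert describes comes from the extra Python-level lookup on each access, not from the view itself.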

Gabriel Gellner

4:38 p.m.

...

Let's be clear, there are two very closely related things: recarrays and record arrays. Record arrays are just ndarrays with a complicated dtype. E.g.

In [1]: from numpy import *

In [2]: ones(3, dtype=dtype([('foo', int), ('bar', float)]))
Out[2]: array([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

In [3]: r = _

In [4]: r['foo']
Out[4]: array([1, 1, 1])

recarray is a subclass of ndarray that just adds attribute access to record arrays.

In [10]: r2 = r.view(recarray)

In [11]: r2
Out[11]: recarray([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

In [12]: r2.foo
Out[12]: array([1, 1, 1])

One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

Record arrays are fundamentally a part of numpy, and no one is even suggesting that they would go away. No one is seriously suggesting that we should remove recarray, but some of us hesitate to recommend its use over plain record arrays.

Does that clarify the discussion for you?

Thanks! This has always been something that has confused me . . .

This is awesome, I guess I built my DataFrame object for nothing :-)

Gabriel

Dan Yamins

4:39 p.m.

...

In [12]: r2.foo
Out[12]: array([1, 1, 1])

One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

Record arrays are fundamentally a part of numpy, and no one is even suggesting that they would go away. No one is seriously suggesting that we should remove recarray, but some of us hesitate to recommend its use over plain record arrays.

Does that clarify the discussion for you?

Yes, thanks very much, this is very helpful. (I think I was confused by the fact that, AFAICT, the Guide to Numpy only mentions recarray -- as distinct from Record arrays -- in one somewhat cryptic line.) But I guess that the numpy documentation work going on now will provide good documentation for using Record Arrays proper?

Travis E. Oliphant

6:07 p.m.

...

Does that clarify the discussion for you?

Yes, thanks very much, this is very helpful. (I think I was confused by the fact that, AFAICT, the Guide to Numpy only mentions recarray -- as distinct from Record arrays -- in one somewhat cryptic line.) But I guess that the numpy documentation work going on now will provide good documentation for using Record Arrays proper?

Incidentally, Eric and I use the term "structured arrays" to refer to NumPy arrays with a complicated dtype, precisely because of the confusion with the recarray subclass that the term "record arrays" sometimes engenders.

-Travis

Gael Varoquaux

10:24 p.m.

I understand all your comments and thank you for making this distinction explicit. I can see why recarray can slow code down, but I find attribute lookup makes code much more readable, and interactive work fantastic (tab completion).

For many of my applications I do have a strong use case for these recarrays, and I am willing to take the speed cost (many of the things I do are very far from being numerically intensive).

On a side note, a pattern I use a lot (and incidentally that Fernando and Brian also came up with in ipython1) is a mixed object that acts like a dictionary (and thus comes with all the goodies like the keys, iterkeys, ... methods, and the "in"), but exposes its keys as attributes:

class Bunch(dict):
    def __init__(self, **kwargs):
        dict.__init__(self, **kwargs)
        self.__dict__ = self

a = Bunch(a=1, b=2)

This is not directly related to the discussion, as the recarrays add more to this (e.g. operations uniform over all the fields), but it does show that this pattern is liked by many people.

My 2 cents,

Gaël

On Thu, Jun 26, 2008 at 03:25:11PM -0500, Robert Kern wrote:

...

Let's be clear, there are two very closely related things: recarrays and record arrays. Record arrays are just ndarrays with a complicated dtype. E.g.

...

In [1]: from numpy import *

...

In [2]: ones(3, dtype=dtype([('foo', int), ('bar', float)]))
Out[2]: array([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

...

In [3]: r = _

...

In [4]: r['foo']
Out[4]: array([1, 1, 1])

...

recarray is a subclass of ndarray that just adds attribute access to record arrays.

...

In [10]: r2 = r.view(recarray)

...

In [11]: r2
Out[11]: recarray([(1, 1.0), (1, 1.0), (1, 1.0)], dtype=[('foo', '<i4'), ('bar', '<f8')])

...

In [12]: r2.foo
Out[12]: array([1, 1, 1])

...

One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

...

Record arrays are fundamentally a part of numpy, and no one is even suggesting that they would go away. No one is seriously suggesting that we should remove recarray, but some of us hesitate to recommend its use over plain record arrays.

...

Does that clarify the discussion for you?

Robert Kern

10:36 p.m.

On Thu, Jun 26, 2008 at 21:24, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:

...

I understand all your comments and thank you for making this distinction explicit. I can see why recarray can slow code down, but I find attribute lookup makes code much more readable, and interactive work fantastic (tab completion).

I'm confused. recarray fields do not show up in any standard tab-completion schemes.

...

For many of my applications I do have a strong use case for these recarrays, and I am willing to take the speed cost (many of the things I do are very far from being numerically intensive).

On a side note, a pattern I use a lot (and incidentally that Fernando and Brian also came up with in ipython1) is a mixed object that acts like a dictionary (and thus comes with all the goodies like the keys, iterkeys, ... methods, and the "in"), but exposes its keys as attributes:

class Bunch(dict):
    def __init__(self, **kwargs):
        dict.__init__(self, **kwargs)
        self.__dict__ = self

a = Bunch(a=1, b=2)

Actually, I wrote that particular snippet in the IPython codebase, but the idea comes from a Cookbook entry:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52308

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
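The Bunch pattern quoted above can be written out as a small runnable sketch. The use of `super()` and the extra assignment at the end are additions made here for illustration; the class itself follows the snippet in the thread.

```python
class Bunch(dict):
    """A dict whose keys are also readable and writable as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Point the instance's attribute namespace at the dict itself,
        # so a.x and a["x"] refer to the same storage.
        self.__dict__ = self


a = Bunch(a=1, b=2)

# Attribute assignment lands in the same dict, so it is visible as a key.
a.c = 3
```

Because the object is still a dict, the usual methods (`keys()`, `in`, iteration) keep working. The trade-off is the same one raised for recarray: a key that matches a method name, such as `keys`, would shadow that method on the instance.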

Gael Varoquaux

10:46 p.m.

On Thu, Jun 26, 2008 at 09:36:38PM -0500, Robert Kern wrote:

...

On Thu, Jun 26, 2008 at 21:24, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:

...
I understand all your comments and thank you for making this distinction explicit. I can see why recarray can slow code down, but I find attribute lookup makes code much more readable, and interactive work fantastic (tab completion).

...

I'm confused. recarray fields do not show up in any standard tab-completion schemes.

Damn, you are right. So I was just making noise, I guess. But my point about code readability remains, and it is shorter to type.

...

Actually, I wrote that particular snippet in the IPython codebase, but the idea comes from a Cookbook entry:

:) Gaël

Fernando Perez

1:10 a.m.

On Thu, Jun 26, 2008 at 1:25 PM, Robert Kern <robert.kern@gmail.com> wrote:

...

One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

I wonder if it wouldn't be useful for *all* numpy arrays to have a .f attribute that would provide attribute access to fields for complex dtypes:

In [13]: r['foo']
Out[13]: array([1, 1, 1])

In [14]: r.f.foo -> Hypothetically, same as [13] above

This object would be in general an empty namespace, thus avoiding the potential for collisions that recarrays have, could normalize field names to be valid python identifiers (spaces to _, etc) and could provide name TAB completion. Since the .f object would be a *separate* object, the main array wouldn't need to have complex python code in the fast path and there would be no speed penalty for other uses of the top level object.

I've never quite liked recarrays because of the fact that they blend the named fields with the main namespace, and because they don't tab complete. I'd happily pay the price of accessing a sub-object for a cleaner and more useful access to fields (I could always do xf=x.f if I am really going to use the field object a lot).

Just an idea, perhaps it's already been shut down in the past.

Cheers,

f
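One way to picture the proposed `.f` object is as a separate wrapper around an ordinary structured array. The sketch below is hypothetical: `FieldAccessor` is a name invented here, NumPy has no such attribute or class, and the name normalization (non-word characters become `_`) is just one possible reading of "spaces to _, etc".

```python
import re

import numpy as np


class FieldAccessor:
    """Hypothetical stand-in for the proposed ``.f`` object (not a NumPy API)."""

    def __init__(self, arr):
        names = arr.dtype.names or ()
        self._arr = arr
        # Map identifier-safe names back to the real field names.
        self._names = {re.sub(r"\W", "_", n): n for n in names}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        state = self.__dict__
        try:
            return state["_arr"][state["_names"][name]]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Listing the fields here is what lets tab completion find them.
        return sorted(self._names)


r = np.ones(3, dtype=[("foo", np.int64), ("trade date", np.float64)])
f = FieldAccessor(r)
foo = f.foo
trade_date = f.trade_date
```

Since the wrapper is a separate object, `r` itself stays a plain ndarray and its own indexing path is untouched, which is the point of the proposal. This sketch does not handle field names that start with a digit or two names that normalize to the same identifier.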

Sebastian Haase

8:10 p.m.

On Fri, Jun 27, 2008 at 7:10 AM, Fernando Perez <fperez.net@gmail.com> wrote:

...

On Thu, Jun 26, 2008 at 1:25 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

I wonder if it wouldn't be useful for *all* numpy arrays to have a .f attribute that would provide attribute access to fields for complex dtypes:

In [13]: r['foo']
Out[13]: array([1, 1, 1])

In [14]: r.f.foo -> Hypothetically, same as [13] above

This object would be in general an empty namespace, thus avoiding the potential for collisions that recarrays have, could normalize field names to be valid python identifiers (spaces to _, etc) and could provide name TAB completion. Since the .f object would be a *separate* object, the main array wouldn't need to have complex python code in the fast path and there would be no speed penalty for other uses of the top level object.

I've never quite liked recarrays because of the fact that they blend the named fields with the main namespace, and because they don't tab complete. I'd happily pay the price of accessing a sub-object for a cleaner and more useful access to fields (I could always do xf=x.f if I am really going to use the field object a lot).

Just an idea, perhaps it's already been shut down in the past.

+1

-- Sebastian Haase

Stéfan van der Walt

July 2008

3:34 a.m.

2008/6/27 Fernando Perez <fperez.net@gmail.com>:

...

On Thu, Jun 26, 2008 at 1:25 PM, Robert Kern <robert.kern@gmail.com> wrote:

...
One downside of this is that the attribute access feature slows down all field accesses, even the r['foo'] form, because it sticks a bunch of pure Python code in the middle. Much code won't notice this, but if you end up having to iterate over an array of records (as I have), this will be a hotspot for you.

I wonder if it wouldn't be useful for *all* numpy arrays to have a .f attribute that would provide attribute access to fields for complex dtypes:

In [13]: r['foo']
Out[13]: array([1, 1, 1])

In [14]: r.f.foo -> Hypothetically, same as [13] above

I like this idea, and think it is worth exploring further. It would have been even better if we could have done

x.f.field.subfield

Unfortunately, there is no way (that I know of) to tell `f` whether getattribute is being called further down the chain. But even having

x.f.field.f.subfield

would already be useful.

Stéfan
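The chained `x.f.field.f.subfield` form can be tried out with a small ndarray subclass. Everything named below (`FieldArray`, `_Fields`, and the `pos`/`x`/`y`/`id` fields) is hypothetical and only meant as a sketch of the idea, not as something NumPy provides.

```python
import numpy as np


class _Fields:
    """Hypothetical namespace returned by ``.f``; one attribute per field."""

    def __init__(self, arr):
        self._arr = arr

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        arr = self.__dict__.get("_arr")
        names = () if arr is None else (arr.dtype.names or ())
        if name in names:
            # Re-wrap the field so that it has its own ``.f`` in turn.
            return arr[name].view(FieldArray)
        raise AttributeError(name)

    def __dir__(self):
        return list(self._arr.dtype.names or ())


class FieldArray(np.ndarray):
    """Hypothetical ndarray subclass exposing fields through ``.f``."""

    @property
    def f(self):
        return _Fields(self)


# A nested structured dtype: "pos" is itself a record with x and y.
dt = np.dtype([("pos", [("x", np.float64), ("y", np.float64)]), ("id", np.int64)])
x = np.zeros(3, dtype=dt).view(FieldArray)
x["pos"]["x"] = [1.0, 2.0, 3.0]

xs = x.f.pos.f.x
```

Each `.f` hop returns a fresh namespace object, so the array's own attribute lookup is not overridden; the price is the extra `.f` at every level, which is the compromise described above.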
