Re: [Numpy-discussion] Record arrays

June 26, 2008

On Thu, Jun 26, 2008 at 15:13, Dan Yamins <dyamins@gmail.com> wrote:
...
On Thu, Jun 26, 2008 at 3:34 PM, Gael Varoquaux
<gael.varoquaux@normalesup.org> wrote:
...
On Thu, Jun 26, 2008 at 11:48:06AM -0500, John Hunter wrote:
...
I personally think they are the best thing since sliced bread, and
everyone here who uses them becomes immediately addicted to them.  I
would like to see better support for them, especially making the attrs
exposed to dir so tab completion would work.
...
People in the financial/business world work with spreadsheet data a
lot, and record arrays are the natural data structure to represent
tabular, heterogeneous data. If you work with this data all day,
you save a lot of ugly keystrokes doing r.date rather than r['date'],
and the code is prettier in my opinion.
I am +1 on all that.

I also completely second this.  I use them all the time -- for finance data
as well as biological/genomics data.  It is essential for these applications
to have spreadsheet-like objects that can have mixed types and from which
good numpy numerical arrays can be extracted when necessary.  I hope to
continue having access to them or something like them.  I also hope that
they will be better documented, since not only do I use them all the time,
I'm hoping to teach their use to many more people whom I am training in
spreadsheet-like data analysis.

(If they have some flaw I don't understand, it would be great if someone
could explain it to me.  And if there's something out there that fixes that
flaw, I'd love to hear about it.  But it seems to me at least that recarrays
are very useful.)
Let's be clear, there are two very closely related things: recarrays
and record arrays. Record arrays are just ndarrays with a complicated
dtype. E.g.

In [1]: from numpy import *

In [2]: ones(3, dtype=dtype([('foo', int), ('bar', float)]))
Out[2]:
array([(1, 1.0), (1, 1.0), (1, 1.0)],
      dtype=[('foo', '<i4'), ('bar', '<f8')])

In [3]: r = _

In [4]: r['foo']
Out[4]: array([1, 1, 1])
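
For anyone who would rather not use "from numpy import *", here is a
minimal script version of the same idea. It is only a sketch: the field
names 'foo' and 'bar' come from the session above, the explicit int64 and
float64 widths are my assumption (the session shows a platform default of
'<i4'), and the write through `foo` is added purely to show that field
access gives a view rather than a copy.

```python
import numpy as np

# A record array is a plain ndarray whose dtype has named fields.
dt = np.dtype([('foo', np.int64), ('bar', np.float64)])
r = np.ones(3, dtype=dt)

# Indexing by field name returns a view onto that column of the data.
foo = r['foo']
foo[0] = 5          # writes through to r

# Indexing by position returns a single record.
first = r[0]
print(type(r).__name__, r.dtype.names)
print(r['foo'], first['bar'])
```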

recarray is a subclass of ndarray that just adds attribute access to
record arrays.

In [10]: r2 = r.view(recarray)

In [11]: r2
Out[11]:
recarray([(1, 1.0), (1, 1.0), (1, 1.0)],
      dtype=[('foo', '<i4'), ('bar', '<f8')])

In [12]: r2.foo
Out[12]: array([1, 1, 1])
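
The same step as a standalone script, again only as a sketch with the
field names borrowed from the session above and the explicit field widths
assumed. It shows that the recarray is a view on the same memory, so
r2.foo and r2['foo'] are two spellings of one lookup.

```python
import numpy as np

r = np.ones(3, dtype=[('foo', np.int64), ('bar', np.float64)])

# Viewing as recarray copies nothing; it only changes the array's class
# so that fields are also reachable as attributes.
r2 = r.view(np.recarray)

same_values = np.array_equal(r2.foo, r2['foo'])
same_memory = np.shares_memory(r, r2)

# A write through the attribute form is visible in the original array.
r2.bar[1] = 2.5
print(same_values, same_memory, r['bar'])
```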

One downside of this is that the attribute access feature slows down
all field accesses, even the r['foo'] form, because it sticks a bunch
of pure Python code in the middle. Much code won't notice this, but if
you end up having to iterate over an array of records (as I have),
this will be a hotspot for you.
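
One way to sidestep that cost, sketched under my own assumptions rather
than as a recommendation from numpy itself: keep the plain record array
around for loops, pull the fields out once before iterating, and use the
recarray view only where the attribute syntax is convenient. The array
size and the sum computed here are made up for illustration, and how much
this saves depends on the numpy version, so measure before relying on it.

```python
import numpy as np

# Plain record array for the hot loop (field names are illustrative).
r = np.ones(1000, dtype=[('foo', np.int64), ('bar', np.float64)])

# recarray view for interactive, attribute-style use; shares r's memory.
r2 = r.view(np.recarray)

# Look the fields up once, outside the loop, on the plain ndarray.
foo = r['foo']
bar = r['bar']

total = 0.0
for f, b in zip(foo, bar):
    total += f * b

print(total)
```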

Record arrays are fundamentally a part of numpy, and no one is even
suggesting that they would go away. No one is seriously suggesting
that we should remove recarray, but some of us hesitate to recommend
its use over plain record arrays.

Does that clarify the discussion for you?

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco