[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Thu Jul 8 23:20:19 EDT 2010

On Thu, Jul 8, 2010 at 22:43, Bruce Southey <bsouthey at gmail.com> wrote:
> On Thu, Jul 8, 2010 at 5:09 PM, Robert Kern <robert.kern at gmail.com> wrote:
>> On Thu, Jul 8, 2010 at 18:00, Bruce Southey <bsouthey at gmail.com> wrote:
>>> On Thu, Jul 8, 2010 at 4:39 PM, Rob Speer <rspeer at mit.edu> wrote:
>>>>>> Still, I have a question. Did you also agree that this should forcibly index
>>>>>> through ticks?
>>>>>>
>>>>>>  arr.something[int]      -> tick-based indexing
>>>>>>
>>>>>
>>>>> Yes.
>>>>
>>>> I feel like people are talking about different things because it's
>>>> unclear what the .something is.
>>>>
>>>> If the .something is an axis name, then no. arr.year[0] should get the
>>>> first year in the data, not the data from the "year 0".
>>>>
>>>> If the .something is the attribute we use for named lookup (such as
>>>> ".named"), then yes. arr.named[2006] should get whatever tick is named
>>>> 2006 on the first axis.
>>>> -- Rob
>>>
>>> Then how is this different from a record array?
>>
>> A record array lets you label exactly one notional "axis" (which isn't
>> actually an axis as far as numpy is concerned). This lets you label
>> all of the axes in a multidimensional array.

> I based this on the example at:
> http://www.scipy.org/RecordArrays
>
> >>> import numpy as np
> >>> img = np.array([[(0,0,0), (1,0,0)], [(0,1,0), (0,0,1)]], {'names': ('named','g','b'), 'formats': ('f4', 'f4', 'f4')})
> >>> arr = img.view(np.recarray)
> >>> arr.named
> array([[ 0.,  1.],
>       [ 0.,  0.]], dtype=float32)
> >>> arr.named[:,1]
> array([ 1.,  0.], dtype=float32)
> >>> img['named']
> array([[ 0.,  1.],
>       [ 0.,  0.]], dtype=float32)
> >>> arr['named']
> array([[ 0.,  1.],
>       [ 0.,  0.]], dtype=float32)

I really don't know what you think you are demonstrating with this example.
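
To make the record-array point concrete, here is a minimal sketch that
reuses the dtype from the quoted example. It only illustrates that the
field names live in the dtype, so numpy still sees two unlabelled axes;
the variable name `dt` is introduced here for illustration.

```python
import numpy as np

# Same structured dtype as in the quoted example: three float32 fields.
dt = np.dtype({'names': ('named', 'g', 'b'), 'formats': ('f4', 'f4', 'f4')})
img = np.array([[(0, 0, 0), (1, 0, 0)], [(0, 1, 0), (0, 0, 1)]], dtype=dt)

# The field names belong to the dtype, not to the shape: numpy still sees
# a 2-D array, and neither of its two real axes carries a label.
print(img.shape)        # (2, 2)
print(img.dtype.names)  # ('named', 'g', 'b')

# Selecting a field returns an array with the same two unlabelled axes.
print(img['named'].shape)  # (2, 2)
```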

> I think that we need consistency with ndarrays such that the first
> index is to the first axis, the second is to the second axis etc. This
> means that the actual axis name is perhaps irrelevant when indexing
> and slicing etc. Actually I have trouble thinking about how you refer
> to a single axis in the multidimensional case without addressing
> the other axes.

There are two related, but distinct concepts being proposed: being
able to label axes and being able to label indices *along* each axis.
The proposal for referring to a single labelled axis is through an
attribute off of the ndarray.

>>> narr = DataArray(np.zeros((1,2,3)), labels=('a','b','c'))
>>> narr.axis.a
Axis(label='a', index=0, ticks=None)
>>> narr.axis.a[0]
DataArray([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
('b', 'c')
>>> narr.axis.a[0].axes
(Axis(label='b', index=0, ticks=None), Axis(label='c', index=1, ticks=None))
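
For anyone without datarray installed, the same idea can be sketched
with plain numpy. This is only an illustration under stated
assumptions, not how datarray is implemented: `index_named_axis` is a
hypothetical helper, and the labels are kept in an ordinary tuple
alongside the array.

```python
import numpy as np

def index_named_axis(data, labels, name, idx):
    """Index `data` along the axis called `name`.

    Hypothetical helper for illustration only; not part of datarray.
    Returns the indexed array and the labels of the remaining axes.
    """
    axis = labels.index(name)            # axis label -> positional axis
    key = [slice(None)] * data.ndim      # take everything on the other axes
    key[axis] = idx
    result = data[tuple(key)]
    # An integer index drops the axis, so its label is dropped as well.
    remaining = labels[:axis] + labels[axis + 1:]
    return result, remaining

data = np.zeros((1, 2, 3))
labels = ('a', 'b', 'c')

sub, sub_labels = index_named_axis(data, labels, 'a', 0)
print(sub.shape)    # (2, 3)
print(sub_labels)   # ('b', 'c')
```

The other axes never have to be mentioned, which is the point of
addressing a single axis by its label.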

> So from an example from Lluis:
> "As axes always have a total order, I'd go for the most compact representation
> (assuming 'country' is the first axis, and 'year' the second one):
>  arr['Netherlands','2010']
> "

This is purely referring to the latter. This is not using labelled
axes at all but the "tick" labeling of indices. The first index is
indexing into the first axis, the second index into the second axis,
just as you said.
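
A rough sketch of that kind of tick lookup, again in plain numpy and
not datarray's actual API: the data values, the per-axis `ticks`
dictionaries and the `tick_index` helper are all made up for
illustration.

```python
import numpy as np

# Hypothetical data: rows are countries, columns are years.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# One mapping per axis, in axis order: tick name -> integer position.
ticks = (
    {'Belgium': 0, 'Netherlands': 1},     # ticks along the first axis
    {'2008': 0, '2009': 1, '2010': 2},    # ticks along the second axis
)

def tick_index(data, ticks, *names):
    """Translate one tick name per axis, in axis order, to integer positions."""
    positions = tuple(ticks[axis][name] for axis, name in enumerate(names))
    return data[positions]

print(tick_index(values, ticks, 'Netherlands', '2010'))  # 6.0
```

No axis label appears anywhere: the first name is looked up on the
first axis and the second name on the second axis, purely by position.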

Please install Fernando's datarray package, play with it, read its
documentation, then come back with objections or alternatives. I
really don't think you understand what is being proposed.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco