[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Thu Jul 8 13:35:29 EDT 2010

> Forgive me if this has already been addressed, but my question is
> what happens when we have more than one "label" (not as in a labeled
> axis but an observation label -- but not a tick because they're not
> unique!) per, say, row axis and heterogeneous dtypes.  This is really the
> problem that I would like to see addressed and from the BoF comments
> I'm not sure this use case is going to be covered.  I'm also not sure
> I expressed myself clearly enough or understood what's already
> available.  For me, this is the single most common use case and most
> of what we are talking about now is just convenient slicing but
> ignoring some basic and prominent concerns.  Please correct me if I'm
> wrong.  I need to play more with DataArray implementation but haven't
> had time yet.
>
> I often have data that looks like this (not really, but it gives the
> idea in a general way I think).
>
> city, month, year, region, precipitation, temperature
> "Austin", "January", 1980, "South", 12.1, 65.4,
> "Austin", "February", 1980, "South", 24.3, 55.4
> "Austin", "March", 1980, "South", 3, 69.1
> ....
> "Austin", "December", 2009, 1, 62.1
> "Boston", "January", 1980, "Northeast", 1.5, 19.2
> ....
> "Boston","December", 2009, "Northeast", 2.1, 23.5
> ...
> "Memphis","January",1980, "South", 2.1, 35.6
> ...
> "Memphis","December",2009, "South", 1.2, 33.5
> ...

Your labels are unique if you look at them the right way. Here's how I
would represent that in a datarray:
* axis0 = 'city', ['Austin', 'Boston', ...]
* axis1 = 'month', ['January', 'February', ...]
* axis2 = 'year', [1980, 1981, ...]
* axis3 = 'region', ['Northeast', 'South', ...]
* axis4 = 'measurement', ['precipitation', 'temperature']

and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
axis3, axis4].
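
To make that concrete, here is a rough sketch of the layout using plain
NumPy and a dict of label lookups. It is not the datarray API; the
`lookup` table and the `get` helper are names made up for illustration,
and the rows are a handful taken from the quoted table.

```python
import numpy as np

# A few rows from the quoted table (illustrative subset only).
records = [
    ("Austin", "January", 1980, "South", 12.1, 65.4),
    ("Austin", "February", 1980, "South", 24.3, 55.4),
    ("Boston", "January", 1980, "Northeast", 1.5, 19.2),
    ("Memphis", "January", 1980, "South", 2.1, 35.6),
]

# One (name, ticks) pair per axis, in axis order.
axes = [
    ("city", ["Austin", "Boston", "Memphis"]),
    ("month", ["January", "February"]),
    ("year", [1980]),
    ("region", ["Northeast", "South"]),
    ("measurement", ["precipitation", "temperature"]),
]

# Map each tick label to its integer position along its axis.
lookup = {name: {tick: i for i, tick in enumerate(ticks)} for name, ticks in axes}

# Dense 5-D array; NaN marks cells with no observation.
shape = tuple(len(ticks) for _, ticks in axes)
data = np.full(shape, np.nan)

for city, month, year, region, precip, temp in records:
    idx = (lookup["city"][city], lookup["month"][month],
           lookup["year"][year], lookup["region"][region])
    data[idx + (lookup["measurement"]["precipitation"],)] = precip
    data[idx + (lookup["measurement"]["temperature"],)] = temp


def get(**labels):
    """Index by label (hypothetical helper); unnamed axes are kept whole."""
    index = tuple(
        lookup[name][labels[name]] if name in labels else slice(None)
        for name, _ in axes
    )
    return data[index]


austin_jan_temp = get(city="Austin", month="January", year=1980,
                      region="South", measurement="temperature")

# Fraction of cells that actually hold an observation.
filled_fraction = np.count_nonzero(~np.isnan(data)) / data.size
```

Each (city, month, year, region) combination is unique, so every
observation lands in its own cell. Note that `filled_fraction` stays low
even on this toy subset, since each city only ever appears with one
region.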

Now I realize not everyone wants to represent their tabular data as a
big tensor that they index every which way, and I think this is one
thing that pandas is for.

Oh, and the other problem with the 5-D datarray is that you'd probably
want it to be sparse. This is another discussion worth having.

I want to eventually replace the labeling stuff in Divisi with
datarray, but sparse matrices are largely the point of using Divisi.
So how do we make a sparse datarray?

One answer would be to have datarray be a wrapper that encapsulates
any sufficiently matrix-like type. This is approximately what I did in
the now-obsolete Divisi1. Nobody liked the fact that you had to wrap
and unwrap your arrays to accomplish anything that we hadn't thought
of in writing Divisi. I would not recommend this route.

The other option, which is more like Divisi2, would be to provide the
functionality of datarray using a mixin. Then a standard dense
datarray could inherit from (np.ndarray, Datarray), while a sparse
datarray could inherit from (sparse.csr_matrix, Datarray), for
example.
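
A minimal sketch of what the mixin route could look like. Everything
here is hypothetical: `LabelMixin`, `DenseLabeled`, `SparseLabeled` and
their methods are invented names, not datarray or Divisi code, and
`DictSparse` is a toy stand-in for a real sparse type such as
sparse.csr_matrix so that the example only needs NumPy.

```python
import numpy as np


class LabelMixin:
    """Hypothetical mixin: label lookup only, no assumptions about storage."""

    axis_names = ()
    ticks = ()

    def set_labels(self, axis_names, ticks):
        self.axis_names = tuple(axis_names)
        self.ticks = tuple(list(t) for t in ticks)
        return self

    def positions(self, **labels):
        """Translate axis_name=tick pairs into a positional index tuple."""
        index = []
        for name, axis_ticks in zip(self.axis_names, self.ticks):
            if name in labels:
                index.append(axis_ticks.index(labels[name]))
            else:
                index.append(slice(None))
        return tuple(index)

    def lookup(self, **labels):
        # Delegates to whatever __getitem__ the storage base class provides.
        return self[self.positions(**labels)]


class DenseLabeled(np.ndarray, LabelMixin):
    """Dense variant: storage comes from np.ndarray."""

    def __new__(cls, values, axis_names, ticks):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.set_labels(axis_names, ticks)
        # Note: this sketch does not propagate labels to slices or views.
        return obj


class DictSparse:
    """Toy sparse storage; only full integer indices are supported."""

    def __init__(self, shape, entries):
        self.shape = shape
        self.entries = dict(entries)  # {(i, j): value}

    def __getitem__(self, index):
        return self.entries.get(index, 0.0)


class SparseLabeled(DictSparse, LabelMixin):
    """Sparse variant: same mixin, different storage base."""

    def __init__(self, shape, entries, axis_names, ticks):
        super().__init__(shape, entries)
        self.set_labels(axis_names, ticks)


names = ("city", "measurement")
ticks = (["Austin", "Boston"], ["precipitation", "temperature"])
dense = DenseLabeled([[12.1, 65.4], [1.5, 19.2]], names, ticks)
sparse = SparseLabeled((2, 2), {(0, 0): 12.1, (1, 1): 19.2}, names, ticks)

dense_value = dense.lookup(city="Boston", measurement="temperature")
sparse_value = sparse.lookup(city="Austin", measurement="precipitation")
```

The labeling logic lives in one place and never touches the storage
directly, so nothing has to be wrapped or unwrapped: `dense` is still an
ndarray and can be passed to anything that expects one.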

-- Rob