[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Thu Jul 8 14:07:43 EDT 2010

On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer <rspeer at mit.edu> wrote:
>> Forgive me if this has already been addressed, but my question is
>> what happens when we have more than one "label" (not as in a labeled
>> axis but an observation label -- but not a tick because they're not
>> unique!) per, say, row axis and heterogeneous dtypes.  This is really the
>> problem that I would like to see addressed and from the BoF comments
>> I'm not sure this use case is going to be covered.  I'm also not sure
>> I expressed myself clearly enough or understood what's already
>> available.  For me, this is the single most common use case and most
>> of what we are talking about now is just convenient slicing but
>> ignoring some basic and prominent concerns.  Please correct me if I'm
>> wrong.  I need to play more with the DataArray implementation but haven't
>> had time yet.
>>
>> I often have data that looks like this (not really, but it gives the
>> idea in a general way I think).
>>
>> city, month, year, region, precipitation, temperature
>> "Austin", "January", 1980, "South", 12.1, 65.4
>> "Austin", "February", 1980, "South", 24.3, 55.4
>> "Austin", "March", 1980, "South", 3, 69.1
>> ....
>> "Austin", "December", 2009, "South", 1, 62.1
>> "Boston", "January", 1980, "Northeast", 1.5, 19.2
>> ....
>> "Boston","December", 2009, "Northeast", 2.1, 23.5
>> ...
>> "Memphis","January",1980, "South", 2.1, 35.6
>> ...
>> "Memphis","December",2009, "South", 1.2, 33.5
>> ...
>
> Your labels are unique if you look at them the right way. Here's how I
> would represent that in a datarray:
> * axis0 = 'city', ['Austin', 'Boston', ...]
> * axis1 = 'month', ['January', 'February', ...]
> * axis2 = 'year', [1980, 1981, ...]
> * axis3 = 'region', ['Northeast', 'South', ...]
> * axis4 = 'measurement', ['precipitation', 'temperature']
>
> and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
> axis3, axis4].
>
> Now I realize not everyone wants to represent their tabular data as a
> big tensor that they index every which way, and I think this is one
> thing that pandas is for.
>
> Oh, and the other problem with the 5-D datarray is that you'd probably
> want it to be sparse. This is another discussion worth having.
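
To make the sparsity point concrete, here is a minimal sketch in plain Python. It is not the datarray API: the `cells` dict and the `lookup` helper are hypothetical, and the tick lists are cut down to a handful of values. It stores only the rows that are written out in full in the table above, keyed by one label per axis, and compares that with the size of the equivalent dense 5-D array.

```python
# Axis names and tick labels, following the five-axis layout proposed above.
# The tick lists are deliberately truncated for illustration.
axes = {
    "city": ["Austin", "Boston", "Memphis"],
    "month": ["January", "February", "March"],
    "year": [1980, 1981],
    "region": ["Northeast", "South"],
    "measurement": ["precipitation", "temperature"],
}
axis_names = list(axes)

# Rows taken from the example table (only rows shown in full there).
records = [
    ("Austin", "January", 1980, "South", 12.1, 65.4),
    ("Austin", "February", 1980, "South", 24.3, 55.4),
    ("Austin", "March", 1980, "South", 3.0, 69.1),
    ("Boston", "January", 1980, "Northeast", 1.5, 19.2),
    ("Memphis", "January", 1980, "South", 2.1, 35.6),
]

# Hypothetical sparse storage: one dict entry per filled cell,
# keyed by a tuple with one label per axis.
cells = {}
for city, month, year, region, precip, temp in records:
    for measurement, value in (("precipitation", precip), ("temperature", temp)):
        cells[(city, month, year, region, measurement)] = value

# Number of cells a dense 5-D array over the same ticks would allocate.
dense_size = 1
for ticks in axes.values():
    dense_size *= len(ticks)


def lookup(**labels):
    """Return the value at the given labels, or None if the cell is empty."""
    key = tuple(labels[name] for name in axis_names)
    return cells.get(key)


print(len(cells), "filled cells out of", dense_size)
```

Because each city belongs to exactly one region, at least half of the dense cells can never be filled with two regions, and the fraction grows with the number of regions. That is one way to see why a dense 5-D layout wastes space for this kind of table.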

I have thought quite a bit about the sparsity problem as well. I took
a first crack at a sparse data structure for panel (3D) data called
LongPanel, so basically each row has two labels, and you can fairly
efficiently convert to the dense (3D) form. It's also capable of
constructing dummy variables for a fixed effects regression. There, of
course, per Skipper's question, you will nearly always have duplicate
labels -- I bet it's something we could generalize. It's also very much
related to the group-by procedures we've discussed.
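
As a rough illustration of the long-to-dense conversion and the dummy variables, here is a sketch using only NumPy. It is not the actual LongPanel code; the variable names and the (column, major, minor) layout of the dense array are assumptions made for this example, and the four rows are taken from the table earlier in the thread.

```python
import numpy as np

# Long format: one row per observation, two labels per row.
# Labels repeat across rows ("Austin" and "January" each appear more than once).
major = np.array(["January", "February", "January", "January"])  # month
minor = np.array(["Austin", "Austin", "Boston", "Memphis"])      # city
values = np.array([
    [12.1, 65.4],   # precipitation, temperature
    [24.3, 55.4],
    [1.5, 19.2],
    [2.1, 35.6],
])

# Map each label to an integer position along its axis.
# np.unique returns the ticks in sorted order.
major_ticks, major_idx = np.unique(major, return_inverse=True)
minor_ticks, minor_idx = np.unique(minor, return_inverse=True)

# Dense 3-D form: (value column, major label, minor label).
# Cells with no observation stay NaN.
dense = np.full((values.shape[1], len(major_ticks), len(minor_ticks)), np.nan)
dense[:, major_idx, minor_idx] = values.T

# Dummy (indicator) variables for the minor label, one column per city,
# as one might use for a fixed effects regression.
dummies = (minor_idx[:, None] == np.arange(len(minor_ticks))).astype(float)

print(dense.shape)
print(dummies)
```

The integer codes produced by `return_inverse=True` are the same thing a group-by needs, which is one way to see the connection mentioned above.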

> I want to eventually replace the labeling stuff in Divisi with
> datarray, but sparse matrices are largely the point of using Divisi.
> So how do we make a sparse datarray?
>
> One answer would be to have datarray be a wrapper that encapsulates
> any sufficiently matrix-like type. This is approximately what I did in
> the now-obsolete Divisi1. Nobody liked the fact that you had to wrap
> and unwrap your arrays to accomplish anything that we hadn't thought
> of in writing Divisi. I would not recommend this route.
>
> The other option, which is more like Divisi2, would be to provide the
> functionality of datarray using a mixin. Then a standard dense
> datarray could inherit from (np.ndarray, Datarray), while a sparse
> datarray could inherit from (sparse.csr_matrix, Datarray), for
> example.
>
> -- Rob
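
The mixin idea can be sketched roughly as follows. None of this is Divisi2 or datarray code: `LabeledMixin`, `DenseLabeled`, `DictSparse`, and `SparseLabeled` are hypothetical names, and `DictSparse` is a toy dict-of-keys class standing in for a real sparse type such as `sparse.csr_matrix`, so that the example needs only NumPy.

```python
import numpy as np


class LabeledMixin:
    """Hypothetical mixin: holds axis names and ticks, no storage of its own."""

    def set_labels(self, axis_names, ticks):
        self.axis_names = list(axis_names)
        self.ticks = [list(t) for t in ticks]
        return self

    def get(self, **labels):
        # Translate one label per axis into an integer index tuple,
        # then defer to the storage class's own __getitem__.
        idx = tuple(
            self.ticks[i].index(labels[name])
            for i, name in enumerate(self.axis_names)
        )
        return self[idx]


class DenseLabeled(np.ndarray, LabeledMixin):
    """Dense storage from np.ndarray, labels from the mixin."""

    def __new__(cls, data, axis_names, ticks):
        obj = np.asarray(data).view(cls)
        obj.set_labels(axis_names, ticks)
        return obj

    def __array_finalize__(self, obj):
        # Carry labels over to views and slices. This sketch does not
        # adjust the ticks when a slice drops elements.
        if obj is None:
            return
        self.axis_names = getattr(obj, "axis_names", None)
        self.ticks = getattr(obj, "ticks", None)


class DictSparse:
    """Toy dict-of-keys storage standing in for a real sparse matrix type."""

    def __init__(self, shape):
        self.shape = shape
        self._data = {}

    def __getitem__(self, idx):
        return self._data.get(idx, 0.0)

    def __setitem__(self, idx, value):
        self._data[idx] = value


class SparseLabeled(DictSparse, LabeledMixin):
    """Sparse storage from DictSparse, labels from the same mixin."""

    def __init__(self, shape, axis_names, ticks):
        super().__init__(shape)
        self.set_labels(axis_names, ticks)


names = ["city", "measurement"]
ticks = [["Austin", "Boston"], ["precipitation", "temperature"]]

dense = DenseLabeled([[12.1, 65.4], [1.5, 19.2]], names, ticks)
sparse = SparseLabeled((2, 2), names, ticks)
sparse[(0, 1)] = 65.4

print(dense.get(city="Boston", measurement="temperature"))
print(sparse.get(city="Austin", measurement="temperature"))
```

The point of the design is that `get` is written once and works on either storage class, and both objects remain ordinary instances of their storage type, so no wrapping or unwrapping is needed to pass them to existing code.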