[Numpy-discussion] BOF notes: Fernando's proposal: NumPy ndarray with named axes

Thu Jul 8 14:41:50 EDT 2010

On Thu, Jul 8, 2010 at 2:27 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Thu, Jul 8, 2010 at 1:35 PM, Rob Speer <rspeer at mit.edu> wrote:
>> Your labels are unique if you look at them the right way. Here's how I
>> would represent that in a datarray:
>> * axis0 = 'city', ['Austin', 'Boston', ...]
>> * axis1 = 'month', ['January', 'February', ...]
>> * axis2 = 'year', [1980, 1981, ...]
>> * axis3 = 'region', ['Northeast', 'South', ...]
>> * axis4 = 'measurement', ['precipitation', 'temperature']
>>
>> and then I'd make a 5-D datarray labeled with [axis0, axis1, axis2,
>> axis3, axis4].
>>
>
> Yeah, this is what I was thinking I would have to do, but it's still
> not clear to me (I have trouble trying to think in 5 dimensions...).
> For instance, what axis holds my actual numeric data?
>
> axis4, with a "precipitation" tick?

Yep, that's what I was suggesting. Or you could have two different 4-D
arrays, one whose values are precipitation and one whose values are
temperatures.
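
To make the layout concrete, here is a rough sketch in plain NumPy rather
than datarray itself. The tick lists are cut down to two entries each, and
the `locate` helper and the 2.2 reading are made up for illustration; none
of this is the datarray API.

```python
import numpy as np

# Axis names and ticks, abbreviated from the list quoted above.
axes = [
    ("city", ["Austin", "Boston"]),
    ("month", ["January", "February"]),
    ("year", [1980, 1981]),
    ("region", ["Northeast", "South"]),
    ("measurement", ["precipitation", "temperature"]),
]

# One cell per combination of ticks; NaN marks "no observation".
shape = tuple(len(ticks) for _, ticks in axes)
data = np.full(shape, np.nan)

# Map each tick to its integer position along its axis.
tick_index = {name: {t: i for i, t in enumerate(ticks)} for name, ticks in axes}


def locate(**labels):
    """Hypothetical helper: turn axis=tick keywords into an index tuple.

    Axes that are not mentioned are kept whole (slice(None)).
    """
    return tuple(
        tick_index[name][labels[name]] if name in labels else slice(None)
        for name, _ in axes
    )


# The numeric value lives in the cell addressed by one tick per axis.
data[locate(city="Austin", month="January", year=1980,
            region="South", measurement="precipitation")] = 2.2  # made-up value

# The alternative: two 4-D arrays, one per measurement.
precipitation = data[locate(measurement="precipitation")]
temperature = data[locate(measurement="temperature")]
```

Fixing the 'measurement' tick drops that axis, so each of the two results
is a 4-D view over (city, month, year, region).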

>> Now I realize not everyone wants to represent their tabular data as a
>> big tensor that they index every which way, and I think this is one
>> thing that pandas is for.
>
> This is kind of where I would like the divide to be between user and
> developer.  On top of all of this, I would like to see a __repr__ or
> something that actually spits out a 2d spreadsheet-looking
> representation.  It would help me stay sane I think.  Fernando's nice
> 3D graphic can only go so far as a mental model (for me at least).

Divisi2 uses a 2D labeled representation as its __str__ -- an example
is at http://csc.media.mit.edu/docs/divisi2/sparse.html

I could port this onto datarray. I was holding off because I was
unsure about how to represent the N-d case, but I realize now that
showing the entries in this kind of 2-D tabular format could actually
be a really intuitive way to do it.
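
One possible shape for the N-d case, as a sketch only: put the last axis
across the columns and write one row per combination of ticks on the
remaining axes. The `as_table` function and the sample data are
hypothetical; this is not what Divisi2 or datarray actually does.

```python
import itertools

import numpy as np


def as_table(data, axes):
    """Render an N-d array as text.

    The last axis runs across the columns; every combination of ticks on
    the other axes becomes one row, labeled by those ticks.
    """
    *row_axes, (_, col_ticks) = axes
    header = [name for name, _ in row_axes] + [str(t) for t in col_ticks]
    rows = [header]
    for combo in itertools.product(*(range(len(t)) for _, t in row_axes)):
        labels = [str(ticks[i]) for (_, ticks), i in zip(row_axes, combo)]
        values = ["%g" % v for v in data[combo]]
        rows.append(labels + values)
    # Right-align every column to its widest cell.
    widths = [max(len(r[c]) for r in rows) for c in range(len(header))]
    return "\n".join(
        "  ".join(cell.rjust(w) for cell, w in zip(r, widths)) for r in rows
    )


# Made-up 3-D example: city x year x measurement.
axes = [
    ("city", ["Austin", "Boston"]),
    ("year", [1980, 1981]),
    ("measurement", ["precipitation", "temperature"]),
]
table = as_table(np.arange(8.0).reshape(2, 2, 2), axes)
print(table)
```

The row count is the product of the lengths of all but the last axis, so a
real __str__ would probably want to truncate long tables.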

> Mix-ins sound reasonable to me as long as this could easily be
> accomplished.  I.e., why use csr?  Can you go between others?  Are the
> sparse matrices reasonably stable given recent activity?  Not
> rhetorical questions, I don't use sparse matrices much.

These are good questions.

I ended up using PySparse instead of scipy.sparse, because SciPy 0.7's
sparse matrices weren't ready to support many important operations,
particularly slicing. SciPy 0.8's sparse matrices look much better,
and I may transition to using them once it's released.

When planning future features of NumPy, of course, we should assume
SciPy's sparse matrices do what we want (and possibly fix them if they
don't).

csr_matrix was just an example. I think there would have to be
separate classes for labeled csr_matrices, labeled lil_matrices, and
so on, supporting all the usual methods for converting between them.
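
A minimal sketch of the conversion idea, assuming a current scipy.sparse.
For brevity it uses a single hypothetical wrapper class, `LabeledSparse`,
instead of one class per format, and the labels and values are made up.
The scipy calls used (`lil_matrix`, `asformat`, the `format` attribute)
do exist.

```python
from scipy import sparse


class LabeledSparse:
    """Hypothetical wrapper: a 2-D scipy.sparse matrix plus row/column labels."""

    def __init__(self, matrix, row_labels, col_labels):
        if matrix.shape != (len(row_labels), len(col_labels)):
            raise ValueError("labels do not match matrix shape")
        self.matrix = matrix
        self.row_labels = list(row_labels)
        self.col_labels = list(col_labels)

    def asformat(self, fmt):
        """Convert the underlying matrix (e.g. 'csr', 'lil'), keeping the labels."""
        return LabeledSparse(
            self.matrix.asformat(fmt), self.row_labels, self.col_labels
        )

    def get(self, row, col):
        """Look up one entry by label instead of by integer position."""
        return self.matrix[self.row_labels.index(row), self.col_labels.index(col)]


# Build in LIL format (cheap to fill incrementally), then convert to CSR.
lil = sparse.lil_matrix((2, 3))
lil[0, 2] = 1.5  # made-up value
labeled = LabeledSparse(lil, ["Austin", "Boston"], ["Jan", "Feb", "Mar"])
csr = labeled.asformat("csr")
```

The labels carry over unchanged because a format conversion never reorders
rows or columns.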
-- Rob