Mailman 3 Datarray BoF, part2 - NumPy-Discussion

Datarray BoF, part2

Keith Goodman

21 Jul 2010

9:37 a.m.

About a dozen people attended what was billed as a continuation of the SciPy 2010 datarray BoF. We met at UC Berkeley on July 19 as part of the py4science series.

A datarray is a subclass of a Numpy array that adds the ability to label the axes and to label the elements along each axis.

We spent most of the time discussing how to index with tick labels. The main issue is with integers: is an integer index a tick name or a position index?

At the top level, datarrays always use regular Numpy indexing: an int is a position, never a label. So darr[0] always returns the first element of the datarray.

The ambiguity occurs in specialized indexing methods that allow indexing by tick label name (because the name could be an int). To break the ambiguity, the proposal was to provide several tick indexing methods[1]:

1. Integers are always labels
2. Integers are never treated as labels
3. Try 1, then 2

We also discussed allowing axis labels to be any hashable object (currently only strings are allowed). The main problem: integers. Currently if an axis is labeled, say, "time", you can do darr.sum(axis="time"). What happens when an axis is labeled with an int? What does the 2 in darr.sum(axis=2) refer to? A position or a label? The same problem exists for floats since a float is (currently) a valid axis for Numpy arrays.

References:

[1] http://github.com/fperez/datarray/commit/3c5151baa233675b355058eb3ba028d2629...
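To make the three proposed tick indexing methods concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names and the dictionary layout are hypothetical and are not taken from the datarray code. Each function turns a key into a position along one axis.

```python
# Hypothetical sketch: none of these names come from the datarray code base.

def build_tick_dict(ticks):
    """Map each tick label to its position along the axis."""
    return {label: pos for pos, label in enumerate(ticks)}


def index_label_only(tick_dict, key):
    """Option 1: an integer key is always a label."""
    return tick_dict[key]  # raises KeyError if there is no such label


def index_position_for_ints(tick_dict, key):
    """Option 2: an integer key is never a label, always a position."""
    if isinstance(key, int):
        n = len(tick_dict)
        if not -n <= key < n:
            raise IndexError(key)
        return key % n  # normalize negative positions
    return tick_dict[key]  # non-integer keys are still labels


def index_label_then_position(tick_dict, key):
    """Option 3: try the label interpretation first, then fall back to position."""
    try:
        return index_label_only(tick_dict, key)
    except KeyError:
        return index_position_for_ints(tick_dict, key)


# Integer ticks that do not coincide with their positions.
ticks = build_tick_dict([10, 0, 20])
```

With these ticks, the key 0 resolves to position 1 under options 1 and 3 (the label 0 sits at position 1) but to position 0 under option 2, which is exactly the ambiguity described above.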

John Salvatier

21 Jul

9:56 a.m.

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

On Wed, Jul 21, 2010 at 9:37 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

About a dozen people attended what was billed as a continuation of the SciPy 2010 datarray BoF. We met at UC Berkeley on July 19 as part of the py4science series.

A datarray is a subclass of a Numpy array that adds the ability to label the axes and to label the elements along each axis.

We spent most of the time discussing how to index with tick labels. The main issue is with integers: is an integer index a tick name or a position index?

At the top level, datarrays always use regular Numpy indexing: an int is a position, never a label. So darr[0] always returns the first element of the datarray.

The ambiguity occurs in specialized indexing methods that allow indexing by tick label name (because the name could be an int). To break the ambiguity, the proposal was to provide several tick indexing methods[1]:

1. Integers are always labels
2. Integers are never treated as labels
3. Try 1, then 2

We also discussed allowing axis labels to be any hashable object (currently only strings are allowed). The main problem: integers. Currently if an axis is labeled, say, "time", you can do darr.sum(axis="time"). What happens when an axis is labeled with an int? What does the 2 in darr.sum(axis=2) refer to? A position or a label? The same problem exists for floats since a float is (currently) a valid axis for Numpy arrays.

References:

[1] http://github.com/fperez/datarray/commit/3c5151baa233675b355058eb3ba028d2629...

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Keith Goodman

10:08 a.m.

On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier <jsalvati@u.washington.edu> wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

    if dar.intaslabel:
        dar2 = dar[tickmap(i)]
    else:
        dar2 = dar[i]
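A small sketch of why a per-array flag forces that kind of check. The class below is a hypothetical stand-in for a labeled 1-d container, not the datarray implementation; only the attribute name `intaslabel` is borrowed from the snippet above.

```python
class FlaggedArray:
    """Hypothetical 1-d labeled container whose indexing depends on a flag."""

    def __init__(self, values, ticks, intaslabel):
        self.values = list(values)
        # Map each tick label to its position.
        self.tick_dict = {tick: pos for pos, tick in enumerate(ticks)}
        self.intaslabel = intaslabel

    def __getitem__(self, key):
        if self.intaslabel:
            # Integers are looked up as tick labels.
            return self.values[self.tick_dict[key]]
        # Integers are plain positions.
        return self.values[key]


# Same data, same ticks, same expression, different flag.
a = FlaggedArray([1.5, 2.5, 3.5], ticks=[2, 0, 1], intaslabel=True)
b = FlaggedArray([1.5, 2.5, 3.5], ticks=[2, 0, 1], intaslabel=False)
```

Here `a[0]` and `b[0]` return different elements even though the code that indexes them is identical, so any function receiving such an object would have to inspect the flag before indexing.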

M Trumpis

10:58 a.m.

On Wed, Jul 21, 2010 at 10:08 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier <jsalvati@u.washington.edu> wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

    if dar.intaslabel:
        dar2 = dar[tickmap(i)]
    else:
        dar2 = dar[i]

My thoughts too. It's dangerous to simply have a toggle that changes what darr.<slicingobject>[2] means. It is much safer to have slicing options that always approach slicing with consistent rules.

Separately, regarding the permissible axis labels, I think we must not allow any enumerated axis labels (i.e., ints and floats). I don't remember if there was a consensus about that yesterday. We don't have the flexibility in the ndarray API to allow the expression darr.method(axis=2) to mean not the 2nd dimension, but the Axis with label==2.

Mike

Keith Goodman

11:08 a.m.

On Wed, Jul 21, 2010 at 10:58 AM, M Trumpis <mtrumpis@berkeley.edu> wrote:

Separately, regarding the permissible axis labels, I think we must not allow any enumerated axis labels (ie, ints and floats). I don't remember if there was a consensus about that yesterday. We don't have the flexibility in the ndarray API to allow for the expression darr.method(axis=2) to mean not the 2nd dimension, but the Axis with label==2

So the axis label rule could be either:

1. str only
2. Any hashable object except int or float

#1 is looking better and better. Plus you already coded it :)
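A minimal sketch of what rule 1 buys, assuming hypothetical helper names that are not part of datarray: if axis labels are always strings, the type of the `axis` argument alone decides whether it is a label or a position.

```python
def check_axis_labels(labels):
    """Rule 1: every axis label must be a str."""
    for label in labels:
        if not isinstance(label, str):
            raise TypeError("axis label %r is not a string" % (label,))
    return tuple(labels)


def resolve_axis(axis, axis_labels):
    """Return the positional index for an `axis` argument.

    A str is looked up among the axis labels; an int is a position.
    """
    if isinstance(axis, str):
        return axis_labels.index(axis)  # ValueError if the label is unknown
    if isinstance(axis, int):
        ndim = len(axis_labels)
        if not -ndim <= axis < ndim:
            raise IndexError("axis %d is out of range" % axis)
        return axis % ndim  # normalize negative positions
    raise TypeError("axis must be a str label or an int position")


labels = check_axis_labels(["time", "distance"])
```

Under rule 2 the same dispatch would still work, since ints and floats are excluded, but the label lookup would have to accept arbitrary hashable objects.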

Rob Speer

2:32 p.m.

I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

-- Rob

On Wed, Jul 21, 2010 at 2:08 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 10:58 AM, M Trumpis <mtrumpis@berkeley.edu> wrote:

Separately, regarding the permissible axis labels, I think we must not allow any enumerated axis labels (ie, ints and floats). I don't remember if there was a consensus about that yesterday. We don't have the flexibility in the ndarray API to allow for the expression darr.method(axis=2) to mean not the 2nd dimension, but the Axis with label==2

So the axis label rule could be either:

1. str only
2. Any hashable object except int or float

#1 is looking better and better. Plus you already coded it :)

Skipper Seabold

2:38 p.m.

On Wed, Jul 21, 2010 at 5:32 PM, Rob Speer <rspeer@mit.edu> wrote:

I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

+1 This is what I was thinking as well.

Skipper

Joshua Holbrook

2:42 p.m.

Make that +2.

--Josh

On Wed, Jul 21, 2010 at 1:38 PM, Skipper Seabold <jsseabold@gmail.com> wrote:

On Wed, Jul 21, 2010 at 5:32 PM, Rob Speer <rspeer@mit.edu> wrote:

I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

+1 This is what I was thinking as well.

Skipper

Keith Goodman

2:48 p.m.

On Wed, Jul 21, 2010 at 2:32 PM, Rob Speer <rspeer@mit.edu> wrote:

I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

What started the discussion was that someone wanted to have more than one label name for one axis. So I suggested that if we allow any hashable object as an axis label then, for example, a tuple could be used to hold multiple names.

That would also allow a datarray to be flattened to 1d since the axis labels could be combined into a tuple. So a 2d datarray with axis names "time" and "distance" and ticks 't1', 't2' and 'd1', 'd2' could flatten to

axis --> ('time', 'distance')
ticks --> [('t1', 'd1'), ('t1', 'd2'), ('t2', 'd1'), ('t2', 'd2')]

An unflatten function along with a fill value could unflatten the datarray.
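The flatten/unflatten idea can be sketched with nested lists standing in for a 2d datarray. The function names below are hypothetical, and the sketch assumes row-major order and a fill value for tick combinations that are missing when unflattening.

```python
from itertools import product


def flatten_2d(data, axis_names, ticks):
    """Flatten a 2-d nested list to 1-d values with tuple tick labels.

    `ticks` is a pair of tick lists, one per axis.
    """
    flat_axis = tuple(axis_names)
    flat_ticks = list(product(*ticks))  # row-major order, like the data
    flat_values = [value for row in data for value in row]
    return flat_axis, flat_ticks, flat_values


def unflatten_2d(flat_ticks, flat_values, fill=None):
    """Rebuild a 2-d nested list; missing tick combinations get `fill`."""
    row_ticks, col_ticks = [], []
    for row_tick, col_tick in flat_ticks:
        if row_tick not in row_ticks:
            row_ticks.append(row_tick)
        if col_tick not in col_ticks:
            col_ticks.append(col_tick)
    lookup = dict(zip(flat_ticks, flat_values))
    data = [[lookup.get((r, c), fill) for c in col_ticks] for r in row_ticks]
    return (row_ticks, col_ticks), data


axis, ticks, values = flatten_2d(
    [[1, 2], [3, 4]],
    ("time", "distance"),
    (["t1", "t2"], ["d1", "d2"]),
)
```

Dropping the last flattened element and calling `unflatten_2d(ticks[:3], values[:3], fill=0)` rebuilds a 2 by 2 result in which the missing ('t2', 'd2') cell holds the fill value.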

Skipper Seabold

3 p.m.

On Wed, Jul 21, 2010 at 5:48 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 2:32 PM, Rob Speer <rspeer@mit.edu> wrote:

I agree with the idea that axis labels must be strings.

Yes, this is the opposite of my position on tick labels ("names"), but there's a reason: ticks are often defined by whatever data you happen to be working with, but axis labels will in the vast majority of situations be defined by the programmer as they're writing the code. If the programmer wants to name something, they'll certainly be able to do so with a string.

What started the discussion was that someone wanted to have more than one label name for one axis. So I suggested that if we allow any hashable object as an axis label then, for example, a tuple could be used to hold multiple names.

That would also allow a datarray to be flattened to 1d since the axis labels could be combined into a tuple. So a 2d datarray with axis names "time" and "distance" and ticks 't1', 't2' and 'd1', 'd2' could flatten to

axis --> ('time', 'distance')
ticks --> [('t1', 'd1'), ('t1', 'd2'), ('t2', 'd1'), ('t2', 'd2')]

An unflatten function along with a fill value could unflatten the datarray.

I'm not doing the work, so really whatever works for people, but

In [1]: '_'.join(('time','distance'))
Out[1]: 'time_distance'

would also work in this case, though I guess we get into trouble for unflatten when individual axis labels have an underscore in them.

Skipper
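The underscore problem is easy to reproduce. The two helpers below are hypothetical and only illustrate the round trip of joining axis labels into one string and splitting them back.

```python
def join_labels(labels, sep="_"):
    """Combine several axis labels into one string label."""
    return sep.join(labels)


def split_label(joined, sep="_"):
    """Naive inverse of join_labels: split on the separator."""
    return tuple(joined.split(sep))


# Round trip works when no label contains the separator.
ok = split_label(join_labels(("time", "distance")))

# Round trip fails when a label already contains an underscore.
bad = split_label(join_labels(("start_time", "distance")))
```

The first round trip recovers the original two labels. The second returns three pieces, so the original labels cannot be recovered without extra information, which a tuple of labels would carry for free.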

Vincent Davis

11:41 a.m.

On Wed, Jul 21, 2010 at 11:08 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier <jsalvati@u.washington.edu> wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

    if dar.intaslabel:
        dar2 = dar[tickmap(i)]
    else:
        dar2 = dar[i]

Obviously there are several aspects of labels that need to be considered. An important one is whether an operation breaks the meaning of the labels. I like the idea of tickmap(i); it could have a lot of features like grouping... Maybe it could even work on structured arrays (maybe). The flag could connect the tickmap() to the array. Then if an operation performed on the array would result in the labels no longer being meaningful, the flag would change. In this way tickmap(i) checks the flags, and each axis could have a flag. (I am sure there is lots I am missing.)

Vincent

Keith Goodman

noon

On Wed, Jul 21, 2010 at 11:41 AM, Vincent Davis <vincent@vincentdavis.net> wrote:

On Wed, Jul 21, 2010 at 11:08 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier <jsalvati@u.washington.edu> wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

    if dar.intaslabel:
        dar2 = dar[tickmap(i)]
    else:
        dar2 = dar[i]

Obviously there are several aspects of labels that need to be considered. An important one is whether an operation breaks the meaning of the labels. I like the idea of tickmap(i); it could have a lot of features like grouping... Maybe it could even work on structured arrays (maybe). The flag could connect the tickmap() to the array. Then if an operation performed on the array would result in the labels no longer being meaningful, the flag would change. In this way tickmap(i) checks the flags, and each axis could have a flag. (I am sure there is lots I am missing.)

Each axis currently has a tick map: Axis._tick_dict. From a datarray you can access it like this:

    >>> from datarray import DataArray
    >>> x = DataArray([1,2,3], labels=[('time', ['n1', 'n2', 'n3'])])
    >>> x.axis.time._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}
    >>> x.axes[0]._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}
    >>> x.axis['time']._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}

On a separate note, I think having both axis and axes attributes is confusing. Would it be possible to have only one of them? Here's a proposal: http://github.com/fperez/datarray/commit/01b2d3d2082ade38ec89dbca0c070dd4fc9...

Vincent Davis

12:14 p.m.

On Wed, Jul 21, 2010 at 1:00 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 11:41 AM, Vincent Davis <vincent@vincentdavis.net> wrote:

On Wed, Jul 21, 2010 at 11:08 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

On Wed, Jul 21, 2010 at 9:56 AM, John Salvatier <jsalvati@u.washington.edu> wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

It's an interesting idea. My guess is that you'd end up having to check the attribute all the time when writing code:

    if dar.intaslabel:
        dar2 = dar[tickmap(i)]
    else:
        dar2 = dar[i]

Obviously there are several aspects of labels that need to be considered. An important one is whether an operation breaks the meaning of the labels. I like the idea of tickmap(i); it could have a lot of features like grouping... Maybe it could even work on structured arrays (maybe). The flag could connect the tickmap() to the array. Then if an operation performed on the array would result in the labels no longer being meaningful, the flag would change. In this way tickmap(i) checks the flags, and each axis could have a flag. (I am sure there is lots I am missing.)

Each axis currently has a tick map: Axis._tick_dict. From a datarray you can access it like this:

    >>> from datarray import DataArray
    >>> x = DataArray([1,2,3], labels=[('time', ['n1', 'n2', 'n3'])])
    >>> x.axis.time._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}
    >>> x.axes[0]._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}
    >>> x.axis['time']._tick_dict
    {'n1': 0, 'n2': 1, 'n3': 2}

I was thinking of a more universal tickmap(): something where I can define the map/labels, i.e. labels=[('time', ['n1', 'n2', 'n3'])], and apply it to any array of the right size. That is, I don't make a special array (DataArray). Then the problem becomes knowing whether the labels are still valid after other functions operate on the array. So this is where I was thinking of having a flag (an array attribute?). This could just be left to the user. My only point here is that a universal tickmap() function might be nice.

Vincent
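One way such a standalone tickmap() could look, as a rough sketch only: `make_tickmap` and its keyword interface are assumptions made for illustration and do not exist in datarray. The returned function translates tick labels into a positional index tuple that can be applied to any array of matching shape, and leaves the validity question to the caller by exposing the expected shape.

```python
def make_tickmap(labels):
    """Build a label-to-position translator from (axis_name, ticks) pairs.

    `labels` has one pair per dimension, e.g. [('time', ['n1', 'n2', 'n3'])].
    """
    names = [name for name, _ in labels]
    maps = [{tick: pos for pos, tick in enumerate(ticks)} for _, ticks in labels]

    def tickmap(**ticks_by_axis):
        """Translate axis_name=tick keywords into a positional index tuple."""
        index = [slice(None)] * len(names)  # unspecified axes are kept whole
        for name, tick in ticks_by_axis.items():
            dim = names.index(name)
            index[dim] = maps[dim][tick]
        return tuple(index)

    # Shape the labels were defined for; the caller can compare it with
    # the shape of an array to decide whether the labels still apply.
    tickmap.shape = tuple(len(m) for m in maps)
    return tickmap


tm = make_tickmap([("time", ["n1", "n2", "n3"])])
position = tm(time="n2")
```

For a plain NumPy array `arr` of shape `tm.shape`, `arr[tm(time="n2")]` would select the element labeled 'n2'. Whether the labels remain meaningful after an operation such as a sort or a reshape is exactly the open question raised above, and this sketch does not try to answer it.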

On a separate note, I think having both axis and axes attributes is confusing. Would it be possible to have only one of them? Here's a proposal: http://github.com/fperez/datarray/commit/01b2d3d2082ade38ec89dbca0c070dd4fc9...

Bruce Southey

10:41 a.m.

On 07/21/2010 11:56 AM, John Salvatier wrote:

I don't really know much about this topic, but what about a flag at array creation time (or whenever you define labels) that says whether valid indexes will be treated as labels or indexes for that array?

On Wed, Jul 21, 2010 at 9:37 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

About a dozen people attended what was billed as a continuation of the SciPy 2010 datarray BoF. We met at UC Berkeley on July 19 as part of the py4science series.

A datarray is a subclass of a Numpy array that adds the ability to label the axes and to label the elements along each axis.

We spent most of the time discussing how to index with tick labels. The main issue is with integers: is an integer index a tick name or a position index?

At the top level, datarrays always use regular Numpy indexing: an int is a position, never a label. So darr[0] always returns the first element of the datarray.

The ambiguity occurs in specialized indexing methods that allow indexing by tick label name (because the name could be an int). To break the ambiguity, the proposal was to provide several tick indexing methods[1]:

1. Integers are always labels
2. Integers are never treated as labels
3. Try 1, then 2

We also discussed allowing axis labels to be any hashable object (currently only strings are allowed). The main problem: integers. Currently if an axis is labeled, say, "time", you can do darr.sum(axis="time"). What happens when an axis is labeled with an int? What does the 2 in darr.sum(axis=2) refer to? A position or a label? The same problem exists for floats since a float is (currently) a valid axis for Numpy arrays.

References: [1] http://github.com/fperez/datarray/commit/3c5151baa233675b355058eb3ba028d2629...

The currently implemented option of allowing only strings is the only practical option, and I think that most other related languages also impose this constraint. Otherwise we will effectively break compatibility with Python and numpy because darr[0] can result in different answers depending on the type of object involved, especially if you are using views and forget the actual object type.

I do think that we have to avoid adding complexity that increases runtime, like looking for the label 2 when it should be the second axis. Also, we have to avoid situations that lead to input errors, like flag values or extra arguments.

Bruce

Keith Goodman

10:50 a.m.

On Wed, Jul 21, 2010 at 10:41 AM, Bruce Southey <bsouthey@gmail.com> wrote:

The current implemented option of allowing strings is the only practical option and I think that most other related languages also impose this constraint. Otherwise we will effectively break compatibility with Python and numpy because darr[0] can result in different answers depending on the type of object involved - especially if you are using views and forget the actual object type.

There are axis labels (currently string only) and there are tick labels (currently anything but int). darr[0] will always return the first element of the datarray. No indexing by tick label is allowed. To index by tick label you'd have to use a special method like darr.lix[...]. Would that be OK?
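The separation between plain indexing and a dedicated label indexer can be sketched as follows. Only the attribute name `lix` comes from the message above; the two classes are hypothetical stand-ins for a 1-d labeled container and are not the datarray implementation.

```python
class LabelIndexer:
    """Hypothetical helper: indexes its owner by tick label only."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, label):
        # Translate the tick label to a position, then index normally.
        return self._owner.values[self._owner.tick_dict[label]]


class Labeled1D:
    """Hypothetical 1-d container: [] is positional, .lix[] is by label."""

    def __init__(self, values, ticks):
        self.values = list(values)
        self.tick_dict = {tick: pos for pos, tick in enumerate(ticks)}
        self.lix = LabelIndexer(self)

    def __getitem__(self, position):
        # Plain indexing never consults the tick labels.
        return self.values[position]


d = Labeled1D([10, 20, 30], ticks=["a", "b", "c"])
first = d[0]          # positional
by_label = d.lix["c"]  # by tick label
```

Because the two kinds of lookup go through different entry points, `d[0]` keeps its usual meaning regardless of what the ticks are.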

I do think that we do have to avoid adding complexity that increases runtime like looking for the label 2 when it should be the second axis. Also we have to avoid situations that lead to input errors like flag values or extra arguments.

It's nice to make things general by allowing any hashable object to label an axis. But I agree with you that we have to watch the cost of doing so.

Nothing was decided as final at the meeting. We only discussed options.
