Enum/Factor NEP (now with code)

Hi all,

It has been some time, but I do have an update regarding this proposed feature. I thought it would be helpful to flesh out some parts of a possible implementation to learn what can be spelled reasonably in NumPy. Mark Wiebe helped out greatly in navigating the NumPy codebase. Here is a link to my branch with this code:

https://github.com/bryevdv/numpy/tree/enum

and the updated NEP:

https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst

Not everything in the NEP is implemented (integral levels and natural naming in particular) and some parts definitely need more fleshing out. However, things currently work basically as described in the NEP, and there is also a small set of tests that demonstrate current usage. A few things will crash Python (astype especially). More tests are needed. I would appreciate as much feedback and discussion as you can provide!

Thanks,

Bryan Van de Ven

On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
It has been some time, but I do have an update regarding this proposed feature. I thought it would be helpful to flesh out some parts of a possible implementation to learn what can be spelled reasonably in NumPy. Mark Wiebe helped out greatly in navigating the NumPy codebase. Here is a link to my branch with this code:
https://github.com/bryevdv/numpy/tree/enum
and the updated NEP:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
Not everything in the NEP is implemented (integral levels and natural naming in particular) and some parts definitely need more fleshing out. However, things currently work basically as described in the NEP, and there is also a small set of tests that demonstrate current usage. A few things will crash python (astype especially). More tests are needed. I would appreciate as much feedback and discussion as you can provide!
Hi Bryan,

I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:
subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...

With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH","LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in in each file. The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does. levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a"))) -> "a", "b".)

I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.

Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to , and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?

I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?

Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-): https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J and partly because numpy has a very diverse set of users and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.

I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.

I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.

Hope that helps -- it's exciting to see someone working on this, and you seem to be off to a good start!

-N
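A minimal sketch of that deterministic ordering in plain NumPy (np.unique returns its unique values sorted, so both columns end up with the same level set and directly comparable codes):

import numpy as np

# Same categories, different order of first appearance.
initial_skill = ["LOW", "HIGH", "LOW", "HIGH"]
skill_after_training = ["HIGH", "HIGH", "LOW", "LOW"]

# np.unique sorts its result, so the derived level sets are identical
# regardless of the order the values happened to appear in each column.
levels_a, codes_a = np.unique(initial_skill, return_inverse=True)
levels_b, codes_b = np.unique(skill_after_training, return_inverse=True)

print(levels_a)   # ['HIGH' 'LOW']
print(levels_b)   # ['HIGH' 'LOW']
print(codes_a)    # [1 0 1 0] -- comparable with codes_b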

On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven<bryanv@continuum.io> wrote:
Hi all,
It has been some time, but I do have an update regarding this proposed feature. I thought it would be helpful to flesh out some parts of a possible implementation to learn what can be spelled reasonably in NumPy. Mark Wiebe helped out greatly in navigating the NumPy codebase. Here is a link to my branch with this code:
https://github.com/bryevdv/numpy/tree/enum
and the updated NEP:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
Not everything in the NEP is implemented (integral levels and natural naming in particular) and some parts definitely need more fleshing out. However, things currently work basically as described in the NEP, and there is also a small set of tests that demonstrate current usage. A few things will crash python (astype especially). More tests are needed. I would appreciate as much feedback and discussion as you can provide!
Hi Bryan,
I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:
subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...
With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH","LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in in each file. The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does. levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a"))) -> "a", "b".)
I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would be to cryptographically hash the enum string. I'd trust 64 bits of hash for this purpose. The obvious disadvantage is the extra space used, but it'd be a bit more hassle-free compared to regular enums; you'd never have to fix the set of enum strings and they'd always be directly comparable across different arrays. HDF libraries etc. could compress it at the storage layer, storing the enum mapping in the metadata. Just a thought. Dag
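A rough sketch of the hashing idea (the hash64 helper below is hypothetical; any 64-bit truncation of a cryptographic digest would do):

import hashlib
import numpy as np

def hash64(label):
    # Hypothetical helper: first 8 bytes of a SHA-1 digest as a uint64 code.
    digest = hashlib.sha1(label.encode("utf-8")).digest()
    return np.frombuffer(digest[:8], dtype=np.uint64)[0]

a = np.array([hash64(s) for s in ["LOW", "HIGH", "LOW"]], dtype=np.uint64)
b = np.array([hash64(s) for s in ["HIGH", "LOW"]], dtype=np.uint64)

# Codes from different arrays compare directly, with no pre-declared level
# set; the code -> string mapping would live in metadata (e.g. in HDF5).
print(a[1] == b[0])   # True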
Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to , and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?
I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?
Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-): https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J and partly because numpy has a very diverse set of users and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.
I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.
Hope that helps -- it's exciting to see someone working on this, and you seem to be off to a good start!
-N

On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would be to cryptographically hash the enum string. I'd trust 64 bits of hash for this purpose.
The obvious disadvantage is the extra space used, but it'd be a bit more hassle-free compared to regular enums; you'd never have to fix the set of enum strings and they'd always be directly comparable across different arrays. HDF libraries etc. could compress it at the storage layer, storing the enum mapping in the metadata.
You'd trust 64 bits to be collision-free for all strings ever stored in numpy, eternally? I wouldn't. Anyway, if the goal is to store an arbitrary set of strings in 64 bits apiece, then there is no downside to just using an object array + interning (like pandas does now), and this *is* guaranteed to be collision free. Maybe it would be useful to have a "heap string" dtype, but that'd be something different. AFAIK all the cases where an explicit categorical type adds value over this are the ones where having an explicit set of levels is useful. Representing HDF5 enums or R factors requires a way to specify arbitrary string<->integer mappings, and there are algorithms (e.g. in charlton) that are much more efficient if they can figure out what the set of possible levels is directly without scanning the whole array. -N
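For comparison, a minimal sketch of the object-array-plus-interning approach in plain Python/NumPy (not pandas' internals):

import sys
import numpy as np

raw = ["LOW", "HIGH", "LOW", "HIGH", "LOW"]

# Interning makes every occurrence of a given string the same object:
# storage is one pointer per element plus one copy per distinct string,
# and it is collision-free by construction.
arr = np.array([sys.intern(s) for s in raw], dtype=object)

print(arr[0] is arr[2])       # True -- same interned object
print((arr == "LOW").sum())   # 3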

Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would be to cryptographically hash the enum string. I'd trust 64 bits of hash for this purpose.
The obvious disadvantage is the extra space used, but it'd be a bit more hassle-free compared to regular enums; you'd never have to fix the set of enum strings and they'd always be directly comparable across different arrays. HDF libraries etc. could compress it at the storage layer, storing the enum mapping in the metadata.
You'd trust 64 bits to be collision-free for all strings ever stored in numpy, eternally? I wouldn't. Anyway, if the goal is to store an arbitrary set of strings in 64 bits apiece, then there is no downside to just using an object array + interning (like pandas does now), and this *is* guaranteed to be collision free. Maybe it would be useful to have a "heap string" dtype, but that'd be something different.
Heh, we've been having this discussion before :-) The 'interned heap string dtype' may be something different, but it could be something that could meet the 'open enum' usecases (assuming they exist) in a better way than making enums complicated. Consider it a backup strategy if one can't put the open enum idea dead otherwise..
AFAIK all the cases where an explicit categorical type adds value over this are the ones where having an explicit set of levels is useful. Representing HDF5 enums or R factors requires a way to specify arbitrary string<->integer mappings, and there are algorithms (e.g. in charlton) that are much more efficient if they can figure out what the set of possible levels is directly without scanning the whole array.
For interned strings, the set of strings present could be stored in the array in principle (though I guess it would be very difficult to implement in current numpy). The perfect hash schemes we've explored on the Cython list lately use around 10-20 microseconds on my 1.8 GHz machine for a 64-element table rehashing (worst case insertion, which happens more often than insertion in regular hash tables) and 0.5-2 nanoseconds for a lookup in L1 (which always hits on the first try if the entry is in the table).

Dag

On 6/13/12 8:33 AM, Nathaniel Smith wrote:
Hi Bryan,
I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
I'm not quite sure what happened there, my git skills are not advanced by any measure. I think the datetime changes are a much smaller fraction than fifty percent, but I will see what I can do to separate them out in the near future.
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:
subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...
With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH","LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some
I imagine users using the same open dtype object in both fields of the structure dtype used to read in the file, if both fields of the file contain the same categories. If they don't contain the same categories, they are incomparable in any case. I believe many users have this simpler use case where each field is a separate category, and they want to read them all individually, separately on the fly. For these simple cases, it would "just work". For your case example there would definitely be a documentation, examples, tutorials, education issue, to avoid the "gotcha" you describe.
nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in in each file. The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does. levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a"))) -> "a", "b".)
A solution is to create the dtype object when reading in the first file, and to reuse that same dtype object when reading in subsequent files. Perhaps it's not ideal, but it does enable the work to be done.
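For concreteness, a sketch of that workflow using the prototype's np.factor spelling from the branch (the exact API may still change):

import numpy as np

# Build the dtype once, with an explicit, fixed level order...
skill = np.factor(['HIGH', 'LOW'])

# ...then reuse the same dtype object for every column and every file,
# so codes and comparisons stay consistent across all of them.
initial = np.array(['LOW', 'HIGH', 'LOW'], skill)
after   = np.array(['HIGH', 'HIGH', 'LOW'], skill)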
Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to , and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?
I suppose I agree with the sentiment that the core of NumPy really ought to be less dependent on the Python C API, not more. I also think the khash API is pretty dead simple and straightforward, and the fact that it is contained in a single header is attractive. It's also quite performant in time and space. But if others disagree strongly, all of its uses are hidden behind the interface in leveled_dtypes.c, so it could be replaced with some other mechanism easily enough.
I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?
I'm not certain they are worthwhile either, which is why I did not spend any time on them yet. Wes has expressed a desire for very broad categorical types (even more than just scalar categories), hopefully he can chime in with his motivations.
Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-): https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J and partly because numpy has a very diverse set of users and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.
I think I like "categorical" over "factor" but I am not sure we should ditch "enum". There are two different use cases here: I have a pile of strings (or scalars) that I want to treat as discrete things (categories), and: I have a pile of numbers that I want to give convenient or meaningful names to (enums). This latter case was the motivation for possibly adding "Natural Naming".
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.
This could certainly be improved, I agree.
I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.
Francesc is certainly aware of this work, and I emailed Wes earlier this week, I probably should have mentioned that, though. Hopefully they will have time to contribute their thoughts. I also imagine Travis can speak on behalf of the users he has interacted with over the last several years that have requested this feature that don't happen to follow mailing lists. Thanks, Bryan

On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
On 6/13/12 8:33 AM, Nathaniel Smith wrote:
Hi Bryan,
I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
I'm not quite sure what happened there, my git skills are not advanced by any measure. I think the datetime changes are a much smaller fraction than fifty percent, but I will see what I can do to separate them out in the near future.
Looking again, it looks like a lot of it is actually because when I asked github to show me the diff between your branch and master, it showed me the diff between your branch and *your repository's* version of master. And your branch is actually based off a newer version of 'master' than you have in your repository. So, as far as git and github are concerned, all those changes that are included in your-branch's-base-master but not in your-repo's-master are new stuff that you did on your branch. Solution is just to do git push <your github remote name> master
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:
subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...
With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH","LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some
I imagine users using the same open dtype object in both fields of the structure dtype used to read in the file, if both fields of the file contain the same categories. If they don't contain the same categories, they are incomparable in any case. I believe many users have this simpler use case where each field is a separate category, and they want to read them all individually, separately on the fly. For these simple cases, it would "just work". For your case example there would definitely be a documentation, examples, tutorials, education issue, to avoid the "gotcha" you describe.
Yes, of course we *could* write the code to implement these "open" dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.
nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in in each file. The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does. levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a"))) -> "a", "b".)
A solution is to create the dtype object when reading in the first file, and to reuse that same dtype object when reading in subsequent files. Perhaps it's not ideal, but it does enable the work to be done.
So would a proper implementation of np.fromfile that normalized the level ordering.
Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to , and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?
I suppose I agree with the sentiment that the core of NumPy really ought to be less dependent on the Python C API, not more. I also think the khash API is pretty dead simple and straightforward, and the fact that it is contained in a single header is attractive. It's also quite performant in time and space. But if others disagree strongly, all of its uses are hidden behind the interface in leveled_dtypes.c, so it could be replaced with some other mechanism easily enough.
I'm not at all convinced by the argument that throwing in random redundant data types into NumPy will somehow reduce our dependence on the Python C API. If you have a plan to replace *all* use of dicts in numpy with khash, then we can talk about that, I guess. But that would be a separate patch, and I don't think using PyDict in this patch would really have any effect on how difficult that separate patch was to do. PyDict also has a very simple API -- and in fact, the comparison is between the PyDict API+the khash API, versus just the PyDict API alone, since everyone working with the Python C API already has to know how that works. It's also contained in effectively zero header files, which is even more attractive than one header file. And that interface in leveled_dtypes.c is the one that I was talking about being larger than my entire categorical dtype implementation. None of this means that using it is a bad idea, of course! Maybe it has some key advantage over PyDict in terms of memory use or something, for those people who have hundreds of thousands of distinct categories in their data, I don't know. But all your arguments here seem to be of the form "hey, it's not *that* bad", and it seems like there must be some actual affirmative advantages it has over PyDict if it's going to be worth using.
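For scale, a sketch of the bookkeeping a level mapping needs at the Python level (illustrative only; the branch keeps this in C):

class Levels(object):
    """Illustrative only: string <-> code mapping backed by a plain dict."""
    def __init__(self, levels):
        self.levels = list(levels)                               # code -> string
        self.codes = dict((s, i) for i, s in enumerate(levels))  # string -> code

    def encode(self, values):
        return [self.codes[v] for v in values]

    def decode(self, codes):
        return [self.levels[c] for c in codes]

lv = Levels(["HIGH", "LOW"])
print(lv.encode(["LOW", "HIGH", "LOW"]))   # [1, 0, 1]
print(lv.decode([1, 0, 1]))                # ['LOW', 'HIGH', 'LOW']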
I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?
I'm not certain they are worthwhile either, which is why I did not spend any time on them yet. Wes has expressed a desire for very broad categorical types (even more than just scalar categories), hopefully he can chime in with his motivations.
Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-): https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J and partly because numpy has a very diverse set of users and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.
I think I like "categorical" over "factor" but I am not sure we should ditch "enum". There are two different use cases here: I have a pile of strings (or scalars) that I want to treat as discrete things (categories), and: I have a pile of numbers that I want to give convenient or meaningful names to (enums). This latter case was the motivation for possibly adding "Natural Naming".
So mention the word "enum" in the documentation, so people looking for that will find the categorical data support? :-)
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.
This could certainly be improved, I agree.
I don't want to be Mr. Grumpypants here, but I do want to make sure we're speaking the same language: what "-1" means is "I consider this a show-stopper and will oppose merging any code that does not improve on this". (Of course you also always have the option of trying to change my mind. Even Mr. Grumpypants can be swayed by logic!)
I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.
Francesc is certainly aware of this work, and I emailed Wes earlier this week, I probably should have mentioned that, though. Hopefully they will have time to contribute their thoughts. I also imagine Travis can speak on behalf of the users he has interacted with over the last several years that have requested this feature that don't happen to follow mailing lists.
I'm glad Francesc and Wes are aware of the work, but my point was that that isn't enough. So if I were in your position and hoping to get this code merged, I'd be trying to figure out how to get them more actively on board? -N

On Wed, Jun 13, 2012 at 2:12 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
On 6/13/12 8:33 AM, Nathaniel Smith wrote:
Hi Bryan,
I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum It was a bit hard to read since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
I'm not quite sure what happened there, my git skills are not advanced by any measure. I think the datetime changes are a much smaller fraction than fifty percent, but I will see what I can do to separate them out in the near future.
Looking again, it looks like a lot of it is actually because when I asked github to show me the diff between your branch and master, it showed me the diff between your branch and *your repository's* version of master. And your branch is actually based off a newer version of 'master' than you have in your repository. So, as far as git and github are concerned, all those changes that are included in your-branch's-base-master but not in your-repo's-master are new stuff that you did on your branch. Solution is just to do git push <your github remote name> master
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(. The use case, as I understand it, is for when you want to extend the levels set on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:
subject,initial_skill,skill_after_training
1,LOW,HIGH
2,LOW,LOW
3,HIGH,HIGH
...
With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH","LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some
I imagine users using the same open dtype object in both fields of the structure dtype used to read in the file, if both fields of the file contain the same categories. If they don't contain the same categories, they are incomparable in any case. I believe many users have this simpler use case where each field is a separate category, and they want to read them all individually, separately on the fly. For these simple cases, it would "just work". For your case example there would definitely be a documentation, examples, tutorials, education issue, to avoid the "gotcha" you describe.
Yes, of course we *could* write the code to implement these "open" dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.
nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in in each file. The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does. levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a"))) -> "a", "b".)
A solution is to create the dtype object when reading in the first file, and to reuse that same dtype object when reading in subsequent files. Perhaps it's not ideal, but it does enable the work to be done.
So would a proper implementation of np.fromfile that normalized the level ordering.
Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- like it seems like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to , and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?
I suppose I agree with the sentiment that the core of NumPy really ought to be less dependent on the Python C API, not more. I also think the khash API is pretty dead simple and straightforward, and the fact that it is contained in a single header is attractive. It's also quite performant in time and space. But if others disagree strongly, all of its uses are hidden behind the interface in leveled_dtypes.c, so it could be replaced with some other mechanism easily enough.
I'm not at all convinced by the argument that throwing in random redundant data types into NumPy will somehow reduce our dependence on the Python C API. If you have a plan to replace *all* use of dicts in numpy with khash, then we can talk about that, I guess. But that would be a separate patch, and I don't think using PyDict in this patch would really have any effect on how difficult that separate patch was to do.
PyDict also has a very simple API -- and in fact, the comparison is between the PyDict API+the khash API, versus just the PyDict API alone, since everyone working with the Python C API already has to know how that works. It's also contained in effectively zero header files, which is even more attractive than one header file. And that interface in leveled_dtypes.c is the one that I was talking about being larger than my entire categorical dtype implementation.
None of this means that using it is a bad idea, of course! Maybe it has some key advantage over PyDict in terms of memory use or something, for those people who have hundreds of thousands of distinct categories in their data, I don't know. But all your arguments here seem to be of the form "hey, it's not *that* bad", and it seems like there must be some actual affirmative advantages it has over PyDict if it's going to be worth using.
I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?
I'm not certain they are worthwhile either, which is why I did not spend any time on them yet. Wes has expressed a desire for very broad categorical types (even more than just scalar categories), hopefully he can chime in with his motivations.
Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-): https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J and partly because numpy has a very diverse set of users and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.
I think I like "categorical" over "factor" but I am not sure we should ditch "enum". There are two different use cases here: I have a pile of strings (or scalars) that I want to treat as discrete things (categories), and: I have a pile of numbers that I want to give convenient or meaningful names to (enums). This latter case was the motivation for possibly adding "Natural Naming".
So mention the word "enum" in the documentation, so people looking for that will find the categorical data support? :-)
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.
This could certainly be improved, I agree.
I don't want to be Mr. Grumpypants here, but I do want to make sure we're speaking the same language: what "-1" means is "I consider this a show-stopper and will oppose merging any code that does not improve on this". (Of course you also always have the option of trying to change my mind. Even Mr. Grumpypants can be swayed by logic!)
I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.
Francesc is certainly aware of this work, and I emailed Wes earlier this week, I probably should have mentioned that, though. Hopefully they will have time to contribute their thoughts. I also imagine Travis can speak on behalf of the users he has interacted with over the last several years that have requested this feature that don't happen to follow mailing lists.
I'm glad Francesc and Wes are aware of the work, but my point was that that isn't enough. So if I were in your position and hoping to get this code merged, I'd be trying to figure out how to get them more actively on board?
-N
OK, I need to spend some time on this as it will directly impact me. Random thoughts here.

It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object (note, I'm going to change the name of the Factor class to Categorical before 0.8.0 final per discussion with Nathaniel):

In [2]: Factor.from_array(np.random.randint(0, 10, 100))
Out[2]:
Factor:
array([6, 6, 4, 2, 1, 2, 3, 5, 1, 5, 2, 9, 2, 8, 8, 1, 5, 2, 6, 9, 2, 1, 3, 6, 4, 4, 8, 1, 3, 1, 7, 9, 6, 4, 8, 0, 2, 9, 6, 2, 0, 6, 7, 5, 1, 7, 8, 2, 7, 9, 7, 6, 5, 8, 3, 9, 4, 5, 0, 1, 4, 1, 8, 8, 6, 8, 0, 2, 2, 7, 0, 9, 9, 9, 4, 6, 4, 1, 8, 6, 3, 3, 2, 5, 3, 9, 9, 0, 0, 7, 2, 1, 6, 0, 7, 6, 6, 0, 7, 5])
Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: Factor.from_array(np.random.randint(0, 10, 100)).levels
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: Factor.from_array(np.random.randint(0, 10, 100)).labels
Out[4]:
array([0, 4, 3, 6, 6, 0, 6, 2, 2, 6, 2, 4, 7, 4, 1, 8, 1, 4, 8, 6, 4, 5, 6, 4, 8, 3, 9, 5, 3, 0, 4, 2, 7, 0, 1, 8, 0, 7, 8, 6, 5, 6, 1, 6, 2, 7, 8, 5, 7, 5, 1, 5, 0, 5, 6, 5, 5, 4, 0, 3, 3, 8, 5, 1, 1, 2, 6, 7, 7, 1, 6, 6, 4, 4, 8, 2, 1, 7, 8, 3, 7, 8, 1, 5, 0, 6, 9, 9, 9, 5, 7, 3, 1, 2, 0, 1, 5, 6, 4, 5])

The API for constructing an enum/factor/categorical array from fixed levels and an array of labels seems somewhat weak to me. A very common scenario is to need to construct a factor from an array of integers with an associated array of levels:

In [13]: labels
Out[13]:
array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4])

In [14]: levels
Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]: Factor(labels, levels)
Out[15]:
Factor:
array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4])
Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations.

Are the levels ordered (Nathaniel brought this up already looks like)? It doesn't look like it. That is also necessary. You also need to be able to sort the levels (which is a relabeling, I have lots of code in use for this).

In the context of groupby in pandas, when processing a key (array of values) to a factor to be used for aggregating some data, you have the option of returning an object that has the levels as observed in the data or sorting. Sorting can obviously be very expensive depending on the number of groups in the data (http://wesmckinney.com/blog/?p=437).
Example:

from pandas import DataFrame
from pandas.util.testing import rands
import numpy as np

df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10, 'data' : np.random.randn(1000000)})

In [32]: timeit df.groupby('key').sum()
1 loops, best of 3: 374 ms per loop

In [33]: timeit df.groupby('key', sort=False).sum()
10 loops, best of 3: 185 ms per loop

The "factorization time" for the `key` column dominates the runtime; the factor is computed once then reused if you keep the GroupBy object around:

In [36]: timeit grouped.sum()
100 loops, best of 3: 6.05 ms per loop

As another example of why ordered factors matter, consider a quantile cut (google for the "cut" function in R) function I wrote recently:

In [40]: arr = Series(np.random.randn(1000000))

In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [43]: arr.groupby(cats).describe().unstack(0)
Out[43]:
       (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
count    250000.000000      250000.000000     250000.000000   250000.000000
mean         -1.270623          -0.323092          0.326325        1.271519
std           0.491317           0.193254          0.193044        0.490611
min          -4.839798          -0.673224          0.001992        0.677177
25%          -1.533021          -0.487450          0.158736        0.888502
50%          -1.150136          -0.317501          0.320352        1.150480
75%          -0.887974          -0.155197          0.490456        1.534709
max          -0.673224           0.001990          0.677176        4.913536

If you don't have ordered levels, then the quantiles might come out in the wrong order depending on how the strings sort or fall out of the hash table.

Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict's. The Python memory allocator is not problematic-- if you create millions of Python objects, expect the RAM usage of the Python process to balloon absurdly.

Anyway, this is exciting work assuming we get the API right and hit all the use cases. On top of all this I am _very_ performance sensitive so you'll have to be pretty aggressive with benchmarking things. I have concerns about ceding control over critical functionality that I need for pandas (which has become a large and very important library these days for a lot of people), but as long as the pieces in NumPy are suitably mature and robust for me to switch to them eventually that would be great.

I'll do my best to stay involved in the discussion, though I'm juggling a lot of things these days (e.g. I have the PyData book deadline approaching like a freight train).

- Wes
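A rough sketch of the factorize-and-relabel behavior described above, as a hypothetical helper in plain NumPy (NA coded as -1 as in pandas; this is not pandas' actual implementation):

import numpy as np

def factorize(values, sort=False, na_sentinel=-1):
    # Codes in first-seen order by default; missing values (None) get the sentinel.
    levels, labels, seen = [], [], {}
    for v in values:
        if v is None:
            labels.append(na_sentinel)
            continue
        if v not in seen:
            seen[v] = len(levels)
            levels.append(v)
        labels.append(seen[v])
    labels = np.asarray(labels, dtype=np.int64)
    levels = np.asarray(levels, dtype=object)
    if sort:
        # Sorting the levels is just a relabeling of the codes.
        order = np.argsort(levels)
        remap = np.empty(len(levels), dtype=np.int64)
        remap[order] = np.arange(len(levels))
        labels = np.where(labels == na_sentinel, na_sentinel, remap[labels])
        levels = levels[order]
    return labels, levels

labels, levels = factorize(["b", "a", None, "b"], sort=True)
print(labels)   # [ 1  0 -1  1]
print(levels)   # ['a' 'b']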

On 6/13/12 1:54 PM, Wes McKinney wrote:
OK, I need to spend some time on this as it will directly impact me. Random thoughts here.
It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object (note, I'm going to change the name of the Factor class to Categorical before 0.8.0 final per discussion with Nathaniel):
The current for-discussion prototype only supports strings. I had mentioned integral levels in the NEP but wanted to get more feedback first. It looks like you are using intervals as levels in things like qcut? This would add some complexity. I can think of a couple of possible approaches; I will have to try a few of them out to see what would make the most sense.
The API for constructing an enum/factor/categorical array from fixed levels and an array of labels seems somewhat weak to me. A very common scenario is to need to construct a factor from an array of integers with an associated array of levels:
In [13]: labels Out[13]: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4])
In [14]: levels Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [15]: Factor(labels, levels) Out[15]: Factor: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4]) Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
I originally had a very similar interface in the NEP. I was persuaded by Mark that this would be redundant:

In [10]: levels = np.factor(['a', 'b', 'c'])   # or: levels = np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 2, 'a': 0, 'b': 1})')

This should also spell even more closely to your example as:

labels.astype(levels)

but I have not done much with casting yet, so this currently complains. However, would this satisfy your needs (modulo the separate question about more general integral or object levels)?
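For what it's worth, until casting works, the labels-plus-levels construction can be approximated by materializing the level values first (same prototype spelling as above; a workaround sketch only, not the eventual astype path):

import numpy as np

level_names = ['a', 'b', 'c']
labels = [1, 1, 0, 2, 2]                     # integer codes

levels = np.factor(level_names)              # prototype dtype, levels in list order
values = [level_names[i] for i in labels]    # materialize the level strings
arr = np.array(values, levels)               # 'b', 'b', 'a', 'c', 'c' as a factor array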
What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations.

I am just using INT32_MIN at the moment.

Are the levels ordered (Nathaniel brought this up already looks like)? It doesn't look like it. That is also necessary. You also need to be
They currently compare based on their value:

In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], np.factor({'c':0, 'b':1, 'a':2}))
In [21]: arr
Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 0, 'a': 2, 'b': 1})')
In [22]: arr.sort()
In [23]: arr
Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'], dtype='factor({'c': 0, 'a': 2, 'b': 1})')
able to sort the levels (which is a relabeling, I have lots of code in use for this). In the context of groupby in pandas, when processing a key (array of values) to a factor to be used for aggregating some data, you have the option of returning an object that has the levels as observed in the data or sorting. Sorting can obviously be very expensive depending on the number of groups in the data (http://wesmckinney.com/blog/?p=437). Example:
from pandas import DataFrame
from pandas.util.testing import rands
import numpy as np
df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10, 'data' : np.random.randn(1000000)})
In [32]: timeit df.groupby('key').sum() 1 loops, best of 3: 374 ms per loop
In [33]: timeit df.groupby('key', sort=False).sum() 10 loops, best of 3: 185 ms per loop
The "factorization time" for the `key` column dominates the runtime; the factor is computed once then reused if you keep the GroupBy object around:
In [36]: timeit grouped.sum()
100 loops, best of 3: 6.05 ms per loop

Just some numbers for comparison. Factorization times:

In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [42]: levels = np.factor(lets)
In [43]: data = [lets[int(x)] for x in np.random.randn(1000000)]
In [44]: %timeit np.array(data, levels)
10 loops, best of 3: 137 ms per loop

And retrieving group indices/summing:

In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per loop
In [10]: vals = np.random.randn(1000000)
In [20]: inds = [arr==x for x in lets]
In [23]: %timeit for ind in inds: vals[ind].sum()
10 loops, best of 3: 48.3 ms per loop

On my laptop your grouped.sum() took 22ms, so this is roughly off by about a factor of two. But we should compare it on the same hardware, and with the same level data types. There is no doubt room for improvement, though. It would not be too bad to add some groupby functionality on top of this. I still need to add a mechanism for accessing and iterating over the levels.

As another example of why ordered factors matter, consider a quantile cut (google for the "cut" function in R) function I wrote recently:

In [40]: arr = Series(np.random.randn(1000000))

In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [43]: arr.groupby(cats).describe().unstack(0)
Out[43]:
       (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
count    250000.000000      250000.000000     250000.000000   250000.000000
mean         -1.270623          -0.323092          0.326325        1.271519
std           0.491317           0.193254          0.193044        0.490611
min          -4.839798          -0.673224          0.001992        0.677177
25%          -1.533021          -0.487450          0.158736        0.888502
50%          -1.150136          -0.317501          0.320352        1.150480
75%          -0.887974          -0.155197          0.490456        1.534709
max          -0.673224           0.001990          0.677176        4.913536

If you don't have ordered levels, then the quantiles might come out in the wrong order depending on how the strings sort or fall out of the hash table.

We do have ordered levels. :) Now, there's currently no way to get a list of the levels, in order, but that should be trivial to add.
Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict. The Python memory allocator is not problematic-- if you create millions of Python objects expect the RAM usage of the Python process to balloon absurdly.
Anyway, this is exciting work assuming we get the API right and hitting all the use cases. On top of all this I am _very_ performance sensitive so you'll have to be pretty aggressive with benchmarking things. I have concerns about ceding control over critical functionality that I need for pandas (which has become a large and very important library these days for a lot of people), but as long as the pieces in NumPy are suitably mature and robust for me to switch to them eventually that would be great.
I'll do my best to stay involved in the discussion, though I'm juggling a lot of things these days (e.g. I have the PyData book deadline approaching like a freight train).
- Wes

On Wed, Jun 13, 2012 at 5:19 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
On 6/13/12 1:54 PM, Wes McKinney wrote:
OK, I need to spend some time on this as it will directly impact me. Random thoughts here.
It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object (note, I'm going to change the name of the Factor class to Categorical before 0.8.0 final per discussion with Nathaniel):
The current for-discussion prototype only supports strings. I had mentioned integral levels in the NEP but wanted to get more feedback first. It looks like you are using intervals as levels in things like qcut? This would add some complexity. I can think of a couple of possible approaches; I will have to try a few of them out to see what would make the most sense.
The API for constructing an enum/factor/categorical array from fixed levels and an array of labels seems somewhat weak to me. A very common scenario is to need to construct a factor from an array of integers with an associated array of levels:
In [13]: labels Out[13]: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4])
In [14]: levels Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [15]: Factor(labels, levels) Out[15]: Factor: array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7, 1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8, 0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1, 0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8, 7, 9, 7, 3, 3, 0, 4, 4]) Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
I originally had a very similar interface in the NEP. I was persuaded by Mark that this would be redundant:
In [10]: levels = np.factor(['a', 'b', 'c'])   # or: levels = np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 2, 'a': 0, 'b': 1})')
This should also spell even more closely to your example as:
labels.astype(levels)
but I have not done much with casting yet, so this currently complains. However, would this satisfy your needs (modulo the separate question about more general integral or object levels)?
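Just to spell out the relationship between labels and levels that such a cast relies on, here is the mapping in plain NumPy as it exists today (this is not the proposed dtype API, only an illustration with made-up data):

    import numpy as np

    levels = np.array(['a', 'b', 'c'])      # level i corresponds to integer label i
    labels = np.array([1, 1, 0, 2, 2, 0])   # dense integer codes

    # Fancy indexing expands the codes back into level values, which is the
    # relationship a labels -> factor-dtype cast would have to preserve.
    values = levels[labels]                 # array(['b', 'b', 'a', 'c', 'c', 'a'], ...)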
What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations. I am just using INT32_MIN at the moment. Are the levels ordered (it looks like Nathaniel brought this up already)? It doesn't look like it. That is also necessary.
They currently compare based on their value:
In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], np.factor({'c':0, 'b':1, 'a':2}))
In [21]: arr
Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], dtype='factor({'c': 0, 'a': 2, 'b': 1})')
In [22]: arr.sort()
In [23]: arr
Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'], dtype='factor({'c': 0, 'a': 2, 'b': 1})')
You also need to be able to sort the levels (which is a relabeling; I have lots of code in use for this). In the context of groupby in pandas, when processing a key (an array of values) into a factor to be used for aggregating some data, you have the option of returning an object that has the levels as observed in the data, or of sorting. Sorting can obviously be very expensive depending on the number of groups in the data (http://wesmckinney.com/blog/?p=437). Example:
from pandas import DataFrame
from pandas.util.testing import rands
import numpy as np
df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10, 'data' : np.random.randn(1000000)})
In [32]: timeit df.groupby('key').sum() 1 loops, best of 3: 374 ms per loop
In [33]: timeit df.groupby('key', sort=False).sum() 10 loops, best of 3: 185 ms per loop
The "factorization time" for the `key` column dominates the runtime; the factor is computed once then reused if you keep the GroupBy object around:
In [36]: timeit grouped.sum()
100 loops, best of 3: 6.05 ms per loop

Just some numbers for comparison. Factorization times:
In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [42]: levels = np.factor(lets)
In [43]: data = [lets[int(x)] for x in np.random.randn(1000000)]
In [44]: %timeit np.array(data, levels)
10 loops, best of 3: 137 ms per loop
And retrieving group indices/summing:
In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per loop
In [10]: vals = np.random.randn(1000000)
In [20]: inds = [arr==x for x in lets]
In [23]: %timeit for ind in inds: vals[ind].sum()
10 loops, best of 3: 48.3 ms per loop
(FYI you're comparing an O(NK) algorithm with an O(N) algorithm for small K)
On my laptop your grouped.sum() took 22 ms, so this is off by roughly a factor of two. But we should compare on the same hardware, and with the same level data types. There is no doubt room for improvement, though.
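To make the O(NK)-vs-O(N) point concrete, here is a minimal sketch (my own illustration, not pandas' actual groupby implementation) of computing the same group sums from integer codes in a single pass, instead of one boolean-mask pass per level:

    import numpy as np

    lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    data = [lets[int(x)] for x in np.random.randn(1000000)]
    vals = np.random.randn(1000000)

    # O(N) factorization: sorted levels plus one integer code per element.
    levels, codes = np.unique(data, return_inverse=True)

    # O(N) group sums: a single weighted bincount over the codes,
    # rather than K boolean-mask passes over the data.
    group_sums = np.bincount(codes, weights=vals, minlength=len(levels))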
It would not be too bad to add some groupby functionality on top of this. I still need to add a mechanism for accessing and iterating over the levels.
As another example of why ordered factors matter, consider a quantile cut (google for the "cut" function in R) function I wrote recently:
In [40]: arr = Series(np.random.randn(1000000))
In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])
In [43]: arr.groupby(cats).describe().unstack(0)
Out[43]:
       (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
count    250000.000000      250000.000000     250000.000000   250000.000000
mean         -1.270623          -0.323092          0.326325        1.271519
std           0.491317           0.193254          0.193044        0.490611
min          -4.839798          -0.673224          0.001992        0.677177
25%          -1.533021          -0.487450          0.158736        0.888502
50%          -1.150136          -0.317501          0.320352        1.150480
75%          -0.887974          -0.155197          0.490456        1.534709
max          -0.673224           0.001990          0.677176        4.913536
If you don't have ordered levels, then the quantiles might come out in the wrong order depending on how the strings sort or fall out of the hash table.

We do have ordered levels. :) Now, there's currently no way to get a list of the levels, in order, but that should be trivial to add.

On 6/13/12 5:11 PM, Wes McKinney wrote:
And retrieving group indices/summing:

In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per loop
In [10]: vals = np.random.randn(1000000)
In [20]: inds = [arr==x for x in lets]
In [23]: %timeit for ind in inds: vals[ind].sum()
10 loops, best of 3: 48.3 ms per loop

(FYI you're comparing an O(NK) algorithm with an O(N) algorithm for small K)
I am not familiar with the details of your groupby implementation (evidently!); consider me appropriately chastised. Bryan

On Wed, Jun 13, 2012 at 8:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict. The Python memory allocator is not problematic-- if you create millions of Python objects expect the RAM usage of the Python process to balloon absurdly.
The other big reason to consider allowing khash (or some other hash implementation) within numpy is that you can use it without the GIL.

On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object
It seems like there are three obvious options, from most to least general:

1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogenous collection of objects of any arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

I agree that (3) is a bit limiting. (1) is probably easier to implement than (2). (2) is the most general, since of course "arbitrary Python object" is a dtype. Is it useful to be able to restrict levels to be of homogenous type? The main difference between dtypes and python types is that (most) dtype scalars can be unboxed -- is that substantively useful for levels?
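A small runnable illustration (plain NumPy as it exists today, not the proposed factor API) of the boxing difference between options (1) and (2):

    import numpy as np

    # Option (1): levels as arbitrary hashable Python objects -- every element is boxed.
    obj_levels = np.empty(2, dtype=object)
    obj_levels[:] = [('a', 1), ('b', 2)]        # e.g. tuples as levels

    # Option (2): levels as a homogeneous array of some dtype -- values are stored unboxed.
    dt_levels = np.array(['2012-01-01', '2012-02-01'], dtype='datetime64[D]')

    print(type(obj_levels[0]))   # a boxed Python tuple
    print(type(dt_levels[0]))    # a numpy.datetime64 scalar backed by unboxed storage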
What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations.
If we have a type restriction on levels (options (2) or (3) above), then how to handle out-of-bounds values is quite a problem, yeah. Once we have NA dtypes then I suppose we could use those, but we don't yet. It's tempting to just error out of any operation that encounters such values.
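For concreteness, a minimal sketch of the sentinel approach Wes describes (missing values coded as -1 in the integer labels and masked out before reducing); this is plain NumPy, not the proposed dtype:

    import numpy as np

    levels = np.array(['LOW', 'HIGH'])
    codes = np.array([0, 1, -1, 1, 0, -1])    # -1 marks a missing (NA) level
    vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

    valid = codes >= 0
    # Group sums over the non-missing entries only.
    sums = np.bincount(codes[valid], weights=vals[valid], minlength=len(levels))
    # sums -> array([ 6.,  6.])   # 'LOW' gets 1+5, 'HIGH' gets 2+4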
Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict. The Python memory allocator is not problematic-- if you create millions of Python objects expect the RAM usage of the Python process to balloon absurdly.
Right, I saw that posting -- it's clear that khash has a lot of advantages as internal temporary storage for a specific operation like groupby on unboxed types. But I can't tell whether those arguments still apply now that we're talking about a long-term storage representation for data that has to support a variety of operations (many of which would require boxing/unboxing, since the API is in Python), might or might not use boxed types, etc. Obviously this also depends on which of the three options above we go with -- unboxing doesn't even make sense for option (1). -n

On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object
It seems like there are three obvious options, from most to least general:
1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogenous collection of objects of any arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)
I agree that (3) is a bit limiting. (1) is probably easier to implement than (2). (2) is the most general, since of course "arbitrary Python object" is a dtype. Is it useful to be able to restrict levels to be of homogenous type? The main difference between dtypes and python types is that (most) dtype scalars can be unboxed -- is that substantively useful for levels?
What is the story for NA values (NaL?) in a factor array? I code them as -1 in the labels, though you could use INT32_MAX or something. This is very important in the context of groupby operations.
If we have a type restriction on levels (options (2) or (3) above), then how to handle out-of-bounds values is quite a problem, yeah. Once we have NA dtypes then I suppose we could use those, but we don't yet. It's tempting to just error out of any operation that encounters such values.
Nathaniel: my experience (see blog posting above for a bit more) is that khash really crushes PyDict for two reasons: you can use it with primitive types and avoid boxing, and secondly you can preallocate. Its memory footprint with large hashtables is also a fraction of PyDict. The Python memory allocator is not problematic-- if you create millions of Python objects expect the RAM usage of the Python process to balloon absurdly.
Right, I saw that posting -- it's clear that khash has a lot of advantages as internal temporary storage for a specific operation like groupby on unboxed types. But I can't tell whether those arguments still apply now that we're talking about a long-term storage representation for data that has to support a variety of operations (many of which would require boxing/unboxing, since the API is in Python), might or might not use boxed types, etc. Obviously this also depends on which of the three options above we go with -- unboxing doesn't even make sense for option (1).
-n
I'm in favor of option #2 (a lite version of what I'm doing currently -- I handle a few dtypes: PyObject, int64, datetime64, float64), though you'd have to go the code-generation route for all the dtypes to keep yourself sane if you do that. - Wes

On Sun, Jun 17, 2012 at 9:04 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
It looks like the levels can only be strings. This is too limited for my needs. Why not support all possible NumPy dtypes? In pandas world, the levels can be any unique Index object
It seems like there are three obvious options, from most to least general:
1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogenous collection of objects of any arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)
I agree that (3) is a bit limiting. (1) is probably easier to implement than (2). (2) is the most general, since of course "arbitrary Python object" is a dtype. Is it useful to be able to restrict levels to be of homogenous type? The main difference between dtypes and python types is that (most) dtype scalars can be unboxed -- is that substantively useful for levels? [...]
I'm in favor of option #2 (a lite version of what I'm doing currently -- I handle a few dtypes: PyObject, int64, datetime64, float64), though you'd have to go the code-generation route for all the dtypes to keep yourself sane if you do that.
Why would you do code generation? dtypes already expose a generic API for doing boxing/unboxing/etc. Are you thinking this would just be too slow, or...? -N

On 6/13/12 1:12 PM, Nathaniel Smith wrote:
your-branch's-base-master but not in your-repo's-master are new stuff that you did on your branch. Solution is just to do git push <your github remote name> master
Fixed, thanks.
Yes, of course we *could* write the code to implement these "open" dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.
[snip] So would a proper implementation of np.fromfile that normalized the level ordering.
My understanding of the impetus for the open type was sensitivity to the performance of having to make two passes over large text datasets. We'll have to get more feedback from users here and input from Travis, I think.
categories in their data, I don't know. But all your arguments here seem to be of the form "hey, it's not *that* bad", and it seems like there must be some actual affirmative advantages it has over PyDict if it's going to be worth using.
I should have been more specific about the performance concerns. Wes summed them up, though: better space efficiency, and not having to box/unbox native types.
I think I like "categorical" over "factor" but I am not sure we should ditch "enum". There are two different use cases here: I have a pile of strings (or scalars) that I want to treat as discrete things (categories), and: I have a pile of numbers that I want to give convenient or meaningful names to (enums). This latter case was the motivation for possibly adding "Natural Naming". So mention the word "enum" in the documentation, so people looking for that will find the categorical data support? :-)
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this. This could certainly be improved, I agree. I don't want to be Mr. Grumpypants here, but I do want to make sure we're speaking the same language: what "-1" means is "I consider this a show-stopper and will oppose merging any code that does not improve on this". (Of course you also always have the option of trying to change my mind. Even Mr. Grumpypants can be swayed by logic!) Well, a few comments. The special case in array_richcompare is due to
I'm not sure I follow. Natural Naming seems like a great idea for people that want something like an actual enum (i.e., a way to avoid magic numbers). We could even imagine some nice with-hacks: colors = enum(['red', 'green', 'blue') with colors: foo.fill(red) bar.fill(blue) But natural naming will not work with many category names ("VERY HIGH") if they have spaces, etc. So, we could add a parameter to factor(...) that turns on and off natural naming for a dtype object when it is created: colors = factor(['red', 'green', 'blue'], closed=True, natural_naming=False) vs colors = enum(['red', 'green', 'blue']) I think the latter is better, not only because it is more parsimonious, but because it also expresses intent better. Or we can just not have natural naming at all, if no one wants it. It hasn't been implemented yet, so that would be a snap. :) Hopefully we'll get more feedback from the list. the lack of string ufuncs. I think it would be great to have string ufuncs, but I also think it is a separate concern and outside the scope of this proposal. The special case in arraydescr_typename_get is for the same reason as datetime special case, the need to access dtype metadata. I don't think you are really concerned about these two, though? That leaves the special case in PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chaffed a bit when I put that in. On the other hand, having dtypes with this extent of attached metadata, and potentially dynamic metadata, is unique in NumPy. It was simple and straightforward to add those few lines of code, and does not affect performance. How invasive will the changes to core ufunc machinery be to accommodate a type like this more generally? I took the easy way because I was new to the numpy codebase and did not feel confident mucking with the central ufunc code. However, maybe the dispatch can be accomplished easily with the casting machinery. I am not so sure, I will have to investigate. Of course, I welcome input, suggestions, and proposals on the best way to improve this.
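Since natural naming has not been implemented anywhere yet, here is one possible pure-Python sketch of the idea (level names exposed as attributes so code can avoid magic numbers); the class below is purely illustrative and is not part of the prototype:

    class Enum(object):
        """Illustrative only: map level names to integer values and expose
        them as attributes, so code can say colors.red instead of 0."""
        def __init__(self, names):
            self._levels = dict((name, i) for i, name in enumerate(names))
            for name, value in self._levels.items():
                # Names with spaces (e.g. "VERY HIGH") cannot become attributes,
                # which is exactly the limitation discussed above.
                setattr(self, name, value)

    colors = Enum(['red', 'green', 'blue'])
    colors.red    # -> 0
    colors.blue   # -> 2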
I'm glad Francesc and Wes are aware of the work, but my point was that that isn't enough. So if I were in your position and hoping to get this code merged, I'd be trying to figure out how to get them more actively on board?
Is there some other way besides responding to and attempting to accommodate technical needs? Bryan

On 06/14/2012 12:06 AM, Bryan Van de Ven wrote:
On 6/13/12 1:12 PM, Nathaniel Smith wrote:
your-branch's-base-master but not in your-repo's-master are new stuff that you did on your branch. Solution is just to do git push <your github remote name> master
Fixed, thanks.
Yes, of course we *could* write the code to implement these "open" dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.
[snip] So would a proper implementation of np.fromfile that normalized the level ordering.
My understanding of the impetus for the open type was sensitivity to the performance of having to make two passes over large text datasets. We'll have to get more feedback from users here and input from Travis, I think.
Can't you just build up the array using uint8, collecting enum values in a separate dict, and then recast the array with the final enum in the end? Or, recast the array with a new enum type every time one wants to add an enum value? (Similar to how you append to a tuple...) (Yes, normalizing level ordering requires another pass through the parsed data array, but that's unavoidable and rather orthogonal to whether one has an open enum dtype API or not.)

A mutable dtype gives me the creeps. dtypes currently implement __hash__ and __eq__ and can be used as dict keys, which I think is very valuable. Making them sometimes mutable would cause confusing situations. There are cases for mutable objects that become immutable, but it should be very well motivated, as it makes for a much more confusing API... Dag

On Wed, Jun 13, 2012 at 11:06 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
On 6/13/12 1:12 PM, Nathaniel Smith wrote:
Yes, of course we *could* write the code to implement these "open" dtypes, and then write the documentation, examples, tutorials, etc. to help people work around their limitations. Or, we could just implement np.fromfile properly, which would require no workarounds and take less code to boot.
[snip] So would a proper implementation of np.fromfile that normalized the level ordering.
My understanding of the impetus for the open type was sensitivity to the performance of having to make two passes over large text datasets. We'll have to get more feedback from users here and input from Travis, I think.
You definitely don't want to make two passes over large text datasets, but that's not required. While reading through the data, you keep a dict mapping levels to integer values, which you assign arbitrarily as new levels are encountered, and an integer array holding the integer value for each line of the file. Then at the end of the file, you sort the levels, figure out what the proper integer value for each level is, and do a single in-memory pass through your array, swapping each integer value for the new correct integer value. Since your original integer values are assigned densely, you can map the old integers to the new integers using a single array lookup. This is going to be much faster than any text file reader. There may be some rare people who have huge data files, fast storage, a very large number of distinct levels, and don't care about normalizing level order. But I really think the default should be to normalize level ordering, and then once you can do that, it's trivial to add a "don't normalize please" option for anyone who wants it.
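Here is a rough sketch of the single-pass scheme described above (dense provisional codes assigned in encounter order, then the levels sorted and the codes remapped with one vectorized lookup); the function and variable names are mine, not anything in the branch or in np.fromfile:

    import numpy as np

    def read_categorical(tokens):
        # Single pass over the input: assign dense provisional codes
        # in order of first appearance.
        seen = {}
        codes = np.empty(len(tokens), dtype=np.intp)
        for i, tok in enumerate(tokens):
            codes[i] = seen.setdefault(tok, len(seen))
        # Normalize: sort the levels and build an old-code -> new-code table.
        levels = sorted(seen)
        final = dict((lev, j) for j, lev in enumerate(levels))
        remap = np.empty(len(seen), dtype=np.intp)
        for lev, old in seen.items():
            remap[old] = final[lev]
        # One in-memory pass swaps provisional codes for the final ones.
        return np.array(levels), remap[codes]

    levels, codes = read_categorical(['LOW', 'HIGH', 'LOW', 'HIGH'])
    # levels -> array(['HIGH', 'LOW'], ...), codes -> array([1, 0, 1, 0])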
I think I like "categorical" over "factor" but I am not sure we should ditch "enum". There are two different use cases here: I have a pile of strings (or scalars) that I want to treat as discrete things (categories), and: I have a pile of numbers that I want to give convenient or meaningful names to (enums). This latter case was the motivation for possibly adding "Natural Naming". So mention the word "enum" in the documentation, so people looking for that will find the categorical data support? :-)
I'm not sure I follow.
So the above discussion was just about what to name things, and I was saying that we don't need to use the word "enum" in the API itself, whatever the design ends up looking like. That said, I am not personally sold on the idea of using these things in enum-like roles. There are already tons of "enum" libraries on PyPI (I linked some of them in the last thread on this), and I don't see how this design could handle all the basic use cases for enums. Flag bits are one of the most common enums, after all, but red|green is just NaL. So I'm +0 on just sticking to categorical data.
Natural Naming seems like a great idea for people that want something like an actual enum (i.e., a way to avoid magic numbers). We could even imagine some nice with-hacks:
    colors = enum(['red', 'green', 'blue'])
    with colors:
        foo.fill(red)
        bar.fill(blue)
FYI you can't really do this with a context manager. This is the closest I managed: https://gist.github.com/2347382 and you'll note that it still requires reaching up the stack and directly rewriting the C fields of a PyFrameObject while it is in the middle of executing... this is surprisingly less horrible than it sounds, but that still leaves a lot of room for horribleness.
I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this. This could certainly be improved, I agree. I don't want to be Mr. Grumpypants here, but I do want to make sure we're speaking the same language: what "-1" means is "I consider this a show-stopper and will oppose merging any code that does not improve on this". (Of course you also always have the option of trying to change my mind. Even Mr. Grumpypants can be swayed by logic!) Well, a few comments. The special case in array_richcompare is due to the lack of string ufuncs. I think it would be great to have string ufuncs, but I also think it is a separate concern and outside the scope of this proposal. The special case in arraydescr_typename_get is for the same reason as datetime special case, the need to access dtype metadata. I don't think you are really concerned about these two, though?
That leaves the special case in PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit when I put that in. On the other hand, having dtypes with this extent of attached metadata, and potentially dynamic metadata, is unique in NumPy. It was simple and straightforward to add those few lines of code, and does not affect performance. How invasive will the changes to core ufunc machinery be to accommodate a type like this more generally? I took the easy way because I was new to the numpy codebase and did not feel confident mucking with the central ufunc code. However, maybe the dispatch can be accomplished easily with the casting machinery. I am not so sure, I will have to investigate. Of course, I welcome input, suggestions, and proposals on the best way to improve this.
I haven't gone back and looked over all the special cases in detail, but my general point is that ufuncs need to be able to access dtype metadata, and the fact that we're now talking about hard-coding special case workarounds for this for a third dtype is pretty compelling evidence of that. We'd already have full-fledged third-party categorical dtypes if they didn't need special cases in numpy. So I think we should fix the root problem instead of continuing to paper over it. We're not talking about a major re-architecting of numpy or anything. -n

On 6/13/12 8:12 PM, Nathaniel Smith wrote:
I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged to numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.

Francesc is certainly aware of this work, and I emailed Wes earlier this week; I probably should have mentioned that, though. Hopefully they will have time to contribute their thoughts. I also imagine Travis can speak on behalf of the users he has interacted with over the last several years who have requested this feature but don't happen to follow mailing lists.

I'm glad Francesc and Wes are aware of the work, but my point was that that isn't enough. So if I were in your position and hoping to get this code merged, I'd be trying to figure out how to get them more actively on board?
Sorry to chime in late. Yes, I am aware of the improvements that Bryan (and Mark) are proposing. My position here is that I'm very open to this (at least from a functional point of view; I have to admit that I have not had a look at the code).

The current situation for the HDF5 wrappers (at least the PyTables ones) is that, due to the lack of support for enums in NumPy itself, we had to come up with a specific solution for this. Our approach was pretty simple: basically providing an exhaustive set or list of possible, named values for different integers. And although I'm not familiar with the implementation details (it was Ivan Vilata who implemented this part), I think we used an internal dictionary for doing the translation while PyTables presents the enums to the user.

Bryan is implementing much more complete (and probably more efficient) support for enums in NumPy. As this is new functionality, and PyTables does not rely on it, there is no immediate danger (i.e., a backward incompatibility) in introducing the new enums into NumPy. But they could be used by future PyTables versions (and other HDF5 wrappers), which is a good thing indeed. My 2 cents, -- Francesc Alted
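For the record, the dictionary-translation idea Francesc describes can be sketched in a few lines of plain Python; this is only an illustration of the concept, not PyTables' actual implementation:

    # A name <-> integer translation table, roughly the PyTables-style enum idea:
    # the file stores plain integers, the wrapper shows the user the names.
    names_to_values = {'red': 4, 'green': 2, 'blue': 7}
    values_to_names = dict((v, k) for k, v in names_to_values.items())

    stored = [names_to_values['green'], names_to_values['red']]   # what goes into the HDF5 file
    shown = [values_to_names[v] for v in stored]                  # what the user sees
    # shown -> ['green', 'red']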
participants (6)
- Bryan Van de Ven
- Dag Sverre Seljebotn
- Francesc Alted
- Nathaniel Smith
- Thouis (Ray) Jones
- Wes McKinney