[Numpy-discussion] Enum/Factor NEP (now with code)

Wed Jun 13 18:06:29 EDT 2012

On 6/13/12 1:12 PM, Nathaniel Smith wrote:
> your-branch's-base-master but not in your-repo's-master are new stuff
> that you did on your branch. Solution is just to do
>    git push<your github remote name>  master

Fixed, thanks.

> Yes, of course we *could* write the code to implement these "open"
> dtypes, and then write the documentation, examples, tutorials, etc. to
> help people work around their limitations. Or, we could just implement
> np.fromfile properly, which would require no workarounds and take less
> code to boot.
>
> [snip]
> So would a proper implementation of np.fromfile that normalized the
> level ordering.

My understanding of the impetus for the open type was sensitivity to the 
performance of having to make two passes over large text datasets. We'll 
have to get more feedback from users here and input from Travis, I think.
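
To make the trade-off concrete, here is a minimal pure-Python sketch of the two strategies. The names encode_closed and encode_open are hypothetical and are not part of the NEP or the branch. They only illustrate why a closed set of levels needs one pass to collect the levels before encoding, while an open set can assign codes as values arrive:

```python
def encode_closed(values):
    """Two passes: collect the levels first, then encode against them."""
    values = list(values)          # the data must be held or re-read
    levels = sorted(set(values))   # pass 1: discover every level
    index = {level: code for code, level in enumerate(levels)}
    codes = [index[value] for value in values]   # pass 2: encode
    return levels, codes


def encode_open(values):
    """One pass: hand out the next free code when a new level appears."""
    index = {}
    codes = []
    for value in values:
        # len(index) is only stored if `value` has not been seen before.
        codes.append(index.setdefault(value, len(index)))
    levels = list(index)           # insertion order (Python 3.7+)
    return levels, codes


data = ["b", "a", "b", "c", "a"]
closed_levels, closed_codes = encode_closed(data)
open_levels, open_codes = encode_open(data)
```

Note that in the one-pass version the level ordering depends on the order in which values are encountered, which is the ordering issue raised above.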

> categories in their data, I don't know. But all your arguments here
> seem to be of the form "hey, it's not *that* bad", and it seems like
> there must be some actual affirmative advantages it has over PyDict if
> it's going to be worth using.

I should have been more specific about the performance concerns. Wes 
summed them up, though: better space efficiency, and not having to 
box/unbox native types.
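
As a rough illustration of the space argument, the sketch below compares an object array of Python strings with a levels-plus-integer-codes layout built from existing NumPy calls. It is not the proposed dtype, only an approximation of its storage; the int8 code width is an assumption that holds here because there are only three levels:

```python
import numpy as np

values = ["low", "high", "medium", "low", "high"] * 1000

# Boxed representation: an object array stores one pointer per element,
# each referring to a separate Python str object.
boxed = np.array(values, dtype=object)

# Categorical-style representation: the distinct levels stored once,
# plus one small integer code per element.
levels, codes = np.unique(values, return_inverse=True)
codes = codes.astype(np.int8)   # three levels fit in one byte each

# Decoding is plain integer indexing into the levels array.
decoded = levels[codes]

# nbytes of the object array counts only the pointers, not the str
# objects they refer to, so this comparison understates the difference.
print(boxed.nbytes, codes.nbytes + levels.nbytes)
```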

>> I think I like "categorical" over "factor" but I am not sure we should
>> ditch "enum". There are two different use cases here: I have a pile of
>> strings (or scalars) that I want to treat as discrete things
>> (categories), and: I have a pile of numbers that I want to give
>> convenient or meaningful names to (enums). This latter case was the
>> motivation for possibly adding "Natural Naming".
> So mention the word "enum" in the documentation, so people looking for
> that will find the categorical data support? :-)

I'm not sure I follow. Natural Naming seems like a great idea for people
who want something like an actual enum (i.e., a way to avoid magic
numbers). We could even imagine some nice with-hacks:

     colors = enum(['red', 'green', 'blue'])
     with colors:
         foo.fill(red)
         bar.fill(blue)

But natural naming will not work for category names that are not valid
identifiers, such as "VERY HIGH" with its space. So we could add a
parameter to factor(...) that turns natural naming on or off for a dtype
object when it is created:

     colors = factor(['red', 'green', 'blue'], closed=True, natural_naming=False)

vs

     colors = enum(['red', 'green', 'blue'])

I think the latter is better, not only because it is more parsimonious, 
but because it also expresses intent better. Or we can just not have 
natural naming at all, if no one wants it. It hasn't been implemented 
yet, so that would be a snap. :) Hopefully we'll get more feedback from 
the list.
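
For the sake of discussion, here is a small sketch of what natural naming could look like at the Python level. The Levels class and its natural_naming flag are hypothetical stand-ins, not the implementation in the branch (natural naming has not been implemented). The sketch only shows which names can be exposed as attributes and which cannot:

```python
import keyword


class Levels:
    """Hypothetical container that exposes level names as attributes."""

    def __init__(self, names, natural_naming=True):
        self.names = list(names)
        self._codes = {name: code for code, name in enumerate(self.names)}
        if natural_naming:
            for name, code in self._codes.items():
                # Only valid, non-keyword identifiers can be reached with
                # attribute syntax; "VERY HIGH" cannot. Existing attributes
                # are never overwritten.
                usable = name.isidentifier() and not keyword.iskeyword(name)
                if usable and not hasattr(self, name):
                    setattr(self, name, code)

    def __getitem__(self, name):
        # Lookup by string works for every name, whatever it looks like.
        return self._codes[name]


colors = Levels(["red", "green", "blue"])
ratings = Levels(["LOW", "VERY HIGH"])
```

Here colors.red and colors["red"] both give the code for 'red', while "VERY HIGH" is only reachable as ratings["VERY HIGH"].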

>>> I'm disturbed to see you adding special cases to the core ufunc
>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>> up the generic ufunc machinery so that it doesn't need special cases
>>> to handle adding a simple type like this.
>> This could certainly be improved, I agree.
> I don't want to be Mr. Grumpypants here, but I do want to make sure
> we're speaking the same language: what "-1" means is "I consider this
> a show-stopper and will oppose merging any code that does not improve
> on this". (Of course you also always have the option of trying to
> change my mind. Even Mr. Grumpypants can be swayed by logic!)

Well, a few comments. The special case in array_richcompare is due to
the lack of string ufuncs. I think it would be great to have string
ufuncs, but I also think it is a separate concern and outside the scope
of this proposal. The special case in arraydescr_typename_get is there
for the same reason as the datetime special case: the need to access
dtype metadata. I don't think you are really concerned about these two,
though?

That leaves the special case in
PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit
when I put that in. On the other hand, having dtypes with this extent of
attached metadata, and potentially dynamic metadata, is unique in NumPy. 
It was simple and straightforward to add those few lines of code, and 
does not affect performance. How invasive will the changes to core ufunc 
machinery be to accommodate a type like this more generally? I took the 
easy way because I was new to the numpy codebase and did not feel 
confident mucking with the central ufunc code. However, maybe the 
dispatch can be accomplished easily with the casting machinery. I am not
sure; I will have to investigate. Of course, I welcome input,
suggestions, and proposals on the best way to improve this.

>> I'm glad Francesc and Wes are aware of the work, but my point was that
>> that isn't enough. So if I were in your position and hoping to get
>> this code merged, I'd be trying to figure out how to get them more
>> actively on board?

Is there some other way besides responding to and attempting to 
accommodate technical needs?

Bryan