[Numpy-discussion] Enum/Factor NEP (now with code)

Wed Jun 13 12:44:14 EDT 2012

On 6/13/12 8:33 AM, Nathaniel Smith wrote:
> Hi Bryan,
>
> I skimmed over the diff:
>     https://github.com/bryevdv/numpy/compare/master...enum
> It was a bit hard to read since it seems like about half the changes
> in that branch are datetime cleanups or something? I hope you'll
> separate those out -- it's much easier to review self-contained
> changes, and the more changes you roll together into a big lump, the
> more risk there is that they'll get lost all together.

I'm not quite sure what happened there; my git skills are not advanced
by any measure. I think the datetime changes are a much smaller fraction
than fifty percent, but I will see what I can do to separate them out in
the near future.

>  From the updated NEP I actually understand the use case for "open
> types" now, so that's good :-). But I don't think they're actually
> workable, so that's bad :-(. The use case, as I understand it, is for
> when you want to extend the levels set on the fly as you read through
> a file. The problem with this is that it produces a non-deterministic
> level ordering, where level 0 is whatever was seen first in the file,
> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
> I read in:
>
>      subject,initial_skill,skill_after_training
>      1,LOW,HIGH
>      2,LOW,LOW
>      3,HIGH,HIGH
>      ...
>
> With the scheme described in the NEP, my initial_skill dtype will have
> levels ["LOW", "HIGH"], and my skill_after_training dtype will have
> levels ["HIGH","LOW"], which means that their storage will be
> incompatible, comparisons won't work (or will have to go through some

I imagine users would use the same open dtype object for both fields of
the structured dtype used to read in the file, if both fields of the
file contain the same categories. If they don't contain the same
categories, they are incomparable in any case. I believe many users have
the simpler use case where each field is a separate category, and they
want to read them all individually, on the fly. For these simple cases,
it would "just work". For your example, there would definitely be a
documentation, examples, tutorials, and education issue to avoid the
"gotcha" you describe.
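
To make the difference concrete, here is a minimal sketch in plain
Python. It is not the NEP implementation or API: `encode_open` is a
hypothetical helper that only mimics first-seen level assignment, and
the column data is taken from the CSV example quoted above.

```python
def encode_open(values, levels=None):
    """Encode values as integer codes, appending unseen values to `levels`
    in first-seen order (a stand-in for an "open" categorical type)."""
    if levels is None:
        levels = []
    index = {name: code for code, name in enumerate(levels)}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(levels)
            levels.append(value)
        codes.append(index[value])
    return codes, levels


initial_skill = ["LOW", "LOW", "HIGH"]
skill_after_training = ["HIGH", "LOW", "HIGH"]

# Separate level tables: the same label gets a different code per column.
codes_a, levels_a = encode_open(initial_skill)
codes_b, levels_b = encode_open(skill_after_training)

# One shared level table: codes are directly comparable across columns.
shared_levels = []
shared_a, _ = encode_open(initial_skill, shared_levels)
shared_b, _ = encode_open(skill_after_training, shared_levels)
```

With separate tables the two columns end up with levels in opposite
orders; with the shared table both columns use one mapping, so comparing
codes is the same as comparing labels.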

> nasty convert-to-string-and-back path), etc. Another situation where
> this will occur is if you have multiple data files in the same format;
> whether or not you're able to compare the data from them will depend
> on the order the data happens to occur in in each file. The solution
> is that whenever we automagically create a set of levels from some
> data, and the user hasn't specified any order, we should pick an order
> deterministically by sorting the levels. (This is also what R does.
> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
> ->  "a", "b".)

A solution is to create the dtype object when reading in the first file, 
and to reuse that same dtype object when reading in subsequent files. 
Perhaps it's not ideal, but it does enable the work to be done.
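
For comparison, the sorting approach suggested above can be sketched in
plain Python as well. Again this is only an illustration under stated
assumptions: `encode_sorted` is a hypothetical helper, not part of the
NEP or of NumPy, and the two lists stand in for two data files.

```python
def encode_sorted(values):
    """Encode values as integer codes using sorted unique values as levels."""
    levels = sorted(set(values))
    index = {name: code for code, name in enumerate(levels)}
    return [index[value] for value in values], levels


# Two "files" containing the same labels in a different order.
file_one = ["b", "a", "b"]
file_two = ["a", "b", "a"]

codes_one, levels_one = encode_sorted(file_one)
codes_two, levels_two = encode_sorted(file_two)
```

Both inputs produce the same level list regardless of the order in which
the labels appear. Note that this only holds when both inputs contain
the same set of labels; if one file has a label the other lacks, the
codes can still differ, which is where reusing a single dtype object
helps.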

> Can you explain why you're using khash instead of PyDict? It seems to
> add a *lot* of complexity -- like it seems like you're using about as
> many lines of code just marshalling data into and out of the khash as
> I used for my old npenum.pyx prototype (not even counting all the
> extra work required to , and AFAICT my prototype has about the same
> amount of functionality as this. (Of course that's not entirely fair,
> because I was working in Cython... but why not work in Cython?) And
> you'll need to expose a Python dict interface sooner or later anyway,
> I'd think?

I suppose I agree with the sentiment that the core of NumPy really ought
to be less dependent on the Python C API, not more. I also think the
khash API is pretty dead simple and straightforward, and the fact that
it is contained in a single header is attractive. It's also quite
performant in time and space. But if others disagree strongly, all of
its uses are hidden behind the interface in leveled_dtypes.c, so it could
be replaced with some other mechanism easily enough.

> I can't tell if it's worth having categorical scalar types. What value
> do they provide over just using scalars of the level type?

I'm not certain they are worthwhile either, which is why I have not
spent any time on them yet. Wes has expressed a desire for very broad
categorical types (even more than just scalar categories); hopefully he
can chime in with his motivations.

> Terminology: I'd like to suggest we prefer the term "categorical" for
> this data, rather than "factor" or "enum". Partly this is because it
> makes my life easier ;-):
>    https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
> and partly because numpy has a very diverse set of users and I suspect
> that "categorical" will just be a more transparent name to those who
> aren't already familiar with the particular statistical and
> programming traditions that "factor" and "enum" come from.

I think I like "categorical" over "factor", but I am not sure we should
ditch "enum". There are two different use cases here. In the first, I
have a pile of strings (or scalars) that I want to treat as discrete
things (categories). In the second, I have a pile of numbers that I want
to give convenient or meaningful names to (enums). This latter case was
the motivation for possibly adding "Natural Naming".
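
A rough sketch of the two use cases in plain Python may help. It uses
the standard-library `enum` module purely as an analogy and says nothing
about how the proposed dtype works; the `Status` class and the sample
data are made up for illustration.

```python
from enum import IntEnum

# Use case 1 (categorical): the labels are the data, and the integer
# codes are just a storage detail derived from them.
labels = ["LOW", "HIGH", "LOW"]
levels = sorted(set(labels))
codes = [levels.index(label) for label in labels]


# Use case 2 (enum): the numbers are the data, and the names are a
# convenience layered on top of fixed, user-chosen values.
class Status(IntEnum):  # hypothetical example type
    OK = 0
    WARN = 1
    ERROR = 2


raw = [0, 2, 1, 0]
named = [Status(value).name for value in raw]
```

In the first case the codes fall out of the data; in the second the
name-to-number mapping is fixed up front and the data must conform to it.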

> I'm disturbed to see you adding special cases to the core ufunc
> dispatch machinery for these things. I'm -1 on that. We should clean
> up the generic ufunc machinery so that it doesn't need special cases
> to handle adding a simple type like this.

This could certainly be improved, I agree.

> I'm also worried that I still don't see any signs that you're working
> with the downstream libraries that this functionality is intended to
> be useful for, like the various HDF5 libraries and pandas. I really
> don't think this functionality can be merged to numpy until we have
> affirmative statements from those developers that they are excited
> about it and will use it, and since they're busy people, it's pretty
> much your job to track them down and make sure that your code will
> solve their problems.

Francesc is certainly aware of this work, and I emailed Wes earlier this
week; I probably should have mentioned that. Hopefully they will have
time to contribute their thoughts. I also imagine Travis can speak on
behalf of the users he has interacted with over the last several years
who have requested this feature but don't happen to follow mailing
lists.

Thanks,

Bryan