[Numpy-discussion] Enum/Factor NEP (now with code)

Wed Jun 13 09:33:41 EDT 2012

On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven <bryanv at continuum.io> wrote:
> Hi all,
>
> It has been some time, but I do have an update regarding this proposed
> feature. I thought it would be helpful to flesh out some parts of a
> possible implementation to learn what can be spelled reasonably in
> NumPy. Mark Wiebe helped out greatly in navigating the NumPy code
> codebase. Here is a link to my branch with this code;
>
>     https://github.com/bryevdv/numpy/tree/enum
>
> and the updated NEP:
>
>     https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> Not everything in the NEP is implemented (integral levels and natural
> naming in particular) and some parts definitely need more fleshing out.
> However, things currently work basically as described in the NEP, and
> there is also a small set of tests that demonstrate current usage. A few
> things will crash python (astype especially). More tests are needed. I
> would appreciate as much feedback and discussion as you can provide!

Hi Bryan,

I skimmed over the diff:
   https://github.com/bryevdv/numpy/compare/master...enum
It was a bit hard to read since it seems like about half the changes
in that branch are datatime cleanups or something? I hope you'll
separate those out -- it's much easier to review self-contained
changes, and the more changes you roll together into a big lump, the
more risk there is that they'll get lost all together.

>From the updated NEP I actually understand the use case for "open
types" now, so that's good :-). But I don't think they're actually
workable, so that's bad :-(. The use case, as I understand it, is for
when you want to extend the levels set on the fly as you read through
a file. The problem with this is that it produces a non-deterministic
level ordering, where level 0 is whatever was seen first in the file,
level 1 is whatever was seen second, etc. E.g., say I have a CSV file
I read in:

    subject,initial_skill,skill_after_training
    1,LOW,HIGH
    2,LOW,LOW
    3,HIGH,HIGH
    ...

With the scheme described in the NEP, my initial_skill dtype will have
levels ["LOW", "HIGH"], and by skill_after_training dtype will have
levels ["HIGH","LOW"], which means that their storage will be
incompatible, comparisons won't work (or will have to go through some
nasty convert-to-string-and-back path), etc. Another situation where
this will occur is if you have multiple data files in the same format;
whether or not you're able to compare the data from them will depend
on the order the data happens to occur in in each file. The solution
is that whenever we automagically create a set of levels from some
data, and the user hasn't specified any order, we should pick an order
deterministically by sorting the levels. (This is also what R does.
levels(factor(c("a", "b"))) -> "a", "b". levels(factor(c("b", "a")))
-> "a", "b".)

I'm inclined to say therefore that we should just drop the "open type"
idea, since it adds complexity but doesn't seem to actually solve the
problem it's designed for.

Can you explain why you're using khash instead of PyDict? It seems to
add a *lot* of complexity -- like it seems like you're using about as
many lines of code just marshalling data into and out of the khash as
I used for my old npenum.pyx prototype (not even counting all the
extra work required to , and AFAICT my prototype has about the same
amount of functionality as this. (Of course that's not entirely fair,
because I was working in Cython... but why not work in Cython?) And
you'll need to expose a Python dict interface sooner or later anyway,
I'd think?

I can't tell if it's worth having categorical scalar types. What value
do they provide over just using scalars of the level type?

Terminology: I'd like to suggest we prefer the term "categorical" for
this data, rather than "factor" or "enum". Partly this is because it
makes my life easier ;-):
  https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
and partly because numpy has a very diverse set of users and I suspect
that "categorical" will just be a more transparent name to those who
aren't already familiar with the particular statistical and
programming traditions that "factor" and "enum" come from.

I'm disturbed to see you adding special cases to the core ufunc
dispatch machinery for these things. I'm -1 on that. We should clean
up the generic ufunc machinery so that it doesn't need special cases
to handle adding a simple type like this.

I'm also worried that I still don't see any signs that you're working
with the downstream libraries that this functionality is intended to
be useful for, like the various HDF5 libraries and pandas. I really
don't think this functionality can be merged to numpy until we have
affirmative statements from those developers that they are excited
about it and will use it, and since they're busy people, it's pretty
much your job to track them down and make sure that your code will
solve their problems.

Hope that helps -- it's exciting to see someone working on this, and
you seem to be off to a good start!

-N