[Numpy-discussion] Enum/Factor NEP (now with code)

Nathaniel Smith njs at pobox.com
Sun Jun 17 18:06:14 EDT 2012


On Wed, Jun 13, 2012 at 11:06 PM, Bryan Van de Ven <bryanv at continuum.io> wrote:
> On 6/13/12 1:12 PM, Nathaniel Smith wrote:
>> Yes, of course we *could* write the code to implement these "open"
>> dtypes, and then write the documentation, examples, tutorials, etc. to
>> help people work around their limitations. Or, we could just implement
>> np.fromfile properly, which would require no workarounds and take less
>> code to boot.
>>
>> [snip]
>> So would a proper implementation of np.fromfile that normalized the
>> level ordering.
>
> My understanding of the impetus for the open type was sensitivity to the
> performance of having to make two passes over large text datasets. We'll
> have to get more feedback from users here and input from Travis, I think.

You definitely don't want to make two passes over large text datasets,
but that's not required. While reading through the data, you keep a
dict mapping levels to integer values, which you assign arbitrarily as
new levels are encountered, and an integer array holding the integer
value for each line of the file. Then at the end of the file, you sort
the levels, figure out what the proper integer value for each level
is, and do a single in-memory pass through your array, swapping each
integer value for the new correct integer value. Since your original
integer values are assigned densely, you can map the old integers to
the new integers using a single array lookup. This is going to be much
faster than any text file reader.
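
Concretely, here's a minimal pure-Python sketch of that scheme (the
function name and signature are just mine, for illustration):

    import numpy as np

    def read_levels(tokens):
        # Single pass: assign each new level the next dense code,
        # in order of first appearance.
        codes = {}
        raw = np.empty(len(tokens), dtype=np.intp)
        for i, tok in enumerate(tokens):
            raw[i] = codes.setdefault(tok, len(codes))
        # Normalize: sort the levels and build an old-code ->
        # new-code lookup table.
        levels = sorted(codes)
        remap = np.empty(len(levels), dtype=np.intp)
        for new_code, level in enumerate(levels):
            remap[codes[level]] = new_code
        # One vectorized in-memory pass swaps in the final codes.
        return levels, remap[raw]

So read_levels(["b", "a", "b", "c"]) gives levels ["a", "b", "c"] and
codes [1, 0, 1, 2], with only a single pass over the text itself.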

There may be some rare people who have huge data files, fast storage,
a very large number of distinct levels, and don't care about
normalizing level order. But I really think the default should be to
normalize level ordering, and then once you can do that, it's trivial
to add a "don't normalize please" option for anyone who wants it.

>>> I think I like "categorical" over "factor" but I am not sure we should
>>> ditch "enum". There are two different use cases here: I have a pile of
>>> strings (or scalars) that I want to treat as discrete things
>>> (categories), and: I have a pile of numbers that I want to give
>>> convenient or meaningful names to (enums). This latter case was the
>>> motivation for possibly adding "Natural Naming".
>> So mention the word "enum" in the documentation, so people looking for
>> that will find the categorical data support? :-)
>
> I'm not sure I follow.

So the above discussion was just about what to name things, and I was
saying that we don't need to use the word "enum" in the API itself,
whatever the design ends up looking like.

That said, I am not personally sold on the idea of using these things
in enum-like roles. There are already tons of "enum" libraries on PyPI
(I linked some of them in the last thread on this), and I don't see
how this design could handle all the basic use cases for enums. Flag
bits are one of the most common enums, after all, but red|green is
just NaL. So I'm +0 on just sticking to categorical data.
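
For contrast, here is the flag-bits case, where combining values is
the whole point (plain integer constants, just to illustrate):

    # Flag-style enum: powers of two, so values combine with |.
    RED, GREEN, BLUE = 1, 2, 4
    mask = RED | GREEN   # == 3, a perfectly meaningful flag set
    # A categorical dtype has no level that 3 could map to, which
    # is why red|green comes out as NaL ("not a level").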

> Natural Naming seems like a great idea for people
> that want something like an actual enum (i.e., a way to avoid magic
> numbers). We could even imagine some nice with-hacks:
>
>     colors = enum(['red', 'green', 'blue'])
>     with colors:
>         foo.fill(red)
>         bar.fill(blue)

FYI you can't really do this with a context manager. This is the
closest I managed:
  https://gist.github.com/2347382
and you'll note that it still requires reaching up the stack and
directly rewriting the C fields of a PyFrameObject while it is in the
middle of executing... this is surprisingly less horrible than it
sounds, but that still leaves a lot of room for horribleness.
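
If you're wondering why the naive version can't work, here's a
stripped-down sketch (no frame hacking involved):

    class Colors(object):
        red, green, blue = range(3)
        def __enter__(self):
            return self
        def __exit__(self, *exc):
            return False

    with Colors() as colors:
        colors.red   # fine: explicit attribute access
        red          # NameError: __enter__ has no way to inject
                     # bare names into the enclosing scope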

>>>> I'm disturbed to see you adding special cases to the core ufunc
>>>> dispatch machinery for these things. I'm -1 on that. We should clean
>>>> up the generic ufunc machinery so that it doesn't need special cases
>>>> to handle adding a simple type like this.
>>> This could certainly be improved, I agree.
>> I don't want to be Mr. Grumpypants here, but I do want to make sure
>> we're speaking the same language: what "-1" means is "I consider this
>> a show-stopper and will oppose merging any code that does not improve
>> on this". (Of course you also always have the option of trying to
>> change my mind. Even Mr. Grumpypants can be swayed by logic!)
> Well, a few comments. The special case in array_richcompare is due to
> the lack of string ufuncs. I think it would be great to have string
> ufuncs, but I also think it is a separate concern and outside the scope
> of this proposal. The special case in arraydescr_typename_get is for the
> same reason as the datetime special case: the need to access dtype metadata.
> I don't think you are really concerned about these two, though?
>
> That leaves the special case in
> PyUFunc_SimpleBinaryComparisonTypeResolver. As I said, I chafed a bit
> when I put that in. On the other hand, having dtypes with this extent of
> attached metadata, and potentially dynamic metadata, is unique in NumPy.
> It was simple and straightforward to add those few lines of code, and
> does not affect performance. How invasive will the changes to core ufunc
> machinery be to accommodate a type like this more generally? I took the
> easy way because I was new to the numpy codebase and did not feel
> confident mucking with the central ufunc code. However, maybe the
> dispatch can be accomplished easily with the casting machinery. I am not
> so sure; I will have to investigate. Of course, I welcome input,
> suggestions, and proposals on the best way to improve this.

I haven't gone back and looked over all the special cases in detail,
but my general point is that ufuncs need to be able to access dtype
metadata, and the fact that we're now talking about hard-coding
special case workarounds for this for a third dtype is pretty
compelling evidence of that. We'd already have full-fledged
third-party categorical dtypes if they didn't need special cases in
numpy. So I think we should fix the root problem instead of continuing
to paper over it. We're not talking about a major re-architecting of
numpy or anything.
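
To sketch the direction I mean (all names here are hypothetical, not
a proposal for the actual C-level API): the generic machinery could
consult the dtype itself instead of branching on it:

    # Hypothetical sketch of metadata-driven dispatch. None of this
    # is real numpy API; it just illustrates "one generic hook"
    # versus "one hard-coded special case per dtype".
    def resolve_comparison(op, dt1, dt2, fallback):
        for dt in (dt1, dt2):
            meta = getattr(dt, "metadata", None) or {}
            resolver = meta.get("resolve_comparison")
            if resolver is not None:
                # A dtype that carries metadata decides how its own
                # values compare; the core stays dtype-agnostic.
                return resolver(op, dt1, dt2)
        return fallback(op, dt1, dt2)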

-n


