[Numpy-discussion] Enum/Factor NEP (now with code)

Wes McKinney wesmckinn at gmail.com
Sun Jun 17 16:04:17 EDT 2012


On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object
>
> It seems like there are three obvious options, from most to least general:
>
> 1) Allow levels to be an arbitrary collection of hashable Python objects
> 2) Allow levels to be a homogeneous collection of objects of an
> arbitrary numpy dtype
> 3) Allow levels to be chosen from a few fixed types (strings and ints,
> I guess)
>
> I agree that (3) is a bit limiting. (1) is probably easier to
> implement than (2), but (2) is actually the most general, since of
> course "arbitrary Python object" is itself a dtype. Is it useful to be
> able to restrict levels to a homogeneous type? The main difference
> between dtypes and python types is that (most) dtype scalars can be
> unboxed -- is that substantively useful for levels?
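
For concreteness, a minimal sketch of what option (2) could look like
-- levels stored as a homogeneous array of some dtype and labels stored
as integer codes into it (the names here are illustrative, not a
proposed API):

    import numpy as np

    # Option (2) sketch: levels are a homogeneous array of an arbitrary
    # dtype, labels are integer codes indexing into it.
    levels = np.array([1.5, 2.5, 10.0])                  # float64 levels
    labels = np.array([0, 2, 1, 0, 2], dtype=np.intp)    # codes into levels

    # Dereferencing codes back to values is a single fancy-indexing step:
    values = levels[labels]    # -> array([ 1.5, 10. ,  2.5,  1.5, 10. ])

    # Option (1) would then correspond to the special case dtype=object:
    obj_levels = np.array(['a', 3.0, None], dtype=object)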
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
>
> If we have a type restriction on levels (options (2) or (3) above),
> then how to handle out-of-bounds values is quite a problem, yeah. Once
> we have NA dtypes I suppose we could use those, but we don't have them
> yet. It's tempting to just error out on any operation that encounters
> such values.
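
To make the -1 coding concrete, a small sketch (again, made-up names)
of a label array carrying a sentinel for missing values:

    import numpy as np

    # NA sketch: -1 in the label array marks a missing level ("NaL").
    levels = np.array(['low', 'medium', 'high'])
    labels = np.array([0, 2, -1, 1, -1], dtype=np.intp)

    valid = labels != -1
    # Per-level counts that skip the NA slots -- the kind of step a
    # groupby needs:
    counts = np.bincount(labels[valid], minlength=len(levels))
    # counts -> array([1, 1, 1])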
>
>> Nathaniel: my experience (see blog posting above for a bit more) is
>> that khash really crushes PyDict for two reasons: first, you can
>> use it with primitive types and avoid boxing, and second, you can
>> preallocate. Its memory footprint with large hashtables is also a
>> fraction of
>> PyDict's. The Python memory allocator is not the problem -- if you
>> create millions of Python objects, expect the RAM usage of the
>> Python process to balloon absurdly.
>
> Right, I saw that posting -- it's clear that khash has a lot of
> advantages as internal temporary storage for a specific operation like
> groupby on unboxed types. But I can't tell whether those arguments
> still apply now that we're talking about a long-term storage
> representation for data that has to support a variety of operations
> (many of which would require boxing/unboxing, since the API is in
> Python), that might or might not use boxed types, and so on.
> Obviously this also
> depends on which of the three options above we go with -- unboxing
> doesn't even make sense for option (1).
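
For reference, the operation in question is essentially factorize: map
each value to an integer code while building the table of uniques. A
pure-Python version -- which boxes every value, exactly the cost khash
avoids for primitive dtypes -- looks roughly like this:

    import numpy as np

    def factorize(values):
        # The dict plays the role khash plays in C: value -> integer code.
        table = {}
        labels = np.empty(len(values), dtype=np.intp)
        uniques = []
        for i, v in enumerate(values):
            code = table.get(v)
            if code is None:
                code = len(uniques)
                table[v] = code
                uniques.append(v)
            labels[i] = code
        return labels, np.asarray(uniques)

    labels, uniques = factorize(np.array([3.0, 1.0, 3.0, 2.0, 1.0]))
    # labels -> array([0, 1, 0, 2, 1]), uniques -> array([ 3.,  1.,  2.])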
>
> -n

I'm in favor of option #2 (a lite version of what I'm doing
currently). I handle a few dtypes (PyObject, int64, datetime64,
float64), though you'd have to go the code-generation route for all
the dtypes to keep yourself sane if you do that.
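
A rough sketch of what that per-dtype dispatch might look like
(hypothetical names; every entry here falls back to one generic routine
where real code would plug in specialized, likely code-generated,
implementations per dtype):

    import numpy as np

    def _factorize_generic(values):
        # Placeholder: np.unique gives sorted uniques plus integer codes
        # such that uniques[labels] reconstructs values.
        uniques, labels = np.unique(values, return_inverse=True)
        return labels, uniques

    FACTORIZERS = {
        np.dtype(object): _factorize_generic,
        np.dtype('int64'): _factorize_generic,
        np.dtype('float64'): _factorize_generic,
        np.dtype('datetime64[ns]'): _factorize_generic,
    }

    def factorize(values):
        func = FACTORIZERS.get(values.dtype, _factorize_generic)
        return func(values)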

- Wes


