[Numpy-discussion] Enum/Factor NEP (now with code)

Sun Jun 17 06:10:10 EDT 2012

On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> It looks like the levels can only be strings. This is too limited for
> my needs. Why not support all possible NumPy dtypes? In pandas world,
> the levels can be any unique Index object

It seems like there are three obvious options, from most to least general:

1) Allow levels to be an arbitrary collection of hashable Python objects
2) Allow levels to be a homogeneous collection of objects of any
arbitrary numpy dtype
3) Allow levels to be chosen from a few fixed types (strings and ints, I guess)

I agree that (3) is a bit limiting. (1) is probably easier to
implement than (2). Strictly speaking, though, (2) is the most
general, since "arbitrary Python object" is itself a dtype. Is it
useful to be able to restrict levels to be of homogeneous type? The
main difference between dtypes and python types is that (most) dtype
scalars can be unboxed -- is that substantively useful for levels?
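
To make the comparison concrete, here is a rough sketch of how option (2)
covers option (1) when the dtype is object. The factorize() helper is
hypothetical (it is not part of the NEP or of NumPy), and it leans on
np.unique, so it assumes the levels are sortable, which arbitrary
hashable objects need not be.

```python
import numpy as np

def factorize(values, dtype=None):
    """Hypothetical helper: split values into (levels, codes)."""
    arr = np.asarray(values, dtype=dtype)
    # np.unique returns the sorted unique values and, with return_inverse,
    # the index of each input element within those unique values.
    levels, codes = np.unique(arr, return_inverse=True)
    return levels, codes

# Option (2): homogeneous levels stored in a native (unboxed) integer dtype.
int_levels, int_codes = factorize([30, 10, 30, 20])

# Object dtype: the levels array holds ordinary boxed Python objects,
# which is how option (2) subsumes option (1).
obj_levels, obj_codes = factorize(["b", "a", "b"], dtype=object)

# Indexing the levels with the codes reconstructs the original data.
roundtrip = int_levels[int_codes]
```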

> What is the story for NA values (NaL?) in a factor array? I code them
> as -1 in the labels, though you could use INT32_MAX or something. This
> is very important in the context of groupby operations.

If we have a type restriction on levels (options (2) or (3) above),
then how to handle out-of-bounds values is quite a problem, yeah. Once
we have NA dtypes then I suppose we could use those, but we don't yet.
It's tempting to just raise an error from any operation that
encounters such values.
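
A small sketch of why the -1 convention needs care, assuming codes are
kept in a plain integer array alongside a levels array. NA_CODE and
decode_strict() are made-up names for illustration. The last part shows
one way a groupby-style count could skip missing entries instead of
raising.

```python
import numpy as np

NA_CODE = -1  # sentinel from the quoted message; the name is hypothetical

levels = np.array(["a", "b", "c"])
codes = np.array([0, NA_CODE, 2, 1])

# Naive decoding: -1 is a valid negative index in numpy, so the missing
# entry silently turns into the last level instead of failing.
naive = levels[codes]

def decode_strict(levels, codes):
    """Refuse to decode if any code is the NA sentinel."""
    if (codes == NA_CODE).any():
        raise ValueError("factor contains missing values")
    return levels[codes]

# Groupby-style counting that drops NA entries before counting.
valid = codes[codes != NA_CODE]
counts = np.bincount(valid, minlength=len(levels))
```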

> Nathaniel: my experience (see blog posting above for a bit more) is
> that khash really crushes PyDict for two reasons: you can use it with
> primitive types and avoid boxing, and secondly you can preallocate.
> Its memory footprint with large hashtables is also a fraction of
> PyDict. The Python memory allocator is not problematic-- if you create
> millions of Python objects expect the RAM usage of the Python process
> to balloon absurdly.

Right, I saw that posting -- it's clear that khash has a lot of
advantages as internal temporary storage for a specific operation like
groupby on unboxed types. But I can't tell whether those arguments
still apply now that we're talking about a long-term storage
representation for data that has to support a variety of operations
(many of which would require boxing/unboxing, since the API is in
Python), might or might not use boxed types, etc. Obviously this also
depends on which of the three options above we go with -- unboxing
doesn't even make sense for option (1).
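
As an illustration of where boxing enters, here is a sketch contrasting
a lookup over arbitrary hashable levels with one over native integer
levels. This is not khash and not a proposed implementation: a dict
stands in for the hash table, and np.searchsorted stands in for an
unboxed lookup. The searchsorted version assumes the levels are sorted
and that every value really is one of the levels.

```python
import numpy as np

# Option (1): levels are arbitrary hashable objects, so every lookup
# goes through boxed Python objects and their __hash__/__eq__.
levels_obj = ["low", ("x", 1), 42]
code_of = {level: i for i, level in enumerate(levels_obj)}
codes_obj = [code_of[v] for v in [42, "low", ("x", 1), 42]]

# Options (2)/(3): sorted levels in a native dtype. The search runs over
# raw int64 values without creating a Python object per element.
levels_int = np.array([10, 20, 30], dtype=np.int64)
data = np.array([30, 10, 30, 20], dtype=np.int64)
codes_int = np.searchsorted(levels_int, data)

# Crossing back into Python boxes the value again: pulling out a single
# element yields a numpy scalar object.
first = data[0]
```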

-n