
On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would be to cryptographically hash the enum string. I'd trust 64 bits of hash for this purpose.
The obvious disadvantage is the extra space used, but it'd be a bit more hassle-free than regular enums; you'd never have to fix the set of enum strings, and they'd always be directly comparable across different arrays. HDF libraries etc. could compress it at the storage layer, storing the enum mapping in the metadata.
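Something along these lines (just a rough sketch, assuming SHA-256 truncated to its first 64 bits; the enum_hash helper is purely illustrative):

import hashlib
import numpy as np

def enum_hash(s):
    # Cryptographically hash the label and keep the first 64 bits
    # as the stored code.
    digest = hashlib.sha256(s.encode('utf-8')).digest()
    return np.uint64(int.from_bytes(digest[:8], 'little'))

labels = ["red", "green", "blue", "green"]
codes = np.array([enum_hash(s) for s in labels], dtype=np.uint64)

# Equal labels always hash to equal codes, even for arrays built
# independently, so no shared set of levels ever has to be agreed on.
assert codes[1] == codes[3]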
You'd trust 64 bits to be collision-free for all strings ever stored in numpy, eternally? I wouldn't. (By the birthday bound, the odds of a collision reach roughly 50% once you have on the order of 2**32 distinct strings.)

Anyway, if the goal is to store an arbitrary set of strings in 64 bits apiece, then there is no downside to just using an object array + interning (like pandas does now), and this *is* guaranteed to be collision-free. Maybe it would be useful to have a "heap string" dtype, but that'd be something different.

AFAIK all the cases where an explicit categorical type adds value over this are the ones where having an explicit set of levels is useful. Representing HDF5 enums or R factors requires a way to specify arbitrary string<->integer mappings, and there are algorithms (e.g. in charlton) that are much more efficient if they can figure out the set of possible levels directly, without scanning the whole array.

-N
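For comparison, a rough sketch of the object-array + interning idea (illustrative only; pandas' actual internals differ):

import sys
import numpy as np

labels = ["red", "green", "blue", "green"]

# sys.intern makes equal strings share a single heap object, so each
# array element is just a pointer (64 bits on a 64-bit build) and
# equality between two interned strings is a pointer comparison.
arr = np.array([sys.intern(s) for s in labels], dtype=object)
assert arr[1] is arr[3]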