
On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:
On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
I'm inclined to say therefore that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
If one wants to have an "open", hassle-free enum, an alternative would be to cryptographically hash the enum string. I'd trust 64 bits of hash for this purpose.
The obvious disadvantage is the extra space used, but it'd be a bit more hassle-free than regular enums; you'd never have to fix the set of enum strings, and they'd always be directly comparable across different arrays. HDF libraries etc. could compress it at the storage layer, storing the enum mapping in the metadata.
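Something along these lines (just a rough sketch, assuming SHA-256 truncated to its first 64 bits; the enum_hash helper is purely illustrative):

import hashlib
import numpy as np

def enum_hash(s):
    # Cryptographically hash the label and keep the first 64 bits
    # as the stored code.
    digest = hashlib.sha256(s.encode('utf-8')).digest()
    return np.uint64(int.from_bytes(digest[:8], 'little'))

labels = ["red", "green", "blue", "green"]
codes = np.array([enum_hash(s) for s in labels], dtype=np.uint64)

# Equal labels always hash to equal codes, even for arrays built
# independently, so no shared set of levels ever has to be agreed on.
assert codes[1] == codes[3]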
You'd trust 64 bits to be collision-free for all strings ever stored in numpy, eternally? I wouldn't. (By the birthday bound, the odds of a collision reach roughly 50% once you have on the order of 2**32 distinct strings.)

Anyway, if the goal is to store an arbitrary set of strings in 64 bits apiece, then there is no downside to just using an object array + interning (like pandas does now), and this *is* guaranteed to be collision-free. Maybe it would be useful to have a "heap string" dtype, but that'd be something different.

AFAIK all the cases where an explicit categorical type adds value over this are the ones where having an explicit set of levels is useful. Representing HDF5 enums or R factors requires a way to specify arbitrary string<->integer mappings, and there are algorithms (e.g. in charlton) that are much more efficient if they can figure out the set of possible levels directly, without scanning the whole array.

-N
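For comparison, a rough sketch of the object-array + interning idea (illustrative only; pandas' actual internals differ):

import sys
import numpy as np

labels = ["red", "green", "blue", "green"]

# sys.intern makes equal strings share a single heap object, so each
# array element is just a pointer (64 bits on a 64-bit build) and
# equality between two interned strings is a pointer comparison.
arr = np.array([sys.intern(s) for s in labels], dtype=object)
assert arr[1] is arr[3]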