[Numpy-discussion] Enum/Factor NEP (now with code)

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Wed Jun 13 12:48:28 EDT 2012



Nathaniel Smith <njs at pobox.com> wrote:

>On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn
><d.s.seljebotn at astro.uio.no> wrote:
>> On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
>>> I'm inclined to say therefore that we should just drop the "open
>>> type" idea, since it adds complexity but doesn't seem to actually
>>> solve the problem it's designed for.
>>
>> If one wants to have an "open", hassle-free enum, an alternative
>> would be to cryptographically hash the enum string. I'd trust 64
>> bits of hash for this purpose.
>>
>> The obvious disadvantage is the extra space used, but it'd be a bit
>> more hassle-free compared to regular enums; you'd never have to fix
>> the set of enum strings and they'd always be directly comparable
>> across different arrays. HDF libraries etc. could compress it at the
>> storage layer, storing the enum mapping in the metadata.
>
>You'd trust 64 bits to be collision-free for all strings ever stored
>in numpy, eternally? I wouldn't. Anyway, if the goal is to store an
>arbitrary set of strings in 64 bits apiece, then there is no downside
>to just using an object array + interning (like pandas does now), and
>this *is* guaranteed to be collision free. Maybe it would be useful to
>have a "heap string" dtype, but that'd be something different.
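To put a number on that worry, here's a rough back-of-the-envelope sketch using the standard birthday-bound approximation:

    import math

    def collision_probability(n_strings, bits=64):
        # Birthday bound: chance that at least two of n uniformly
        # distributed `bits`-bit hashes collide.
        return 1.0 - math.exp(-n_strings * (n_strings - 1) / 2.0 / 2.0**bits)

    print(collision_probability(10**6))      # ~2.7e-8: one dataset is safe
    print(collision_probability(5 * 10**9))  # ~0.49: "eternally" is a coin flip

So a single application's worth of strings is quite safe, but a hash shared across all arrays ever created is not.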

Heh, we've had this discussion before :-)

The 'interned heap string dtype' may be something different, but it could meet the 'open enum' use cases (assuming they exist) better than making enums complicated.

Consider it a backup strategy in case the open enum idea can't be laid to rest otherwise.
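For concreteness, a minimal sketch of what interning buys (using Python's sys.intern; the heap string dtype itself remains hypothetical):

    import sys
    import numpy as np

    # Fresh string objects built at runtime: equal, but distinct objects.
    raw = [s.lower() for s in ["RED", "GREEN", "RED", "BLUE", "GREEN"]]
    assert raw[0] == raw[2] and raw[0] is not raw[2]

    # Interning maps equal strings to one shared object, so comparison
    # degenerates to a pointer check: collision-free by construction,
    # at one pointer (64 bits on a 64-bit build) per array element.
    arr = np.array([sys.intern(s) for s in raw], dtype=object)
    assert arr[0] is arr[2]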

>
>AFAIK all the cases where an explicit categorical type adds value over
>this are the ones where having an explicit set of levels is useful.
>Representing HDF5 enums or R factors requires a way to specify
>arbitrary string<->integer mappings, and there are algorithms (e.g. in
>charlton) that are much more efficient if they can figure out what the
>set of possible levels is directly without scanning the whole array.

For interned strings, the set of strings present could in principle be stored with the array (though I guess it would be very difficult to implement in current numpy).
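As a Python-level sketch of that bookkeeping (InternedStringArray is entirely hypothetical, just to show the shape of the idea):

    import sys
    import numpy as np

    class InternedStringArray:
        """Object array of interned strings plus the set of distinct
        values present, so the 'levels' can be read off without a
        full scan."""

        def __init__(self, values):
            self.data = np.array([sys.intern(v) for v in values],
                                 dtype=object)
            self.levels = set(self.data)

        def __getitem__(self, i):
            return self.data[i]

        def __setitem__(self, i, value):
            self.data[i] = sys.intern(value)
            self.levels.add(value)
            # Overwrites can strand stale entries in self.levels;
            # keeping the set exact under mutation is the hard part.

    codes = InternedStringArray(["a", "b", "a"])
    codes[2] = "c"
    print(sorted(codes.levels))   # ['a', 'b', 'c']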

The perfect hashing schemes we've explored on the Cython list lately use around 10-20 microseconds on my 1.8 GHz machine for a 64-element table rehash (the worst-case insertion; it happens more often than rehashing does in a regular hash table) and 0.5-2 nanoseconds for a lookup that hits L1 cache (the lookup always succeeds on the first probe if the entry is in the table).
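To show the shape of such a scheme, a deliberately naive toy along the same lines (not the actual design from the Cython list; the random multiplier search below is only workable while the table is lightly loaded, and keys are assumed distinct):

    import random

    M64 = (1 << 64) - 1  # emulate 64-bit arithmetic

    class ToyPerfectTable:
        """64-slot table with a multiply-shift hash whose multiplier is
        chosen so every stored key sits in a distinct slot: lookups are
        a single probe, insertions occasionally pay for a rehash."""

        def __init__(self):
            self.keys = []
            self.mult = 1
            self.slots = [None] * 64

        def _slot(self, h, mult):
            # top 6 bits of the 64-bit product select one of 64 slots
            return ((h * mult) & M64) >> 58

        def insert(self, key):
            h = hash(key) & M64
            s = self._slot(h, self.mult)
            self.keys.append(key)
            if self.slots[s] is None:   # current multiplier still perfect
                self.slots[s] = key
                return
            # Worst-case insertion ("rehash"): retry random odd
            # multipliers until every key lands in a distinct slot.
            hashes = [hash(k) & M64 for k in self.keys]
            for _ in range(100000):
                m = random.getrandbits(64) | 1
                slots = [self._slot(hh, m) for hh in hashes]
                if len(set(slots)) == len(slots):
                    self.mult = m
                    self.slots = [None] * 64
                    for k, ss in zip(self.keys, slots):
                        self.slots[ss] = k
                    return
            raise RuntimeError("table too full for this naive search")

        def __contains__(self, key):
            # single probe, no collision chain to walk
            return self.slots[self._slot(hash(key) & M64, self.mult)] == key

    t = ToyPerfectTable()
    for name in ["red", "green", "blue", "mauve"]:
        t.insert(name)
    assert "mauve" in t and "teal" not in t

The point is the lookup path: one multiply, one shift, one compare. The price is the occasional whole-table parameter search on insertion, which is the cost the 10-20 microsecond figure refers to.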

Dag



-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


