[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sun Jun 10 10:26:55 CEST 2012



Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:

>On 06/10/2012 09:34 AM, Robert Bradshaw wrote:
>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>
>>>
>>> Robert Bradshaw<robertwb at gmail.com>  wrote:
>>>
>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>>> I'd love to not do interning, but I see no way around it.
>>>>
>>>> No, I want to use the lower 64 bits by default, but always have the
>>>> top 96 bits around to allow using this mechanism in "secure" mode
>at a
>>>> slight penalty. md5 is out because there are known collisions.
>(Yes,
>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses
>>>> have been shown, so we could look to using something else
>(hopefully
>>>> still shipped with Python).
>>>
>>> But very few users are going to know about this. What's the odds
>that the user who decide to trigger JIT-compilation with function
>signatures that varies based on the input will know about the option
>and turn it on and also recompile all his/her C extension modules?
>>>
>>> In practice, such an option would always stay at its default value.
>If we leave it to secure by default and start teaching it to users from
>the start...but that's a big price to pay.
>>
>> Yes, it's not ideal from this perspective.
>>
>>> And if you *do* want to run in secure mode, it will be a lot slower
>than interning.
>>
>> Are you thinking that the 64-bit interned pointer would be used as
>the
>> hash? In this case all hashtables would have to be constructed at
>> runtime, which means it needs to be really, really cheap (well under
>a
>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports
>> at startup). Also I'm not sure how the very-uneven distribution would
>> play out for constructing perfect hastables (perhaps it won't hurt,
>> there's likely to be long runs of consecutive values in some cases.
>
>No, I'm thinking that callsites need both the 64-bit interned char* and
>
>the 64-bit hash of the *contents*. They use the hash to figure out the 
>position, then compare by ID.
>
>The hash is not stored in callees, it's discarded after figuring out
>the 
>table layout.
>
>(There was this idea that if the char* has least significant bit set, 
>we'd hash it directly rather than dereference it, but let's ignore that
>
>for now.)
>
>I don't think under a millisecond is unfeasible to hash smallish tables
>
>-- we could put the pointer through a cheap hash to create more entropy
>
>(for the perfect hashing, being able to select a hash function through 
>the >>r is important, so you can't just use the pointer directly -- but
>
>there are functions cheaper than md5, e.g, in here: 
>http://code.google.com/p/ulib/)
>
>That would save us a register and make the instructions shorter in some
>
>places I guess...I think it's really miniscule, it's not like the
>effect 
>of load of a global variable. But if you like this approach I can 
>benchmark C-written hashtable creation and see.


I don't know what I was thinking. Tha callsite can't hash every time, and the pointer doesn't contain enough entropy for perfect hashing, so hashing the pointer has only disadvantages.

I really think the call site should have both a hash and a separate interned ID. And if the caller knows the entry should be there, it can skip the ID check and only needs the hash.

That makes the table pretty slick for non-smart callers too, it would be (id, flags, ptr)-entries, and callers could either do strcmp or interning, with or without hashing. (I realize the information would be there in your proposal too, but this would be slimmer).

Dag



>
>Dag
>_______________________________________________
>cython-devel mailing list
>cython-devel at python.org
>http://mail.python.org/mailman/listinfo/cython-devel

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


More information about the cython-devel mailing list