[Cython] Hash-based vtables

Sun Jun 10 11:53:12 CEST 2012

On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/10/2012 10:23 AM, Robert Bradshaw wrote:
>>
>> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no>  wrote:
>>>
>>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote:
>>>>
>>>>
>>>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Robert Bradshaw<robertwb at gmail.com>    wrote:
>>>>>
>>>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
>>>>>> <d.s.seljebotn at astro.uio.no>    wrote:
>>>>>>>
>>>>>>>
>>>>>>> I'd love to not do interning, but I see no way around it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> No, I want to use the lower 64 bits by default, but always have the
>>>>>> top 96 bits around to allow using this mechanism in "secure" mode at a
>>>>>> slight penalty. md5 is out because there are known collisions. (Yes,
>>>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses
>>>>>> have been shown, so we could look to using something else (hopefully
>>>>>> still shipped with Python).
>>>>>
>>>>>
>>>>>
>>>>> But very few users are going to know about this. What's the odds that
>>>>> the
>>>>> user who decide to trigger JIT-compilation with function signatures
>>>>> that
>>>>> varies based on the input will know about the option and turn it on and
>>>>> also
>>>>> recompile all his/her C extension modules?
>>>>>
>>>>> In practice, such an option would always stay at its default value. If
>>>>> we
>>>>> leave it to secure by default and start teaching it to users from the
>>>>> start...but that's a big price to pay.
>>>>
>>>>
>>>>
>>>> Yes, it's not ideal from this perspective.
>>>>
>>>>> And if you *do* want to run in secure mode, it will be a lot slower
>>>>> than
>>>>> interning.
>>>>
>>>>
>>>>
>>>> Are you thinking that the 64-bit interned pointer would be used as the
>>>> hash? In this case all hashtables would have to be constructed at
>>>> runtime, which means it needs to be really, really cheap (well under a
>>>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports
>>>> at startup). Also I'm not sure how the very-uneven distribution would
>>>> play out for constructing perfect hastables (perhaps it won't hurt,
>>>> there's likely to be long runs of consecutive values in some cases.
>>>
>>>
>>>
>>> No, I'm thinking that callsites need both the 64-bit interned char* and
>>> the
>>> 64-bit hash of the *contents*. They use the hash to figure out the
>>> position,
>>> then compare by ID.
>>
>>
>> Ah, I missed that bit. OK, yes, that could work well.
>
>
> Ah, we've been talking past one another for some time then. OK, let's settle
> on that.
>
>
>>
>>> The hash is not stored in callees, it's discarded after figuring out the
>>> table layout.
>>>
>>> (There was this idea that if the char* has least significant bit set,
>>> we'd
>>> hash it directly rather than dereference it, but let's ignore that for
>>> now.)
>>
>>
>> (For the purpose of this discussion, it's part of the "interning" step.)
>>
>>> I don't think under a millisecond is unfeasible to hash smallish tables
>>> --
>>> we could put the pointer through a cheap hash to create more entropy (for
>>> the perfect hashing, being able to select a hash function through the>>r
>>> is
>>> important, so you can't just use the pointer directly -- but there are
>>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/)
>>
>>
>> Just a sec, we're not hashing pointers, but the full signature itself,
>> right? For our hash function we need
>>
>> (1) Collision free on 64-bits (for non-malicious use).
>> (2) Good distribution (including for short strings, which is harder to
>> come by).
>> (2b) Any small subset of bits should have property (2).
>> (3) Ideally easy to reference (e.g. "md5" is better than "these 100
>> lines of C code").
>>
>> Cheap runtime construction is still ideal, but much less of an issue
>> if hashes (and perfect tables) can be constructed at compile time,
>> which I think this scheme allows.
>
>
> Yes, 64 bits of md5 then?

+1 for me.

> ulib contains "100 lines of C code" for md5
> anyway, if one doesn't want to go through Python hashlib (I imagine e.g.
> hashlib might be unavailable somewhere as it relies on openssl and there's
> license war going on vs. gnutls and so on. And the md5 module is
> deprecated.).

Just the interface, right? (hashlib should be used instead...)

>>> That would save us a register and make the instructions shorter in some
>>> places I guess...I think it's really miniscule, it's not like the effect
>>> of
>>> load of a global variable. But if you like this approach I can benchmark
>>> C-written hashtable creation and see.
>>
>>
>> This will have value in and of itself (both the implementation and the
>> benchmarks).
>
>
> Will do (eventually, less spare time in coming week).
>
> About signatures, a problem I see with following the C typing is that the
> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq"
> on 32-bit Linux, and so on. I think that would be really bad.

This is why I suggested promotion for scalars (divide ints into
<=sizeof(long) and sizeof(long) < x <= sizeof(long long)), checked at
C compile time, though I guess you consider that evil. I don't
consider not matching really bad, just kind of bad.

> "l" must be banished -- but then one might as well do "i4i8i8".
>
> Designing a signature hash where you select between these at compile-time is
> perhaps doable but does generate a lot of code and makes everything
> complicated.

It especially gets messy when you're trying to pre-compute tables.

> I think we should just start off with hashing at module load
> time when sizes are known, and then work with heuristics and/or build system
> integration to improve on that afterwards.

Finding 10,000 optimal tables at runtime better be really cheap than
for Sage's sake :).

- Robert