[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Jun 5 18:56:46 CEST 2012


On 06/05/2012 11:16 AM, Robert Bradshaw wrote:
> On Tue, Jun 5, 2012 at 1:07 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
>> Dag Sverre Seljebotn, 05.06.2012 00:07:
>>> The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but gives no details; probably just hand-waving.
>>>
>>> 128 bits gives more entropy for perfect hashing: some, but not much, since each shift r is hardwired to one 64-bit subset.
>>
>> Perfect hashing can be done with any fixed size data set (although it's not
>> guaranteed to always be the most efficient solution). It doesn't matter if
>> you use 64 bits or 128 bits. If 4 bits is enough, go with that. The
>> advantage of perfect hashing of a fixed size data set is that the hash
>> table has no free space and a match is guaranteed to be exact.
>
> The hash function is f(h(sig)) where f is parameterized but must be
> *extremely* cheap and h is fixed without regard to the entry set. This
> is why having 128 bits for the output of h may be an advantage.
>
>> However, the problem in this specific case is that the caller and the
>> callee do not agree on the same set of entries, so there may be collisions
>> during the lookup (of a potentially very large set of signatures) that were
>> not anticipated in the perfect hash table layout (of the much smaller set
>> of provided signatures). Perfect hashing works here as well, but it loses
>> one of its main advantages over other hashing schemes. You then have to
>> compare the entries exactly after the lookup in order to make sure that you
>> didn't run into a collision, thus losing time again that you just won with
>> the hashing.

Robert and I spent some time on those benchmarks; please do read them 
before making statements like this.

There are benchmarks both with a 64-bit comparison of an interned ID and 
with a comparison against the compile-time 64-bit hash (the latter is 
faster). All my benchmarks included some comparison after the lookup.

Comparison is very cheap *if* it is on the likely() path. Branch misses 
are what count. Perfect hashing, even with a comparison, wins big-time 
on branch prediction.
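
To make this concrete, here is a rough sketch of the kind of lookup I 
have in mind (not the actual generated code; the struct layout and names 
are made up, and the two shift parameters are assumed to be chosen so 
that the provided signatures don't collide):

#include <stdint.h>
#include <stddef.h>

#define likely(x) __builtin_expect(!!(x), 1)

typedef struct {
    uint64_t id;   /* compile-time 64-bit hash of the signature */
    void *fn;      /* matching function pointer */
} vtab_entry;

typedef struct {
    uint64_t r1, r2;      /* per-table shift parameters ("f" is parameterized) */
    uint64_t mask;        /* table size - 1; table size is a power of two */
    vtab_entry entries[]; /* the hash table itself */
} vtab;

/* h = h(sig) is fixed and computed once by the caller; f is the cheap
   per-table part: two shifts, an xor and a mask. */
static void *lookup(const vtab *t, uint64_t h)
{
    const vtab_entry *e = &t->entries[((h >> t->r1) ^ (h >> t->r2)) & t->mask];
    if (likely(e->id == h))   /* the comparison after the lookup */
        return e->fn;
    return NULL;              /* miss: fall back to a slower dispatch path */
}

When the parameters are picked so that none of the provided signatures 
collide, the e->id == h test is taken essentially every time, which is 
exactly what the branch predictor rewards.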

>> But at least you only have to do exactly one such comparison, so that's an
>> advantage over a hashing scheme that also allows collisions in the layout.
>> Maybe you can even handle mismatches more quickly by adding a dedicated
>> "empty" entry that most (all?) anticipated mismatches would hash to.

You mean, like what I did in the twoshift+fback benchmark?

Getting a single branch miss makes it the slowest one. But all the other 
methods (the ones that don't collide) run slightly faster than with 
three shifts.
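
(For reference, my reading of the "dedicated empty entry" idea is 
roughly the following; this is only an illustrative sketch, not the 
benchmarked code, and reserving 0 as a never-matching ID is an 
assumption made for the example:)

#include <stdint.h>
#include <stddef.h>

#define EMPTY_ID 0   /* assumed: never a valid signature hash */

typedef struct { uint64_t id; void *fn; } entry;

/* Slots not occupied by a provided signature keep id == EMPTY_ID, so a
   lookup for an unknown signature still reads a real entry and simply
   fails the id comparison; there is no separate "is this slot empty?"
   branch on the hot path. */
static entry table[8] = {
    [3] = { 0x1234abcd5678ef90ULL, NULL /* real function pointer here */ },
};

static void *lookup8(uint64_t h, uint64_t slot)
{
    const entry *e = &table[slot & 7];
    return e->id == h ? e->fn : NULL;
}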

> The idea is that the comparison would be cheap, a single 128-bit
> compare. The whole point is to avoid branching in the success case.
>
> I agree with you about 64-bit collisions being too high a risk. One
> could re-introduce the encoding/interning if desired, but I think
> we're safe in assuming no accidental md5 collisions (but I hadn't
> thought much about the malicious case; if you're allowed to dictate
> function pointers you'd better have another line of defense. Perhaps
> this needs to be considered more.) We could even use sha1, though I

I fail to understand the comment on security at all. Why not just use 
the *correct* signature for a function that intentionally segfaults 
(or does whatever else)?

> thought the previous benchmarks indicated that comparing 160 bits was
> non-negligibly more expensive than comparing just 64.

Loading an interned ID from a global variable was certainly 
non-negligible too.
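
(As an aside on the cost of a wider ID: if it is stored as two 64-bit 
words, the success-path check can still be a single, well-predicted 
branch. A sketch, with made-up names:)

#include <stdint.h>

typedef struct { uint64_t lo, hi; } id128;  /* e.g. a truncated md5 of the signature */

/* Combine both halves before testing so the comparison compiles to one
   branch rather than two. */
static int id128_equal(id128 a, id128 b)
{
    return ((a.lo ^ b.lo) | (a.hi ^ b.hi)) == 0;
}

Whether the wider compare is noticeably more expensive in practice is, 
of course, exactly what the benchmarks have to settle.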

Let's do some numbers on how many bits we need here:

Ballpark estimate: Assume that 50 billion lines of Cython code will be 
written over the course of human history (that's like SAGE times 
200,000). Now assume that for every 100 lines of code people write, 
there's an entirely new method declaration that has never, ever been 
written before in Cython in all of human history => 2**22 signatures 
will occur.

The total probability that a *single* collision (or more) will *ever* 
happen over the course of human history is:

64 bit ID: 5e-7
96 bit ID: 1e-16
128 bit ID: 3e-26
160 bit ID: 6e-36

Computed with, e.g.,:

sage: R=RealField(1000)
sage: n=R(2)**22
sage: 1 - exp(-n * (n-1) / 2 / R(2)**160)

http://en.wikipedia.org/wiki/Birthday_problem
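
(Sanity check of the approximation being used: the collision probability 
for n random b-bit IDs is about n*(n-1)/2 / 2**b; with n = 2**22 and 
b = 64 that is roughly 2**43 / 2**64 = 2**-21, i.e. about 5e-7, matching 
the first row above.)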

Dag

