[Cython] Hash-based vtables
Robert Bradshaw
robertwb at gmail.com
Tue Jun 5 11:16:34 CEST 2012
On Tue, Jun 5, 2012 at 1:07 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Dag Sverre Seljebotn, 05.06.2012 00:07:
>> The C FAQ says 'if you know the contents of your hash table up front you can devise a perfect hash', but no details, probably just hand-waving.
>>
>> 128 bits gives more entropy for perfect hashing: some but not much since each shift r is hardwired to one 64 bit subset.
>
> Perfect hashing can be done with any fixed size data set (although it's not
> guaranteed to always be the most efficient solution). It doesn't matter if
> you use 64 bits or 128 bits. If 4 bits is enough, go with that. The
> advantage of perfect hashing of a fixed size data set is that the hash
> table has no free space and a match is guaranteed to be exact.
The hash function is f(h(sig)) where f is parameterized but must be
*extremely* cheap and h is fixed without regard to the entry set. This
is why having 128 bits for the output of h may be an advantage.
> However, the problem in this specific case is that the caller and the
> callee do not agree on the same set of entries, so there may be collisions
> during the lookup (of a potentially very large set of signatures) that were
> not anticipated in the perfect hash table layout (of the much smaller set
> of provided signatures). Perfect hashing works here as well, but it looses
> one of its main advantage over other hashing schemes. You then have to
> compare the entries exactly after the lookup in order to make sure that you
> didn't run into a collision, thus loosing time again that you just won with
> the hashing.
>
> But at least you only have to do exactly one such comparison, so that's an
> advantage over a hashing scheme that allows collisions also in the layout.
> Maybe you can even handle mismatches more quickly by adding a dedicated
> "empty" entry for them that most (all?) anticipated mismatches would hash to.
The idea is that the comparison would be cheap, a single 128-bit
compare. The whole point is to avoid branching in the success case.
I agree with you about 64-bit collisions being too high a risk. One
could re-introduce the encoding/interning if desired, but I think
we're safe in assuming no accidental md5 collisions (but hadn't
thought much about the malicious case; if you're allowed to dictate
function pointers you'd better have another line of defense. Perhaps
this needs to be considered more.) We could even use sha1, though I
thought the previous benchmarks indicated that comparing 160 bits was
non-negligibly more expensive than comparing just 64.
- Robert
More information about the cython-devel
mailing list