[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Wed Jun 6 22:41:44 CEST 2012


On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
> I just found http://cmph.sourceforge.net/ which looks quite
> interesting. Though the resulting hash functions are supposedly cheap,
> I have the feeling that branching is considered cheap in this context.

Actually, this lead was *very* promising. I believe the very first 
reference I actually read through and didn't eliminate after the 
abstract totally swept away our home-grown solutions!

"Hash & Displace" by Pagh (1999) is actually very simple, easy to 
understand, and fast both for generation and (the branch-free) lookup:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf

The idea is:

  - Find a hash `g(x)` to partition the keys into `b` groups (the paper 
requires b > 2n, though I think in practice you can often get away with 
less)

  - Find a hash `f(x)` such that f is 1:1 within each group (which is 
easily achieved since groups only has a few elements)

  - For each group, from largest to smallest: Find a displacement 
`d[group]` so that `f(x) ^ d` doesn't cause collisions.

It requires extra storage for the displacement table. However, I think 8 
bits per element might suffice even for vtables of 512 or 1024 in size. 
Even with 16 bits it's rather negligible compared to the minimum-128-bit 
entries of the table.

I benchmarked these hash functions:

displace1: ((h >> r1) ^ d[h & 63]) & m1
displace2: ((h >> r1) ^ d[h & m2]) & m1
displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1

Only the third one is truly in the spirit of the algorithm, but I think 
the first two should work well too (and when h is known compile-time, 
looking up d[h & 63] isn't harder than looking up r1 or m1).

My computer is acting up and all my numbers today are slower than the 
earlier ones (yes, I've disabled turbo-mode in the BIOS for a year ago, 
and yes, I've pinned the CPU speed). But here's today's numbers, 
compiled with -DIMHASH:

          direct: min=5.37e-09  mean=5.39e-09  std=1.96e-11 
val=2400000000.000000
           index: min=6.45e-09  mean=6.46e-09  std=1.15e-11 
val=1800000000.000000
        twoshift: min=6.99e-09  mean=7.00e-09  std=1.35e-11 
val=1800000000.000000
      threeshift: min=7.53e-09  mean=7.54e-09  std=1.63e-11 
val=1800000000.000000
       displace1: min=6.99e-09  mean=7.00e-09  std=1.66e-11 
val=1800000000.000000
       displace2: min=6.99e-09  mean=7.02e-09  std=2.77e-11 
val=1800000000.000000
       displace3: min=7.52e-09  mean=7.54e-09  std=1.19e-11 
val=1800000000.000000


I did a dirty prototype of the table-finder as well and it works:

https://github.com/dagss/hashvtable/blob/master/pagh99.py

Dag


More information about the cython-devel mailing list