[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Jun 5 22:10:04 CEST 2012


On 06/04/2012 11:43 PM, Robert Bradshaw wrote:
> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>
>>> Robert and I had a long discussion on the NumFOCUS list about this
>>> already, but I figured it was better to continue it here and provide
>>> more in-depth benchmark results.
>>>
>>> It's basically a new idea of how to provide a vtable based on perfect
>>> hashing, which should be a lot simpler to implement than what I first
>>> imagined.
>>>
>>> I'll write down some context first; if you're familiar with this,
>>> skip ahead a bit.
>>>
>>> This means that you can do fast dispatches *without* the messy
>>> business of binding vtable slots at compile time. To be concrete, this
>>> might e.g. take the form
>>>
>>> def f(obj):
>>>     obj.method(3.4)  # try to find a vtable with "void method(double)" in it
>>>
>>> or, a more typed approach,
>>>
>>> # File A
>>> cdef class MyImpl:
>>>     cdef double method(self, double x): return x * x
>>>
>>> # File B
>>> # Here we never know about MyImpl, hence "duck-typed"
>>> @cython.interface
>>> class MyIntf:
>>>     cdef double method(self, double x): pass
>>>
>>> def f(MyIntf obj):
>>>     # obj *can* be a MyImpl instance, or anything else that supports
>>>     # that interface
>>>     obj.method(3.4)
>>>
>>>
>>> Now, the idea to implement this is:
>>>
>>> a) Both caller and callee pre-hash name/argument string
>>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of
>>> md5)
>>>
>>> b) Callee (MyImpl) generates a vtable of its methods by *perfect*
>>> hashing. What you do is define a final hash fh as a function
>>> of the pre-hash ph, for instance
>>>
>>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m
>>>
>>> (Robert and I are benchmarking different functions to use here.) By
>>> playing with r1, r2, r3, you have 64**3 choices of hash function, and
>>> will be able to pick a combination which gives *no* (or very few)
>>> collisions.
>>>
>>> c) Caller then combines the pre-hash generated at compile-time, with
>>> r1, r2, r3, m stored in the vtable header, in order to find the
>>> final location in the hash-table.
>>>
>>> The exciting thing is that in these benchmarks, the performance
>>> penalty over a C++-style vtable is actually very slight. (Of course
>>> you can cache a proper vtable, but the fact that you get so close
>>> without caring about caching means that this can be done much faster.)
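To make steps (a)-(c) concrete, here is a toy Python sketch of the scheme (the signature-string format and the byte order taken for the "lower 64 bits" are my assumptions, and real code would of course be C):

```python
import hashlib

def prehash(signature):
    # Step (a): pre-hash a "name:argcodes" string to 64 bits of hash
    # data, here taken as the low 8 bytes of its md5 digest (assumption).
    digest = hashlib.md5(signature.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "little")

def final_hash(ph, r1, r2, r3, m):
    # Steps (b)/(c): the three-shift hash family from the text.
    return ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m

def find_perfect_params(signatures, table_size):
    # Step (b): brute-force the 64**3 choices of (r1, r2, r3) until
    # every signature lands in a distinct slot of a table_size-slot
    # table (table_size must be a power of two so m is a bit mask).
    m = table_size - 1
    phs = [prehash(s) for s in signatures]
    for r1 in range(64):
        for r2 in range(64):
            for r3 in range(64):
                slots = {final_hash(ph, r1, r2, r3, m) for ph in phs}
                if len(slots) == len(phs):  # no collisions: perfect
                    return r1, r2, r3, m
    return None  # no collision-free choice at this table size

# Hypothetical signatures; the callee would store function pointers
# (plus the 64-bit IDs) at the computed slots, and a caller knowing
# only the pre-hash and (r1, r2, r3, m) can index the table directly.
sigs = ["method:d", "other:ii", "third:dd", "fourth:l"]
params = find_perfect_params(sigs, 8)
```

In the real design the caller would then compare its compile-time ID against the ID stored in the slot before calling through the pointer.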
>
> One advantage about caching a vtable is that one can possibly put in
> adapters for non-exact matches. It also opens up the possibility of
> putting in stubs to call def methods if they exist. This needs to be
> fleshed out more (another CEP :) but could provide for a
> backwards-compatible, easy first implementation.
>
>>> Back to Robert's and my discussion of the benchmarks:
>>>
>>> I've uploaded benchmarks here:
>>>
>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>
>>> I've changed how the benchmarks are taken to give more robust numbers
>>> (at least for me); you want to look at the 'min' column.
>>>
>>> I changed the benchmark a bit so that it benchmarks a *callsite*.
>>> So we don't pass 'h' on the stack, but either a) look it up in a global
>>> variable (the default), or b) make it a compile-time constant (an
>>> immediate in assembly; compile with -DIMHASH).
>>>
>>> Similarly, the ID is either an "interned" global variable or an
>>> immediate (-DIMID).
>>>
>>> The results are very different on my machine depending on this aspect.
>>> My conclusions:
>>>
>>> - Three shifts with masking, two shifts with a "fallback slot"
>>> (allowing for a single collision), and two shifts with two masks
>>> all allow constructing good vtables. In the case of only two
>>> shifts, one colliding method gets the twoshift+fback performance
>>> and the rest get the twoshift performance.
>>>
>>> - Performance is much more affected by whether the hashes are
>>> immediates or global variables than by the hash function. This is in
>>> contrast to the interning vs. key benchmarks -- so I think that if
>>> we looked up the vtable through PyTypeObject, rather than getting
>>> the vtable directly, the loads of the global variables could
>>> potentially be masked by that.
>>>
>>> - My conclusion: Just use lower bits of md5 *both* for the hashing
>>> and the ID-ing (don't bother with any interning), and compile the
>>> thing as a 64-bit immediate. This can cause crashes/stack smashes
>>> etc. if there's lower-64bit-of-md5 collisions, but a) the
>>> probability is incredibly small, b) it would only matter in
>>> situations that should cause an AttributeError anyway, c) if we
>>> really care, we can always use an interning-like mechanism to
>>> validate on module loading that its hashes don't collide with
>>> other hashes (and raise an exception: "Congratulations, you've
>>> discovered a phenomenal md5 collision, get in touch with the cython
>>> devs and we'll work around it right away").
>
> Due to the birthday paradox, this seems a bit risky. Maybe it's
> because I regularly work with collections much bigger than 2^32, and I
> suppose we're talking about unique method names and signatures here,
> but still... I wonder what the penalty would be for checking the full
> 128 bit hash. (Storing it could allow for greater entropy in the
> optimal hash table search as well).

Wonder no more. Here's the penalty for different bit-lengths, all 
compile-time constants:

      threeshift: min=6.08e-09  mean=6.11e-09  std=2.81e-11  val=1200000000.000000
    threeshift96: min=7.53e-09  mean=7.55e-09  std=1.96e-11  val=1200000000.000000
   threeshift128: min=6.95e-09  mean=6.97e-09  std=2.57e-11  val=1200000000.000000
   threeshift160: min=8.17e-09  mean=8.23e-09  std=4.06e-11  val=1200000000.000000

And for comparison, when loading the comparison IDs from a global variable:

      threeshift: min=6.46e-09  mean=6.52e-09  std=4.95e-11  val=1200000000.000000
    threeshift96: min=8.07e-09  mean=8.16e-09  std=4.55e-11  val=1200000000.000000
   threeshift128: min=8.06e-09  mean=8.18e-09  std=6.71e-11  val=1200000000.000000
   threeshift160: min=9.71e-09  mean=9.83e-09  std=5.12e-11  val=1200000000.000000

So indeed,

64-bit hash < interning < 128-bit hash

(At least on my Intel Nehalem Core i7 @ 1.87 GHz.)

And the load of the global variable may in real life be hidden by other 
things going on in the function.

And you save vtable memory with interning: you store an interned char* 
and don't need to store the hash in the vtable.
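To quantify the birthday-paradox concern above, a quick back-of-envelope in Python (the standard approximation, not a number from the benchmarks):

```python
import math

def collision_prob(n, bits=64):
    # Probability that at least two of n distinct strings share the same
    # `bits`-bit hash, using the birthday approximation
    # p ~ 1 - exp(-n*(n-1) / 2^(bits+1)).
    return 1.0 - math.exp(-n * (n - 1) / 2.0 / 2.0 ** bits)

# Even a million distinct method signatures are very safe at 64 bits
# (on the order of 1e-8)...
p_million = collision_prob(10 ** 6)

# ...while at 2**32 signatures the risk is no longer negligible,
# which is the point about collections bigger than 2^32.
p_big = collision_prob(2 ** 32)
```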

The benchmarks should be made more easily runnable so that we can run 
them on various systems, but it makes sense to first read up on and 
figure out which hash functions are really viable, to keep the number 
of numbers down.

I just realized that I never pushed the changes I did to introduce 
-DIMHASH/-DIMID etc., but the benchmarks are pushed now.


> We could also do a fallback table. Usually it'd be empty; occasionally
> it'd have one element in it. It'd always be possible to make this big
> enough to avoid collisions in a worst-case scenario.

If you do a fallback table, it's as much code at the call site as linear 
probing...

But when I played with the generation side, a failure to create a table 
at a given size would *always* be due to a single collision. This is 
what I did in the twoshift+fback benchmark.
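For illustration, here is roughly what the twoshift+fback lookup does, as toy Python (names and table layout are my assumptions; the real call site would be inlined C working on function pointers):

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["id", "payload"])   # id = 64-bit pre-hash
VTable = namedtuple("VTable", ["r1", "r2", "m", "slots", "fallback"])

def lookup(vt, ph):
    # Two-shift hash into the main table...
    i = ((ph >> vt.r1) ^ (ph >> vt.r2)) & vt.m
    entry = vt.slots[i]
    if entry is not None and entry.id == ph:
        return entry.payload
    # ...and at most one collision, parked in the single fallback slot.
    if vt.fallback is not None and vt.fallback.id == ph:
        return vt.fallback.payload
    raise AttributeError("no method with this signature hash")

# Hand-built table for pre-hashes 9, 5, 2 with r1=0, r2=3, m=3;
# pre-hash 13 collides with 9 at slot 0 and goes to the fallback slot.
vt = VTable(r1=0, r2=3, m=3,
            slots=[Entry(9, "nine"), Entry(5, "five"), Entry(2, "two"), None],
            fallback=Entry(13, "thirteen"))
```

The extra branch for the fallback slot is exactly the additional call-site code mentioned above; only the one colliding method ever pays for it.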

> Duplicate tables work as long as there aren't too many orthogonal
> considerations. Is the GIL the only one? What about "I can propagate
> errors"? Now we're up to 4 tables...

Would your decision of whether or not to dispatch to a function depend 
on whether or not it propagates errors?

I'm thinking of the "with gil" function case, i.e. the callee has:

  a) Function to call if you have the GIL
  b) GIL-acquiring wrapper

and you want GIL-holding code to call a) and nogil code to call b).

But one could just make the caller acquire the GIL if needed (which in 
that case is so expensive anyway that it can be made the unlikely() path).

I can't think of other situations where you would pick which function to 
call based on flags.

Dag

