[Cython] Hash-based vtables

Robert Bradshaw robertwb at gmail.com
Mon Jun 4 23:43:07 CEST 2012


On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>
>> Robert and I had a long discussion on the NumFOCUS list about this
>> already, but I figured it was better to continue it here and provide
>> more in-depth benchmark results.
>>
>> It's basically a new idea of how to provide a vtable based on perfect
>> hashing, which should be a lot simpler to implement than what I first
>> imagined.
>>
>> I'll write down some context first; if you're familiar with this,
>> skip ahead a bit..
>>
>> This means that you can do fast dispatches *without* the messy
>> business of binding vtable slots at compile time. To be concrete, this
>> might e.g. take the form
>>
>> def f(obj):
>>     obj.method(3.4) # try to find a vtable with "void method(double)" in it
>>
>> or, a more typed approach,
>>
>> # File A
>> cdef class MyImpl:
>>     def double method(double x): return x * x
>>
>> # File B
>> # Here we never know about MyImpl, hence "duck-typed"
>> @cython.interface
>> class MyIntf:
>>     def double method(double x): pass
>>
>> def f(MyIntf obj):
>>     # obj *can* be a MyImpl instance, or whatever else
>>     # supports that interface
>>     obj.method(3.4)
>>
>>
>> Now, the idea to implement this is:
>>
>> a) Both caller and callee pre-hash name/argument string
>> "mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of
>> md5)
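For concreteness, step a) could look like the following sketch (assuming "lower 64 bits" means the first 8 bytes of the digest read little-endian; the exact byte order is a detail the CEP would have to pin down):

```python
# Sketch of step (a): both caller and callee hash the method name plus
# an encoded argument signature (e.g. "mymethod:iidd") down to 64 bits,
# here taken as the lower 64 bits of the md5 digest.
import hashlib
import struct

def prehash(signature):
    digest = hashlib.md5(signature.encode("ascii")).digest()
    # First 8 of the 16 digest bytes, as an unsigned 64-bit integer.
    (ph,) = struct.unpack("<Q", digest[:8])
    return ph
```

Since both sides compute this at compile time from the same string, no runtime agreement (interning) is needed for the hashes to match.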
>>
>> b) Callee (MyImpl) generates a vtable of its methods by *perfect*
>> hashing. What you do is define a final hash fh as a function
>> of the pre-hash ph, for instance
>>
>> fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) &
>> vtable.m
>>
>> (Robert and I are benchmarking different functions to use here.) By
>> playing with r1, r2, r3, you have 64**3 choices of hash function, and
>> will be able to pick a combination which gives *no* (or very few)
>> collisions.
>>
>> c) Caller then combines the pre-hash generated at compile-time, with
>> r1, r2, r3, m stored in the vtable header, in order to find the
>> final location in the hash-table.
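Steps b) and c) together, as a rough Python sketch (the parameter names r1/r2/r3/m follow the formula above; the brute-force search over all 64**3 shift combinations is the simplest possible strategy, not necessarily what the generated C would do):

```python
# Sketch of steps (b) and (c): the callee searches for shifts (r1, r2, r3)
# such that fh = ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m maps every
# method's pre-hash to a distinct slot; the caller recomputes fh from the
# parameters stored in the vtable header.
from itertools import product

def find_perfect_hash(prehashes, table_bits):
    m = (1 << table_bits) - 1
    for r1, r2, r3 in product(range(64), repeat=3):
        slots = {((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
                 for ph in prehashes}
        if len(slots) == len(prehashes):  # no collisions: perfect hash
            return r1, r2, r3, m
    return None  # no perfect hash at this size; grow the table and retry

def lookup(vtable, params, ph):
    # What a call site does: recompute fh and index the table directly.
    r1, r2, r3, m = params
    return vtable[((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m]
```

In the generated C, the caller's side of this is just three shifts, two xors, a mask and an indexed load, which is why it benchmarks so close to a fixed-slot vtable.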
>>
>> The exciting thing is that in benchmark, the performance penalty is
>> actually very slight over a C++-style v-table. (Of course you can
>> cache a proper vtable, but the fact that you get so close without
>> caring about caching means that this can be done much faster.)

One advantage of caching a vtable is that one can possibly put in
adapters for non-exact matches. It also opens up the possibility of
putting in stubs to call def methods if they exist. This needs to be
fleshed out more (another CEP :), but it could provide for an easy,
backwards-compatible first implementation.

>> Back to my and Robert's discussion on benchmarks:
>>
>> I've uploaded benchmarks here:
>>
>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>
>> I've changed the benchmark harness to give more robust numbers (at
>> least for me); you want to look at the 'min' column.
>>
>> I changed the benchmark a bit so that it benchmarks a *callsite*:
>> we don't pass 'h' on the stack, but either a) look it up in a global
>> variable (default), or b) treat it as a compile-time constant (an
>> immediate in assembly; compile with -DIMHASH).
>>
>> Similarly, the ID is either an "interned" global variable, or an
>> immediate (-DIMID).
>>
>> The results are very different on my machine depending on this aspect.
>> My conclusions:
>>
>> - Three shifts with masking, two shifts with a "fallback slot"
>> (allowing for a single collision), and two shifts with two masks
>> all allow constructing good vtables. In the case of only
>> two shifts, one colliding method gets the twoshift+fback
>> performance and the rest get the twoshift performance.
>>
>> - Performance is really more affected by whether hashes are
>> immediates or global variables than the hash function. This is in
>> contrast to the interning vs. key benchmarks -- so I think that if
>> we looked up the vtable through PyTypeObject, rather than getting
>> the vtable directly, the loads of the global variables could
>> potentially be masked by that.
>>
>> - My conclusion: Just use the lower bits of md5 *both* for the hashing
>> and the ID-ing (don't bother with any interning), and compile the
>> hash in as a 64-bit immediate. This can cause crashes/stack smashes
>> etc. if there are lower-64-bits-of-md5 collisions, but a) the
>> probability is incredibly small, b) it would only matter in
>> situations that should cause an AttributeError anyway, and c) if we
>> really care, we can always use an interning-like mechanism to
>> validate on module loading that its hashes don't collide with
>> other hashes (and raise an exception: "Congratulations, you've
>> discovered a phenomenal md5 collision, get in touch with the Cython
>> devs and we'll work around it right away").

Due to the birthday paradox, this seems a bit risky. Maybe it's
because I regularly work with collections much bigger than 2^32, and I
suppose we're talking about unique method names and signatures here,
but still... I wonder what the penalty would be for checking the full
128-bit hash. (Storing it could also allow for greater entropy in the
search for an optimal hash table.)
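The birthday bound here is easy to quantify: for n distinct signatures and a b-bit hash, the probability of any collision is roughly n*(n-1)/2^(b+1). A tiny sketch of the arithmetic:

```python
# Rough birthday bound: probability that any two of n distinct
# signatures share the same b-bit hash, approximately n*(n-1)/2**(b+1).
def collision_probability(n, bits=64):
    return n * (n - 1) / 2.0 ** (bits + 1)

# Even a million distinct method signatures give a probability on the
# order of 1e-8 for 64-bit hashes; at 128 bits it is astronomically small.
p64 = collision_probability(10**6, bits=64)
p128 = collision_probability(10**6, bits=128)
```

So whether 64 bits is "risky" really comes down to how many distinct method name/signature strings one expects to exist across all compiled modules.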

> What I forgot to mention:
>
>  - I really want to avoid linear probing just because of the code bloat in
> call sites.

That's a good point. What about flags--are we throwing out the idea of masking?

> With two shifts, when there was a failure to find a perfect hash
> it was always possible to find one with a single collision.
>
>  - Probing for the hash with two shifts is lightning fast, it can take a
> while with three shifts (though you can always spend more memory on a bigger
> table to make it fast again). However, it makes me uneasy to penalize the
> performance of calling one of the random methods, so I'm really in favour of
> three-shifts or double-mask (to be decided when investigating the
> performance of probing for parameters in more detail).
>
>  - I tried using SSE to do shifts in parallel and failed (miserable
> performance). The problem is quickly moving things between general purpose
> registers and SSE registers, and the lack of SSE immediates/constants in the
> instruction stream. At least, what my gcc 4.6 generates appeared to use the
> stack to communicate between SSE registers and general purpose registers
> (but I can't have been doing the right thing..).
>
>
>
>>
>> The RTTI (i.e. the char*) is also put in there, but is not used for
>> comparison and is not interned.
>>
>> At least, that's what I think we should do for duck-style vtables.
>>
>> Do we then go to all the pain of defining key-encoding, interning
>> etc. just for SEP 201? Isn't it easier to just mandate a md5 dependency
>> and be done with it? (After all, md5 usually comes with Python in the
>> md5 and hashlib modules)
>>
>> direct: Early-binding
>> index: Call slot 0 (C++-style vtable/function pointer)
>> noshift: h & m1
>> oneshift: (h >> r1) & m1
>> twoshift: ((h >> r1) ^ (h >> r2)) & m1
>> twoshift+fback: hash doesn't
>
>
> I meant: Hash collision and then, after a branch miss, look up the one
> fallback slot in the vtable header.

We could also do a fallback table. Usually it'd be empty; occasionally
it'd have one element in it. It could always be made big enough to
avoid collisions in a worst-case scenario.
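In Python terms (purely illustrative; the real thing would be a few lines of generated C, and the entry layout with an ID field next to the function pointer is an assumption), the fallback idea amounts to:

```python
# Sketch of the two-shift-plus-fallback lookup: the hot path is a single
# computed probe plus an ID comparison; only on a mismatch do we consult
# a tiny secondary table keyed by the full pre-hash, which is usually
# empty and occasionally holds one colliding entry.
def lookup_with_fallback(vtable, fallback, r1, r2, m, ph):
    entry = vtable[((ph >> r1) ^ (ph >> r2)) & m]
    if entry is not None and entry[0] == ph:  # compare the stored ID
        return entry[1]                       # hit: the function pointer
    return fallback.get(ph)                   # rare fallback path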

BTW, this is a general static char* -> void* dictionary; I bet it
could have other uses. (It may also be a well-studied problem, though
a bit hard to search for...) I suppose we could reduce it to
read-optimized int -> int mappings.

- Robert


More information about the cython-devel mailing list