[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Fri Jun 8 23:12:58 CEST 2012


On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote:
> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
>>>>>
>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn
>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>
>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite
>>>>>>>> interesting. Though the resulting hash functions are supposedly cheap,
>>>>>>>> I have the feeling that branching is considered cheap in this context.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually, this lead was *very* promising. I believe the very first
>>>>>>> reference I actually read through (rather than eliminating it after the
>>>>>>> abstract) totally swept away our home-grown solutions!
>>>>>>>
>>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to
>>>>>>>
>>>>>>> understand, and fast both for generation and (the branch-free)
>>>>>>> lookup:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The idea is:
>>>>>>>
>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper
>>>>>>> requires b > 2n, though I think in practice you can often get away
>>>>>>> with less)
>>>>>>>
>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is
>>>>>>> easily achieved since each group only has a few elements)
>>>>>>>
>>>>>>> - For each group, from largest to smallest: Find a displacement
>>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions.
>>>>>>>
>>>>>>> It requires extra storage for the displacement table. However, I think
>>>>>>> 8 bits per element might suffice even for vtables of 512 or 1024 in
>>>>>>> size. Even with 16 bits it's rather negligible compared to the
>>>>>>> minimum-128-bit entries of the table.
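For concreteness, here is a minimal C sketch of that construction step, assuming g(x) = h & (b-1) and f(x) = h >> r for some shift r (i.e. the displace2 variant benchmarked below) and 16-bit displacement entries. It is purely an illustration of the idea, not the prototype linked further down; on failure the caller would retry with a different r, or grow b or n.

#include <stdint.h>
#include <stdlib.h>

/* Sketch: find displacements d[0..b-1] so that
 *     slot(x) = ((h(x) >> r) ^ d[h(x) & (b-1)]) & (n-1)
 * is collision-free for the n pre-hashed keys in `hashes`.
 * n and b must be powers of two.  Returns 0 on success, -1 on failure. */
static int find_displacements(const uint64_t *hashes, int n, int b,
                              unsigned r, uint16_t *d)
{
    char *used = calloc(n, 1);            /* table slots already taken      */
    int *gsize = calloc(b, sizeof(int));  /* number of keys per group       */
    int *order = malloc(b * sizeof(int)); /* group indices, largest first   */
    int *claim = malloc(n * sizeof(int)); /* slots claimed by current group */
    int i, j, ok = -1;

    if (!used || !gsize || !order || !claim)
        goto done;

    for (i = 0; i < n; i++)
        gsize[hashes[i] & (uint64_t)(b - 1)]++;
    for (i = 0; i < b; i++)
        order[i] = i;
    for (i = 0; i < b; i++)               /* simple selection sort by size  */
        for (j = i + 1; j < b; j++)
            if (gsize[order[j]] > gsize[order[i]]) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    for (i = 0; i < b; i++) {             /* largest group first            */
        int g = order[i];
        unsigned disp;
        if (gsize[g] == 0) { d[g] = 0; continue; }
        for (disp = 0; disp < 0x10000; disp++) {
            int nclaim = 0, collision = 0;
            for (j = 0; j < n && !collision; j++) {
                int k, slot;
                if ((int)(hashes[j] & (uint64_t)(b - 1)) != g)
                    continue;
                slot = (int)(((hashes[j] >> r) ^ disp) & (uint64_t)(n - 1));
                if (used[slot]) { collision = 1; break; }
                for (k = 0; k < nclaim; k++)
                    if (claim[k] == slot) { collision = 1; break; }
                if (!collision)
                    claim[nclaim++] = slot;
            }
            if (!collision) {             /* commit this group's slots      */
                for (j = 0; j < nclaim; j++)
                    used[claim[j]] = 1;
                d[g] = (uint16_t)disp;
                break;
            }
        }
        if (disp == 0x10000)
            goto done;                    /* no displacement fit; give up   */
    }
    ok = 0;
done:
    free(used); free(gsize); free(order); free(claim);
    return ok;
}
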
>>>>>>>
>>>>>>> I benchmarked these hash functions:
>>>>>>>
>>>>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>>>>
>>>>>>>
>>>>>>> Only the third one is truly in the spirit of the algorithm, but I think
>>>>>>> the first two should work well too (and when h is known at compile time,
>>>>>>> looking up d[h & 63] isn't harder than looking up r1 or m1).
>>>>>>>
>>>>>>>
>>>>>>> My computer is acting up and all my numbers today are slower than the
>>>>>>> earlier ones (yes, I disabled turbo-mode in the BIOS a year ago, and
>>>>>>> yes, I've pinned the CPU speed). But here are today's numbers, compiled
>>>>>>> with -DIMHASH:
>>>>>>>
>>>>>>> direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>>>>> index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>>>>> twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>>>>> displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>>>>> displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>>>>> displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>>>>
>>>>>>>
>>>>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>>>>
>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>>>
>>>>>>
>>>>>>
>>>>>> The paper obviously puts more effort into minimizing the table size
>>>>>> than into a fast lookup. My hunch is that our choice should be
>>>>>>
>>>>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>>>>
>>>>>>
>>>>>> and use an 8-bit d (because even if you have 1024 methods, you'd rather
>>>>>> double the number of bins than spend 2 extra bits per entry just to get
>>>>>> more displacement options).
>>>>>>
>>>>>> Then keep incrementing the size of d and the number of table slots (in
>>>>>> such an order that the total vtable size is minimized) until success. In
>>>>>> practice this should almost always just increase the size of d, and keep
>>>>>> the table size at the lowest 2**k that fits the slots (even for 64
>>>>>> methods or 128 methods :-))
>>>>>>
>>>>>> Essentially we avoid the shift in the argument to d[] by making d larger.
>>>>>
>>>>>
>>>>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>>>>
>>>>
>>>> Well, table->d[const & const] compiles down to the same kind of code as
>>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize.
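To make that concrete, the lookup for the displace2-style hash would compile down to something like the following (struct layout and field names are illustrative only, not a proposed ABI):

#include <stdint.h>

/* Illustrative layout only; the real table would also carry the keys
 * (interned strings or IDs) so the caller can verify a hit. */
typedef struct {
    uint64_t key;
    void *funcptr;
} hashvtable_entry_t;

typedef struct {
    unsigned r;                    /* shift                                */
    uint64_t m1;                   /* n - 1, table-size mask               */
    uint64_t m2;                   /* b - 1, displacement-table mask       */
    uint16_t *d;                   /* b displacement entries               */
    hashvtable_entry_t *entries;   /* n slots                              */
} hashvtable_t;

/* Branch-free: slot = ((h >> r) ^ d[h & m2]) & m1 */
static void *hashvtable_lookup(const hashvtable_t *t, uint64_t h)
{
    uint64_t slot = ((h >> t->r) ^ t->d[h & t->m2]) & t->m1;
    return t->entries[slot].funcptr;
}

The point is that the lookup path is only loads, shifts, xor and and; there is no branch.
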
>>>>
>>>>
>>>>> hopefully its size wouldn't be a big issue either. What kinds of
>>>>> densities were you achieving?
>>
>> OK, simulation results just in (for the displace2 hash), and they
>> exceeded my expectations.
>>
>> I always fill the table with n=2^k keys, and fix b = n (b means |d|).
>> Then the failure rates are (top two are 100,000 simulations, the rest
>> are 1000 simulations):
>>
>> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65
>> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65
>> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25
>> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29
>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22
>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37
>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26
>> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34
>>
>> Try-mean and try-max are how many r's needed to be tried before success,
>> so they give an indication of how much margin is left before failure.
>>
>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to
>> let b=2*n (100,000 simulations):
>>
>> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65
>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65
>>
>> NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit
>> displacements they mostly failed. So we either need to make each element
>> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which
>> succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit
>> displacements.
>>
>> The algorithm is rather fast and concise:
>>
>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>
>>>> The algorithm is designed for 100% density in the table itself. (We can
>>>> lift that to compensate for a small space of possible hash functions, I
>>>> guess.)
>>>>
>>>> I haven't done proper simulations yet, but I just tried |vtable|=128,
>>>> |d|=128 from the command line and I had 15 successes or so before the
>>>> first failure. That's with a 100% density in the vtable itself! (And when
>>>> it fails, you increase |d| to get your success.)
>>>>
>>>> The caveat is the space spent on d (it's small in comparison, but that's
>>>> why this isn't too good to be true).
>>>>
>>>> A disadvantage might be that we lose the option of a non-power-of-two
>>>> table size (i.e. of replacing the mask with "if (likely(slot < n))"). I
>>>> think for that to work one would need to replace the xor group with
>>>> addition on Z_d.
>>>>
>>>>
>>>>> Going back to the idea of linear probing on a cache miss, this has the
>>>>> advantage that one can write a brain-dead provider that sets m=0 and
>>>>> simply lists the methods instead of requiring a table optimizer. (Most
>>>>> tools, of course, would do the table optimization.) It also lets you
>>>>> get away with a "kind-of good" hash rather than requiring you to search
>>>>> until you find a (larger?) perfect one.
>>>>
>>>>
>>>> Well, given that we can have 100% density, and generating the table is
>>>> lightning fast, and the C code to generate the table is likely a 300-line
>>>> utility... I'm not convinced.
>>>
>>> It goes from an extraordinarily simple spec (the table is, at minimum, a
>>> func[2^k] with a couple of extra zero fields, whose struct can be
>>> statically defined in the source by hand) to something that is, well, not
>>> complicated in the absolute sense, but much more so than the definition
>>> above. It is also variable-size, which makes allocating it globally or on
>>> a stack a pain (I suppose one can choose an upper bound for |d| and
>>> |vtable|).
>>>
>>> I am playing devil's advocate a bit here; it's probably just a (minor)
>>> con, but worth noting at least.
>>
>> If you were willing to go the interning route, so that you didn't need
>> to fill the table with md5 hashes anyway, I'd say you'd have a stronger
>> point :-)
>>
>> Given the results above, static allocation can at least be solved in a
>> way that is probably user-friendly enough:
>>
>> PyHashVTable_16_16 mytable;
>>
>> ...init () {
>> mytable.functions = { ... };
>> if (PyHashVTable_Ready((PyHashVTable*)&mytable, 16, 16) == -1) return -1;
>> }
>>
>> Now, with chance ~1/1000, you're going to get an exception saying
>> "Please try PyHashVTable_16_32". (And since that's deterministic given
>> the function definitions you always catch it at once.)
>
> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc.
> as well.


There's still the indirection through SEP 200 (extensibletype slots). We 
can get rid of that very easily by just making that table and the 
hash-vtable one and the same. (It could still have either interned-string 
keys or ID keys, distinguished by the least significant bit.)
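For illustration, the key tagging could be as simple as the following (this encoding is hypothetical, not part of any spec; it relies on interned string pointers being at least 2-byte aligned):

#include <stdint.h>

/* Hypothetical key encoding: bit 0 clear means "pointer to an interned
 * string", bit 0 set means "numeric ID shifted left by one". */
typedef uintptr_t hashvtable_key_t;

#define HASHVTABLE_KEY_FROM_INTERNED(p)  ((hashvtable_key_t)(const void *)(p))
#define HASHVTABLE_KEY_FROM_ID(id)       ((((hashvtable_key_t)(id)) << 1) | 1u)
#define HASHVTABLE_KEY_IS_ID(k)          (((k) & 1u) != 0)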

To wrap up, I think this has grown in complexity beyond the "simple SEP 
spec". It's at the point where you don't really want several libraries 
implementing the same simple spec, but rather have them all share the same 
implementation.

But I think the advantages are simply too good to give up on.

So I think a viable route forward is to forget the 
CEP/SEP/pre-PEP-approach for now (which only works for semi-complicated 
ideas with simple implementations) and instead simply work more directly 
on a library. It would need to have a couple of different use modes:

  - A Python perfect-hasher for use when generating code, with only a 
string interner based on CPython dicts and the extensibletype metaclass as 
runtime dependencies (for use in Cython). This would only add a few hundred 
lines of source... (The kind of static table it would emit is sketched 
right after this list.)

  - A C implementation of the perfect hashing, exposed through a 
PyPerfectHashTable_Ready() (for use in libraries written in C, like 
NumPy/SciPy). This would need to bundle the md5 algorithm and a C 
implementation of the table construction.
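
As an illustration of the first of these modes: the Python-side hasher would run the perfect-hash search at code-generation time and emit the finished table as static C data, roughly along these lines (layout, names, and the sample parameter values are purely illustrative, not a proposed format):

#include <stdint.h>

/* Illustrative generator output: the shift, masks, displacements and the
 * perfect-hash ordering of the function slots are all compile-time data. */
static uint16_t __pyx_vtab_d[16] = { 0 /* displacements computed by the generator */ };

static struct {
    unsigned r;          /* shift chosen by the generator                */
    uint64_t m1, m2;     /* n - 1 and b - 1                              */
    uint16_t *d;
    void *entries[16];   /* function pointers, reordered into slot order */
} __pyx_hashvtable = {
    /* r */ 13, /* m1 */ 15, /* m2 */ 15, __pyx_vtab_d,
    { 0 /* filled with the method pointers, in slot order */ }
};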

And on the distribution axis:

  - A small C header-style implementation of a string interner and the 
extensibletype metaclass, rendezvousing through sys.modules.

  - As part of the rendezvous, one would always try to __import__ the 
*real* run-time library. So if it is available in sys.path it overrides 
anything bundled with other libraries. That would provide a way forward 
for GIL-less string interning, a Python-side library for working with 
these tables and inspecting them, etc.
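
A rough sketch of how that rendezvous could look from the bundled C side, using only standard CPython import APIs (both module names are invented for illustration):

#include <Python.h>

/* Prefer the *real* runtime library if it is importable; otherwise fall back
 * to a rendezvous module that whichever bundled copy got there first has put
 * into sys.modules. */
static PyObject *get_shared_table_runtime(void)
{
    PyObject *mod = PyImport_ImportModule("hashvtable_runtime");
    if (mod != NULL)
        return mod;                                  /* real library wins */
    PyErr_Clear();
    mod = PyImport_AddModule("_hashvtable_bundled"); /* borrowed ref; created if absent */
    Py_XINCREF(mod);
    return mod;
}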

Time to stop talking and start coding...

Dag

