[Cython] Hash-based vtables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat Jun 9 07:45:55 CEST 2012


On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no> wrote:
>> On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote:
>>>
>>> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
>>>>
>>>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
>>>>>
>>>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn
>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>
>>>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn
>>>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite
>>>>>>>>>> interesting. Though the resulting hash functions are supposedly
>>>>>>>>>> cheap,
>>>>>>>>>> I have the feeling that branching is considered cheap in this
>>>>>>>>>> context.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Actually, this lead was *very* promising. I believe the very first
>>>>>>>>> reference I actually read through (rather than eliminating after the
>>>>>>>>> abstract) has totally swept away our home-grown solutions!
>>>>>>>>>
>>>>>>>>> "Hash&  Displace" by Pagh (1999) is actually very simple, easy to
>>>>>>>>>
>>>>>>>>> understand, and fast both for generation and (the branch-free)
>>>>>>>>> lookup:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The idea is:
>>>>>>>>>
>>>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper
>>>>>>>>> requires b > 2n, though I think in practice you can often get away
>>>>>>>>> with less)
>>>>>>>>>
>>>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is
>>>>>>>>> easily achieved since each group only has a few elements)
>>>>>>>>>
>>>>>>>>> - For each group, from largest to smallest: Find a displacement
>>>>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions.
>>>>>>>>>
>>>>>>>>> It requires extra storage for the displacement table. However, I think
>>>>>>>>> 8 bits per element might suffice even for vtables of 512 or 1024 in
>>>>>>>>> size. Even with 16 bits it's rather negligible compared to the
>>>>>>>>> minimum-128-bit entries of the table.
>>>>>>>>>
>>>>>>>>> I benchmarked these hash functions:
>>>>>>>>>
>>>>>>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>>>>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>>>>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Only the third one is truly in the spirit of the algorithm, but I
>>>>>>>>> think the first two should work well too (and when h is known at
>>>>>>>>> compile time, looking up d[h & 63] isn't harder than looking up r1
>>>>>>>>> or m1).
>>>>>>>>>
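
For concreteness, this is roughly what the displace2 variant looks like at a
call site in C. The struct layout and names below are purely illustrative,
not what we'd actually generate:

    #include <stdint.h>

    /* Illustrative only -- field names, sizes and layout are not decided. */
    typedef struct hashvtable_s {
        uint64_t  r;            /* shift found by the table generator        */
        uint64_t  m1;           /* mask for the function table (slots - 1)   */
        uint64_t  m2;           /* mask for the displacement table (|d| - 1) */
        uint16_t  d[64];        /* displacements; size varies per table      */
        void     *funcs[64];    /* the vtable entries themselves             */
    } hashvtable_t;

    /* Branch-free displace2 lookup; h is the precomputed 64-bit key hash. */
    static void *hashvtable_lookup(const hashvtable_t *t, uint64_t h)
    {
        uint64_t slot = ((h >> t->r) ^ t->d[h & t->m2]) & t->m1;
        return t->funcs[slot];
    }

When h is a compile-time constant, the d[h & m2] load is just another load at
a fixed offset into the table, which fits with displace2 benchmarking the same
as twoshift below.
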
>>>>>>>>>
>>>>>>>>> My computer is acting up and all my numbers today are slower than the
>>>>>>>>> earlier ones (yes, I disabled turbo-mode in the BIOS a year ago, and
>>>>>>>>> yes, I've pinned the CPU speed). But here's today's numbers, compiled
>>>>>>>>> with -DIMHASH:
>>>>>>>>>
>>>>>>>>> direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>>>>>>> index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>>>>>>> twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>>>>>>> displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>>>>>>> displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>>>>>>> displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>>>>>>
>>>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The paper obviously puts more effort into minimizing table size than
>>>>>>>> into a fast lookup. My hunch is that our choice should be
>>>>>>>>
>>>>>>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>>>>>>
>>>>>>>>
>>>>>>>> and use an 8-bit d (because even if you have 1024 methods, you'd rather
>>>>>>>> double the number of bins than pay for the 2 extra bits of displacement
>>>>>>>> options).
>>>>>>>>
>>>>>>>> Then keep incrementing the size of d and the number of table slots (in
>>>>>>>> such an order that the total vtable size is minimized) until success.
>>>>>>>> In practice this should almost always just increase the size of d, and
>>>>>>>> keep the table size at the lowest 2**k that fits the slots (even for 64
>>>>>>>> or 128 methods :-))
>>>>>>>>
>>>>>>>> Essentially we avoid the shift in the argument to d[] by making d
>>>>>>>> larger.
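
In code, that retry strategy is just a small driver around the search routine.
A sketch with illustrative names; find_displacements() is itself sketched
further down, and a real version would also weigh growing the slot count as
described above:

    #include <stdint.h>

    static int find_displacements(const uint64_t *hash, int nkeys,
                                  int n, int b, uint16_t *d,
                                  int *slot_out, int *r_out);  /* see below */

    /* Try the smallest table first, then grow |d| until the search succeeds. */
    static int build_table(const uint64_t *hash, int nkeys, uint16_t *d,
                           int *slots, int *n_out, int *b_out, int *r_out)
    {
        int n, b;

        /* smallest power of two that holds all the methods */
        for (n = 1; n < nkeys; n *= 2)
            ;
        for (b = n; b <= 16 * n; b *= 2)
            if (find_displacements(hash, nkeys, n, b, d, slots, r_out) == 0) {
                *n_out = n;
                *b_out = b;
                return 0;
            }
        return -1;   /* caller could double n and start over */
    }
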
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>>>>>>
>>>>>>
>>>>>>
>>>>>> Well, table->d[const & const] compiles down to the same kind of code as
>>>>>> table->m1. I guess I'm surprised too that displace2 carries no penalty.
>>>>>>
>>>>>>
>>>>>>> hopefully its size wouldn't be a big issue either. What kinds of
>>>>>>> densities were you achieving?
>>>>
>>>>
>>>> OK, simulation results just in (for the displace2 hash), and they
>>>> exceeded my expectations.
>>>>
>>>> I always fill the table with n=2^k keys, and fix b = n (b means |d|).
>>>> Then the failure rates are (top two are 100,000 simulations, the rest
>>>> are 1000 simulations):
>>>>
>>>> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65
>>>> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65
>>>> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25
>>>> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29
>>>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22
>>>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37
>>>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26
>>>> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34
>>>>
>>>> Try-mean and try-max are how many r's needed to be tried before success,
>>>> so they give an indication of how much margin is left before failure.
>>>>
>>>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to
>>>> let b=2*n (100,000 simulations):
>>>>
>>>> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65
>>>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65
>>>>
>>>> NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit
>>>> displacements they mostly failed. So we either need to make each element
>>>> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which
>>>> succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit
>>>> displacements.
>>>>
>>>> The algorithm is rather fast and concise:
>>>>
>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>
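
For reference, the displacement search itself would translate to C along these
lines. This is a rough, untested sketch with illustrative names and fixed
bounds; the Python prototype above is the authoritative version:

    #include <stdint.h>
    #include <string.h>

    #define MAX_N 1024   /* sketch-only upper bounds on slots and |d| */
    #define MAX_B 1024

    /* Pagh (1999) hash-and-displace in the displace2 flavour:
     * g(x) = h & m2, f(x) = (h >> r) & m1, final slot = (f(x) ^ d[g(x)]) & m1.
     * n and b are powers of two, nkeys <= n <= MAX_N, b <= MAX_B.
     * Returns 0 on success (filling d[], slot_out[] and *r_out), -1 on
     * failure (the caller then grows b and/or n and tries again). */
    static int find_displacements(const uint64_t *hash, int nkeys,
                                  int n, int b, uint16_t *d,
                                  int *slot_out, int *r_out)
    {
        const uint64_t m1 = (uint64_t)(n - 1), m2 = (uint64_t)(b - 1);
        int size[MAX_B], order[MAX_B], key_idx[MAX_N];
        uint64_t fval[MAX_N];
        char occupied[MAX_N];
        int r, g, i, j, nb;

        for (r = 1; r < 64; r++) {
            /* bucket sizes, then the non-empty buckets ordered largest first */
            memset(size, 0, sizeof(int) * (size_t)b);
            for (i = 0; i < nkeys; i++)
                size[hash[i] & m2]++;
            nb = 0;
            for (g = 0; g < b; g++)
                if (size[g] > 0)
                    order[nb++] = g;
            for (i = 1; i < nb; i++)      /* insertion sort, largest first */
                for (j = i; j > 0 && size[order[j]] > size[order[j - 1]]; j--) {
                    int t = order[j]; order[j] = order[j - 1]; order[j - 1] = t;
                }

            memset(occupied, 0, (size_t)n);
            memset(d, 0, sizeof(uint16_t) * (size_t)b);
            for (i = 0; i < nb; i++) {
                int cnt = 0, dup = 0, placed = 0, disp;
                g = order[i];
                /* collect this bucket's f-values; check f is 1:1 inside it */
                for (j = 0; j < nkeys; j++)
                    if ((int)(hash[j] & m2) == g) {
                        fval[cnt] = (hash[j] >> r) & m1;
                        key_idx[cnt++] = j;
                    }
                for (j = 1; j < cnt; j++)
                    for (int k = 0; k < j; k++)
                        if (fval[j] == fval[k])
                            dup = 1;
                if (dup)
                    break;                /* f not 1:1 in the bucket: next r */
                /* find a displacement putting every key in a free slot */
                for (disp = 0; disp < n && !placed; disp++) {
                    for (j = 0; j < cnt; j++)
                        if (occupied[(fval[j] ^ (uint64_t)disp) & m1])
                            break;
                    if (j == cnt)
                        placed = 1;
                }
                if (!placed)
                    break;                /* no displacement worked: next r */
                disp--;                   /* undo the loop's final increment */
                d[g] = (uint16_t)disp;
                for (j = 0; j < cnt; j++) {
                    int slot = (int)((fval[j] ^ (uint64_t)disp) & m1);
                    occupied[slot] = 1;
                    slot_out[key_idx[j]] = slot;
                }
            }
            if (i == nb) {                /* every bucket was placed */
                *r_out = r;
                return 0;
            }
        }
        return -1;
    }
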
>>>>>> The algorithm is designed for 100% density in the table itself. (We
>>>>>> can lift
>>>>>> that to compensate for a small space of possible hash functions I
>>>>>> guess.)
>>>>>>
>>>>>> I haven't done proper simulations yet, but I just tried |vtable|=128,
>>>>>> |d|=128 from the command line and I had 15 successes or so before the
>>>>>> first
>>>>>> failure. That's with a 100% density in the vtable itself! (And when it
>>>>>> fails, you increase |d| to get your success).
>>>>>>
>>>>>> The caveat is the space spent on d (it's small in comparison, but
>>>>>> that's why
>>>>>> this isn't too good to be true).
>>>>>>
>>>>>> A disadvantage might be that we lose the opportunity to use a
>>>>>> non-power-of-two table size (i.e. replace the mask with "if
>>>>>> (likely(slot < n))"). I think for that to work one would need to
>>>>>> replace the xor group with addition on Z_d.
>>>>>>
>>>>>>
>>>>>>> Going back to the idea of linear probing on a cache miss, this has the
>>>>>>> advantage that one can write a brain-dead provider that sets m=0 and
>>>>>>> simply lists the methods instead of requiring a table optimizer. (Most
>>>>>>> tools, of course, would do the table optimization.) It also lets you
>>>>>> get away with a "kind-of good" hash rather than requiring you to search
>>>>>>> until you find a (larger?) perfect one.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Well, given that we can have 100% density, and generating the table is
>>>>>> lightning fast, and the C code to generate the table is likely a
>>>>>> 300-line utility... I'm not convinced.
>>>>>
>>>>>
>>>>> It goes from an extraordinarily simple spec (the table is, at minimum, a
>>>>> func[2^k] with a couple of extra zero fields, whose struct can be
>>>>> statically defined in the source by hand) to a, well, not complicated
>>>>> in the absolute sense, but much more so than the definition above. It
>>>>> is also variable-size, which makes allocating it globally/on the stack a
>>>>> pain (I suppose one can choose an upper bound for |d| and |vtable|).
>>>>>
>>>>> I am playing devil's advocate a bit here; it's probably just a (minor)
>>>>> con, but worth noting at least.
>>>>
>>>>
>>>> If you were willing to go the interning route, so that you didn't need
>>>> to fill the table with md5 hashes anyway, I'd say you'd have a stronger
>>>> point :-)
>>>>
>>>> Given the results above, static allocation can at least be solved in a
>>>> way that is probably user-friendly enough:
>>>>
>>>> PyHashVTable_16_16 mytable;
>>>>
>>>> ...init () {
>>>>     mytable.functions = { ... };
>>>>     if (PyHashVTable_Ready((PyHashVTable*)&mytable, 16, 16) == -1) return -1;
>>>> }
>>>>
>>>> Now, with a ~1/1000 chance, you're going to get an exception saying
>>>> "Please try PyHashVTable_16_32". (And since that's deterministic given
>>>> the function definitions, you always catch it at once.)
>>>
>>>
>>> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc.
>>> as well.
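
Under the hood such a fixed-size variant would presumably just be the generic
table with the arrays given static sizes; hypothetically something like:

    #include <stdint.h>

    /* Hypothetical layout; the generic PyHashVTable would be the same thing
     * with per-table array sizes.  How the pre-hash keys get filled in before
     * _Ready() is glossed over here. */
    typedef struct {
        uint64_t  r, m1, m2;       /* shift and masks found by _Ready()      */
        uint16_t  d[16];           /* displacement table, |d| = 16           */
        uint64_t  keys[16];        /* interned key / ID per slot, for checks */
        void     *functions[16];   /* 16 slots, 100% densely packed          */
    } PyHashVTable_16_16;
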
>>
>>
>>
>> There's still the indirection through SEP 200 (extensibletype slots). We can
>> get rid of that very easily by just making that table and the hash-vtable
>> one and the same. (It could still either have interned string keys or ID
>> keys depending on the least significant bit.)
>
> Or we can even forgo the interning for this table, and give up on
> partitioning the space numerically and just use the dns-style
> prefixing, e.g. "org.cython.X" belongs to us.

Huh? Isn't that when you *need* interning? Do you plan on key-encoding
those kinds of strings into 64 bits?

(I think it would usually be "method:foo:ii->d" (or my current 
preference is "method:foo:i4i8->f8"))

Partitioning the space numerically, you'd just hash the number; "SEP 260:
We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
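
To be explicit about the least-significant-bit idea mentioned above, a
hypothetical key convention could look like this (purely illustrative, not a
proposal for the exact bit layout):

    #include <stdint.h>

    /* Interned strings are aligned char pointers, so their low bit is 0;
     * numerically partitioned IDs (or the lower-64 md5 of a longer key)
     * get the low bit forced to 1 to keep the two kinds apart. */
    typedef uint64_t table_key_t;

    static table_key_t key_from_interned(const char *interned_sig)
    {
        /* interned_sig would be the canonical pointer for something like
           "method:foo:i4i8->f8" */
        return (table_key_t)(uintptr_t)interned_sig;
    }

    static table_key_t key_from_id(uint64_t lower64_md5_of_id)
    {
        return lower64_md5_of_id | 1;
    }
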

> There is value in the double indirection if this (or any of the other)
> lookup tables are meant to be modified over time.

This isn't impossible with a hash table either. You just need to
reallocate a little more often than would be the case with a regular
hash table, but not dramatically so (you need to rehash whenever the
element to insert hashes into a "large" bin, and such bins are rather few).

I want the table to be reached through a pointer, so that you can
atomically swap it out.
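
Something like this (a sketch; it assumes the writer builds the replacement
table completely before publishing it, and that either the GIL or an atomic
store/load pair protects the pointer itself):

    typedef struct hashvtable_s hashvtable_t;   /* the table sketched earlier */

    static hashvtable_t *current_table;         /* what callers dereference */

    static void publish_table(hashvtable_t *new_table)
    {
        hashvtable_t *old = current_table;
        current_table = new_table;              /* the swap itself */
        /* `old` must stay alive until no reader can still be using it,
           e.g. it is freed at the next safe point / under the GIL */
        (void)old;
    }
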

>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>> spec". It's at the point where you don't really want to have several
>> libraries implementing the same simple spec, but instead use the same
>> implementation.
>>
>> But I think the advantages are simply too good to give up on.
>>
>> So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach
>> for now (which only works for semi-complicated ideas with simple
>> implementations) and instead simply work more directly on a library. It
>> would need to have a couple of different use modes:
>
> I prefer an enhancement proposal with a spec over a library, even if
> only a single library gets used in practice. I still think it's simple
> enough. Basically, we have the "lookup spec" and then a CEP for
> applying this to fast callable (agreeing on signatures, and what to do
> with extern types) and extensible type slots.

OK.

>
>>   - A Python perfect-hasher for use when generating code, with only a
>> string interner based on CPython dicts and the extensibletype metaclass as
>> runtime dependencies (for use in Cython). This would only add some hundred
>> source lines...
>>
>>   - A C implementation of the perfect hashing exposed through a
>> PyPerfectHashTable_Ready() (for use in libraries written in C, like
>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>> implementation of the perfect hashing.
>>
>> And on the distribution axis:
>>
>>   - Small C header-style implementation of a string interner and the
>> extensibletype metaclass, rendezvousing through sys.modules
>>
>>   - As part of the rendezvous, one would always try to __import__ the *real*
>> run-time library. So if it is available in sys.path it overrides anything
>> bundled with other libraries. That would provide a way forward for GIL-less
>> string interning, a Python-side library for working with these tables and
>> inspecting them, etc.
>
> Hmm, that's an interesting idea. I think we don't actually need
> interning, in which case the "full" library is only needed for
> introspection.

You don't believe the security concern is real then? Or do you want to 
pay the cost for a 160-bit SHA1 compare everywhere?

I'd love to not do interning, but I see no way around it.

BTW, a GIL-less interning library isn't rocket science. I ran khash.h 
through a preprocessor with

KHASH_MAP_INIT_STR(str_to_entry, entry_t)

and the result is 180 lines of code for the hash table. Then pythread.h
provides the thread lock, and another 50 lines cover the interning logic
(intern_literal, intern_heap_allocated, release_interned).
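
Roughly like this (a sketch of the shape of it, not the exact code; the khash
and pythread.h calls are the real APIs, the rest is simplified):

    #include <string.h>
    #include "Python.h"
    #include "pythread.h"          /* PyThread_type_lock and friends */
    #include "khash.h"

    typedef struct { int refcount; } entry_t;

    KHASH_MAP_INIT_STR(str_to_entry, entry_t)

    static khash_t(str_to_entry) *interned_table;
    static PyThread_type_lock interned_lock;

    static int interning_init(void)
    {
        interned_table = kh_init(str_to_entry);
        interned_lock = PyThread_allocate_lock();
        return (interned_table && interned_lock) ? 0 : -1;
    }

    /* Return the canonical pointer for s; equal strings always come back
     * as the same pointer, so key comparison is just pointer comparison.
     * (Error handling simplified.) */
    static const char *intern_heap_allocated(const char *s)
    {
        const char *result;
        int absent;
        khiter_t it;

        PyThread_acquire_lock(interned_lock, WAIT_LOCK);
        it = kh_put(str_to_entry, interned_table, s, &absent);
        if (absent) {
            char *copy = strdup(s);          /* table owns its own copy */
            kh_key(interned_table, it) = copy;
            kh_value(interned_table, it).refcount = 0;
        }
        kh_value(interned_table, it).refcount++;
        result = kh_key(interned_table, it);
        PyThread_release_lock(interned_lock);
        return result;
    }

    /* intern_literal() can skip the copy, and release_interned() decrements
       the refcount and removes the entry when it hits zero. */
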

It just seems a little redundant to ship such a thing in every 
Cython-generated file since we hold the GIL during module loading anyway.

Dag

