[Cython] Hash-based vtables

Robert Bradshaw robertwb at gmail.com
Sat Jun 9 03:21:23 CEST 2012


On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote:
>>
>> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
>>>
>>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
>>>>
>>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn
>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>
>>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn
>>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite
>>>>>>>>> interesting. Though the resulting hash functions are supposedly
>>>>>>>>> cheap, I have the feeling that branching is considered cheap in
>>>>>>>>> this context.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually, this lead was *very* promising. I believe this is the very
>>>>>>>> first reference I actually read through (rather than eliminating it
>>>>>>>> after the abstract), and it totally swept away our home-grown
>>>>>>>> solutions!
>>>>>>>>
>>>>>>>> "Hash& Displace" by Pagh (1999) is actually very simple, easy to
>>>>>>>>
>>>>>>>> understand, and fast both for generation and (the branch-free)
>>>>>>>> lookup:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The idea is:
>>>>>>>>
>>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the
>>>>>>>> paper requires b > 2n, though I think in practice you can often
>>>>>>>> get away with less)
>>>>>>>>
>>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is
>>>>>>>> easily achieved since each group only has a few elements)
>>>>>>>>
>>>>>>>> - For each group, from largest to smallest: Find a displacement
>>>>>>>> `d[group]` so that `f(x) ^ d` doesn't cause collisions.
>>>>>>>>
>>>>>>>> It requires extra storage for the displacement table. However, I
>>>>>>>> think 8 bits per element might suffice even for vtables of 512 or
>>>>>>>> 1024 in size. Even with 16 bits it's rather negligible compared to
>>>>>>>> the minimum-128-bit entries of the table.
>>>>>>>>
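
For concreteness, here is a rough C sketch of the construction described
above, specialized to the displace2 scheme benchmarked just below (all
names here are made up; the working prototype is the pagh99.py linked
further down). It assumes n <= NSLOTS and that the caller retries with a
different shift r whenever -1 is returned:

#include <stdint.h>

#define NSLOTS  64   /* number of vtable slots, a power of two */
#define NGROUPS 64   /* |d|, also a power of two               */

/* slot(h) = ((h >> r) ^ d[h & (NGROUPS-1)]) & (NSLOTS-1), i.e.
   g(h) = h & (NGROUPS-1) picks the group and d[g] displaces it. */
static int hashvtable_try_build(const uint64_t *hashes, int n,
                                unsigned r, uint16_t *d)
{
    char occupied[NSLOTS] = {0};
    int group_size[NGROUPS] = {0};
    int i, j, k, g, pass;

    for (g = 0; g < NGROUPS; g++)
        d[g] = 0;
    for (i = 0; i < n; i++)
        group_size[hashes[i] & (NGROUPS - 1)]++;

    /* place the largest groups first, the singletons last */
    for (pass = n; pass >= 1; pass--) {
        for (g = 0; g < NGROUPS; g++) {
            uint64_t member[NSLOTS];
            int nmember = 0, disp;
            if (group_size[g] != pass)
                continue;
            for (i = 0; i < n; i++)
                if ((int)(hashes[i] & (NGROUPS - 1)) == g)
                    member[nmember++] = hashes[i];
            /* find a displacement that avoids occupied slots and keeps
               the group's own members apart (f must be 1:1 in a group) */
            for (disp = 0; disp < NSLOTS; disp++) {
                int ok = 1;
                for (j = 0; j < nmember && ok; j++) {
                    int s = ((member[j] >> r) ^ disp) & (NSLOTS - 1);
                    if (occupied[s])
                        ok = 0;
                    for (k = 0; k < j && ok; k++)
                        if (s == (int)(((member[k] >> r) ^ disp) & (NSLOTS - 1)))
                            ok = 0;
                }
                if (ok)
                    break;
            }
            if (disp == NSLOTS)
                return -1;  /* no displacement worked; try another r */
            d[g] = (uint16_t)disp;
            for (j = 0; j < nmember; j++)
                occupied[((member[j] >> r) ^ disp) & (NSLOTS - 1)] = 1;
        }
    }
    return 0;
}
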
>>>>>>>> I benchmarked these hash functions:
>>>>>>>>
>>>>>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>>>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>>>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>>>>>
>>>>>>>>
>>>>>>>> Only the third one is truly in the spirit of the algorithm, but I
>>>>>>>> think the first two should work well too (and when h is known at
>>>>>>>> compile time, looking up d[h & 63] isn't harder than looking up r1
>>>>>>>> or m1).
>>>>>>>>
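
For reference, the lookup side of displace2 is essentially the following
(the struct layout and field names are just an illustration):

#include <stdint.h>

typedef struct {
    uintptr_t  r;        /* per-table shift                       */
    uintptr_t  m1;       /* slot mask: number of slots minus one  */
    uintptr_t  m2;       /* displacement mask: |d| minus one      */
    uint16_t  *d;        /* displacement table                    */
    void     **entries;  /* the vtable slots themselves           */
} hashvtable_t;

/* branch-free: one extra load of d plus a shift, an xor and a mask */
static void *hashvtable_lookup(const hashvtable_t *table, uint64_t h)
{
    return table->entries[((h >> table->r) ^ table->d[h & table->m2])
                          & table->m1];
}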
>>>>>>>>
>>>>>>>> My computer is acting up and all my numbers today are slower than
>>>>>>>> the earlier ones (yes, I disabled turbo-mode in the BIOS a year ago,
>>>>>>>> and yes, I've pinned the CPU speed). But here are today's numbers,
>>>>>>>> compiled with -DIMHASH:
>>>>>>>>
>>>>>>>> direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>>>>>> index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>>>>>> twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>>>>>> displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>>>>>> displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>>>>>> displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>>>>>
>>>>>>>>
>>>>>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>>>>>
>>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The paper obviously puts more effort into minimizing table size
>>>>>>> than into a fast lookup. My hunch is that our choice should be
>>>>>>>
>>>>>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>>>>>
>>>>>>>
>>>>>>> and use an 8-bit d (because even if you have 1024 methods, you'd
>>>>>>> rather double the number of bins than have those 2 extra bits
>>>>>>> available for displacement options).
>>>>>>>
>>>>>>> Then keep incrementing the size of d and the number of table slots
>>>>>>> (in such an order that the total vtable size is minimized) until
>>>>>>> success. In practice this should almost always just increase the
>>>>>>> size of d, and keep the table size at the lowest 2**k that fits the
>>>>>>> slots (even for 64 methods or 128 methods :-))
>>>>>>>
>>>>>>> Essentially we avoid the shift in the argument to d[] by making d
>>>>>>> larger.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>>>>>
>>>>>
>>>>>
>>>>> Well, table->d[const & const] compiles down to the same kind of code as
>>>>> table->m1. I guess I'm surprised too that displace2 doesn't penalize.
>>>>>
>>>>>
>>>>>> hopefully its size wouldn't be a big issue either. What kinds of
>>>>>> densities were you achieving?
>>>
>>>
>>> OK, simulation results just in (for the displace2 hash), and they
>>> exceeded my expectations.
>>>
>>> I always fill the table with n=2^k keys, and fix b = n (b means |d|).
>>> Then the failure rates are (top two are 100,000 simulations, the rest
>>> are 1000 simulations):
>>>
>>> n= 8 b= 8 failure-rate=0.0019 try-mean=4.40 try-max=65
>>> n= 16 b= 16 failure-rate=0.0008 try-mean=5.02 try-max=65
>>> n= 32 b= 32 failure-rate=0.0000 try-mean=5.67 try-max=25
>>> n= 64 b= 64 failure-rate=0.0000 try-mean=6.60 try-max=29
>>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64 try-max=22
>>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66 try-max=37
>>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57 try-max=26
>>> n=1024 b= 1024 failure-rate=0.0000 try-mean=10.66 try-max=34
>>>
>>> Try-mean and try-max are how many r's needed to be tried before success,
>>> so they give an indication of how much margin is left before failure.
>>>
>>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to
>>> let b=2*n (100,000 simulations):
>>>
>>> n= 8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65
>>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65
>>>
>>> NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit
>>> displacements they mostly failed. So we either need to make each element
>>> of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which
>>> succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit
>>> displacements.
>>>
>>> The algorithm is rather fast and concise:
>>>
>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>
>>>>> The algorithm is designed for 100% density in the table itself. (We
>>>>> can lift that to compensate for a small space of possible hash
>>>>> functions, I guess.)
>>>>>
>>>>> I haven't done proper simulations yet, but I just tried |vtable|=128,
>>>>> |d|=128 from the command line and I had 15 successes or so before the
>>>>> first failure. That's with a 100% density in the vtable itself! (And
>>>>> when it fails, you increase |d| to get your success.)
>>>>>
>>>>> The caveat is the space spent on d (it's small in comparison, but
>>>>> that's why this isn't too good to be true).
>>>>>
>>>>> A disadvantage might be that we may no longer have the option of
>>>>> making the table size something other than a power of two (i.e.
>>>>> replacing the mask with "if (likely(slot < n))"). I think for that to
>>>>> work one would need to replace the xor group with addition on Z_d.
>>>>>
>>>>>
>>>>>> Going back to the idea of linear probing on a cache miss, this has the
>>>>>> advantage that one can write a brain-dead provider that sets m=0 and
>>>>>> simply lists the methods instead of requiring a table optimizer. (Most
>>>>>> tools, of course, would do the table optimization.) It also lets you
>>>>>> get away with a "kind-of good" hash rather than requiring you to
>>>>>> search until you find a (larger?) perfect one.
>>>>>
>>>>>
>>>>>
>>>>> Well, given that we can have 100% density, and generating the table is
>>>>> lightning fast, and the C code to generate the table is likely a
>>>>> 300-line utility... I'm not convinced.
>>>>
>>>>
>>>> It goes from an extraordinarily simple spec (the table is, at minimum,
>>>> a func[2^k] with a couple of extra zero fields, whose struct can be
>>>> statically defined in the source by hand) to a, well, not complicated
>>>> in the absolute sense, but much more so than the definition above. It
>>>> is also variable-size, which makes allocating it globally/on a stack a
>>>> pain (I suppose one can choose an upper bound for |d| and |vtable|).
>>>>
>>>> I am playing devil's advocate a bit here; it's probably just a (minor)
>>>> con, but worth noting at least.
>>>
>>>
>>> If you were willing to go the interning route, so that you didn't need
>>> to fill the table with md5 hashes anyway, I'd say you'd have a stronger
>>> point :-)
>>>
>>> Given the results above, static allocation can at least be solved in a
>>> way that is probably user-friendly enough:
>>>
>>> PyHashVTable_16_16 mytable;
>>>
>>> ...init () {
>>>     mytable.functions = { ... };
>>>     if (PyHashVTable_Ready((PyHashVTable*)&mytable, 16, 16) == -1)
>>>         return -1;
>>> }
>>>
>>> Now, with chance ~1/1000, you're going to get an exception saying
>>> "Please try PyHashVTable_16_32". (And since that's deterministic given
>>> the function definitions you always catch it at once.)
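
For what it's worth, here is one guess (purely illustrative, field names
invented) at what such a fixed-size PyHashVTable_16_16 could expand to,
which would also take care of the static-allocation concern:

#include <stdint.h>

typedef struct {
    uintptr_t r, m1, m2;      /* shift and masks for the displace2 lookup    */
    uint16_t  d[16];          /* 16-entry displacement table                 */
    uint64_t  keys[16];       /* (truncated) md5 of each signature, per slot */
    void     *functions[16];  /* the 16 entries, reordered by _Ready         */
} PyHashVTable_16_16;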
>>
>>
>> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc.
>> as well.
>
>
>
> There's still the indirection through SEP 200 (extensibletype slots). We can
> get rid of that very easily by just making that table and the hash-vtable
> one and the same. (It could still either have interned string keys or ID
> keys depending on the least significant bit.)

Or we can even forgo the interning for this table, give up on
partitioning the space numerically, and just use DNS-style prefixing,
e.g. "org.cython.X" belongs to us.

There is value in the double indirection if this (or any of the other)
lookup tables are meant to be modified over time.

> To wrap up, I think this has grown in complexity beyond the "simple SEP
> spec". It's at the point where you don't really want to have several
> libraries implementing the same simple spec, but instead use the same
> implementation.
>
> But I think the advantages are simply too good to give up on.
>
> So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach
> for now (which only works for semi-complicated ideas with simple
> implementations) and instead simply work more directly on a library. It
> would need to have a couple of different use modes:

I prefer an enhancement proposal with a spec over a library, even if
only a single library gets used in practice. I still think it's simple
enough. Basically, we have the "lookup spec" and then a CEP for
applying this to fast callable (agreeing on signatures, and what to do
with extern types) and extensible type slots.

>  - A Python perfect-hasher for use when generating code, with only a
> string interner based on CPython dicts and the extensibletype metaclass as
> runtime dependencies (for use in Cython). This would only add a few
> hundred source lines...
>
>  - A C implementation of the perfect hashing exposed through a
> PyPerfectHashTable_Ready(), for use in libraries written in C (like
> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
> implementation of the perfect hashing.
>
> And on the distribution axis:
>
>  - Small C header-style implementation of a string interner and the
> extensibletype metaclass, rendezvousing through sys.modules
>
>  - As part of the rendezvous, one would always try to __import__ the *real*
> run-time library. So if it is available in sys.path it overrides anything
> bundled with other libraries. That would provide a way forward for GIL-less
> string interning, a Python-side library for working with these tables and
> inspecting them, etc.

Hmm, that's an interesting idea. I think we don't actually need
interning, in which case the "full" library is only needed for
introspection.
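
A sketch of how the rendezvous itself could look (the module names and
the bundled fallback below are invented):

#include <Python.h>

/* Invented name for the small header-only fallback implementation
   that each participating library would bundle. */
static PyObject *bundled_minimal_runtime(void);

/* All participants agree on one key in sys.modules; whoever runs first
   either imports the real run-time library or installs its own bundled
   minimal implementation under that key. */
static PyObject *rendezvous_runtime(void)
{
    PyObject *modules = PyImport_GetModuleDict();   /* borrowed ref */
    PyObject *rt = PyDict_GetItemString(modules, "_hashvtable_runtime");
    if (rt != NULL) {
        Py_INCREF(rt);              /* somebody already registered it */
        return rt;
    }
    rt = PyImport_ImportModule("hashvtable_runtime");  /* the *real* library */
    if (rt == NULL) {
        PyErr_Clear();
        rt = bundled_minimal_runtime();
    }
    if (rt != NULL)
        PyDict_SetItemString(modules, "_hashvtable_runtime", rt);
    return rt;
}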

> Time to stop talking and start coding...
>
>
> Dag
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel

