[Python-ideas] Reducing collisions in small dicts/sets

Tim Peters tim.peters at gmail.com
Sun Jun 25 18:10:41 EDT 2017

Two bits of new info:  first, it's possible to get the performance of
"double" without division, at least via this way:

"""
# Double hashing using Fibonacci multiplication for the increment.  This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2).  So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product.  In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance:  I don't think Python dicts store `nbits`, just 2**nbits.
def dfib(h, nbits, M=M64):
mask = (1 << nbits) - 1
yield i
inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
while True:
i = (i + inc) & mask
yield i
"""

Second, the program I posted uses string objects as test cases.  The
current string hash acts like a good-quality pseudo-random number
generator:  change a character in the string, and the hash typically
changes "a lot".

It's also important to look at hashes with common patterns, because,
e.g., all "sufficiently small" integers are their _own_ hash codes
(e.g., hash(3) == 3).  Indeed, long ago Python changed its dict
implementation to avoid quadratic-time behavior for unfortunate sets
of real-life integer keys.

Example:  change `objgen()` to `yield i << 12` instead of `yield
str(i)`.  Then we generate integers all of whose final 12 bits are
zeroes.  For all sufficiently small tables, then, these ints _all_ map
to the same initial table slot (since the initial probe is taken from
the last `nbits` bits).  The collision resolution scheme is needed to
prevent disaster.

Here's a chunk of output from that, for dicts of size 2,730:

"""
bits 12 nslots 4,096 dlen 2,730 alpha 0.67 # built 37
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
prober linear
found min 1:0.04% max 2730 mean 1365.50
fail  min 2731:100.00% max 2731 mean 2731.00
found min 1:0.04% max 2730 mean 1365.50
fail  min 2731:100.00% max 2731 mean 2731.00
prober pre28201
found min 1:0.04% max 61 mean 6.17
fail  min 5:28.94% max 68 mean 9.78
prober current
found min 1:0.04% max 58 mean 5.17
fail  min 4:29.30% max 70 mean 8.62
prober double
found min 1:0.04% max 5 mean 2.73
fail  min 2:41.47% max 9 mean 3.94
prober dfib
found min 1:0.04% max 9 mean 2.53
fail  min 2:10.87% max 17 mean 4.48
prober uniform
found min 1:66.52% max 21 mean 1.65
fail  min 1:33.41% max 30 mean 2.99
"""

It's a worst-case disaster for linear and quadratic probing.  The
`current` scheme does fine, but `double` and `dfib` do significantly
better on all measures shown.  `uniform` is oblivious, still achieving
its theoretical average-case performance.

That last isn't surprising "in theory", since it theoretically picks a
probe sequence uniformly at random from the set of all table slot
permutations.  It's perhaps a bit surprising that this approximate
_implementation_ also achieves it.  But `random.seed(h)` does a
relatively enormous amount of work to spray the bits of `h` all over
the Twister's internal state, and then shuffle them around.

In any case, the "double hashing" variants ("double" and "dfib") look
very promising, doing better than the current scheme for both small
tables and pathological key sets.