[Python-ideas] Reducing collisions in small dicts/sets
Tim Peters
tim.peters at gmail.com
Mon Jun 26 14:21:03 EDT 2017
Some explanations and cautions. An advantage of sticking with pure
Python instead of C is that spare moments are devoted to investigating
the problem instead of thrashing with micro-optimizations ;-)
Why does the current scheme suffer for small tables? With hindsight
it's pretty obvious: it can visit table slots more than once (the
other schemes cannot), and while the odds of that happening are tiny
in large tables, they're high for small tables.
For example, in a 3-bit table (8 slots), suppose the initial index is
1, and there's a collision. The 5*i+1 recurrence _would_ visit slot 6
next, but we also shift in 5 new bits and add them. If those bits are
"random", there's a 1-in-8 chance that we'll visit index 1 _again_.
If we don't, and have another collision, shifting in the next 5 bits
has a 2-in-8 chance of repeating one of the two slots we already
visited. And so on. Here using the `str(i)` generator (which yields
random-ish hash codes):
"""
bits 3 nslots 8 dlen 5 alpha 0.62 # built 20,000
...
prober current
found min 1:74.80% max 15 mean 1.42
fail min 1:37.62% max 18 mean 2.66
"""
So even though only 5 slots are filled, in at least one case it took
15 probes to find an existing key, and 18 probes to realize a key was
missing.  In the other schemes, it takes at most 5 probes for success
and 6 probes for failure.
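To make the revisiting concrete, here's a tiny standalone sketch (it
duplicates the `current` prober from the attached script; the 4-probe
window and 10,000-key sample size are arbitrary choices):

    from itertools import islice

    def current(h, nbits):
        mask = (1 << nbits) - 1
        i = h & mask
        while True:
            yield i
            h >>= 5
            i = (5*i + h + 1) & mask

    M64 = (1 << 64) - 1
    repeats = sum(len(set(islice(current(hash(str(n)) & M64, 3), 4))) < 4
                  for n in range(10_000))
    print(f"{repeats / 10_000:.1%} of keys repeat a slot within their "
          f"first 4 probes of an 8-slot table")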
This is worse for 64-bit hash codes than for 32-bit ones, because we
can loop around twice as often before `perturb` permanently becomes 0
(at which point the pure 5*i+1 recurrence visits each slot at most
once).
The larger the table, the less likely a repeat due to `perturb`
becomes. For example, suppose we have an 8-bit table and again visit
index 1 first. We may visit any index in range(6, 6+32) next
(depending on the 5 fresh bits shifted in), but repeating 1 is _not_ a
possibility.
Why do the double hashing methods do better for the `i << 12`
generator? In any case where the hash codes have "a lot" of trailing
bits in common, they all map to the same table index at first, and the
probe sequence remains the same for all until the loop shifts
`perturb` far enough right that the rightmost differing bits finally
show up in the addition.
In the double hashing methods, _all_ the bits of the hash code affect
the value of `inc` computed before the loop starts, so they have a
good chance of differing already on the second probe.
This is again potentially worse for the current scheme with 64-bit
hash codes, since the sheer number of common trailing bits _can_ be up
to 63.
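Here's a small standalone illustration of that difference.  It compares
probe prefixes for two hash codes that agree in all of their low 40 bits
(an exaggerated stand-in for the `i << 12` family), using copies of the
`current` and `double` probers from the attached script and an 8-bit
table:

    from itertools import islice

    def current(h, nbits):
        mask = (1 << nbits) - 1
        i = h & mask
        while True:
            yield i
            h >>= 5
            i = (5*i + h + 1) & mask

    def double(h, nbits):
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        inc = (h % mask) | 1 # force it odd
        while True:
            i = (i + inc) & mask
            yield i

    h1, h2 = 1 << 40, 1 << 41 # identical in their low 40 bits
    for prober in (current, double):
        print(prober.__name__,
              list(islice(prober(h1, 8), 8)),
              list(islice(prober(h2, 8), 8)))

For `current` the first 7 probes are identical for both hash codes; for
`double` the sequences already differ at the second probe.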
However, the double-hashing schemes have pathological cases too, ones
the current scheme avoids.  The first I tried was a `yield i * 1023`
generator.  These are spectacularly _good_ values for all schemes
except `uniform` for successful searches, because 1023 is odd (so
invertible mod 2**k), and therefore

    i*1023 = j*1023 (mod 2**k)

implies

    i = j (mod 2**k)
That is, there are no first-probe collisions at all across a
contiguous range of `i` with no more than 2**k values. But the
`double` scheme for a 10-bit table degenerates to linear probing in
this case, because
    inc = (h % mask) | 1 # force it odd
is always 1 when h is divisible by 1023 (== mask for a 10-bit table).
This is terrible for a failing search; e.g., for a 20-bit table (note
that 2**20-1 is divisible by 1023):
"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
found min 1:100.00% max 1 mean 1.00
fail min 1:33.33% max 34 mean 3.04
prober double
found min 1:100.00% max 1 mean 1.00
fail min 1:33.33% max 699049 mean 1867.51
prober dfib
found min 1:100.00% max 1 mean 1.00
fail min 1:33.33% max 427625 mean 8.09
prober uniform
found min 1:66.65% max 24 mean 1.65
fail min 1:33.35% max 35 mean 3.00
"""
While that's a failing-search disaster for `double`, it's also bad for
`dfib` (& I don't have a slick explanation for that).
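The degenerate increment is easy to see directly.  A quick standalone
check for a 10-bit table (mask == 1023), treating the `i * 1023` values
as their own hash codes (as CPython's int hash does for values this
size):

    mask = (1 << 10) - 1
    incs = {(i * 1023 % mask) | 1 for i in range(1, 100_000)}
    print(incs) # {1}: every key gets increment 1, i.e. plain linear probing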
So where does that leave us? I don't know. All schemes have good &
bad points ;-) I haven't yet thought of a cheap way to compute an
`inc` for double-hashing that isn't vulnerable to bad behavior for
_some_ easily constructed set of int keys. If you forget "cheap",
it's easy; e.g.,
    random.seed(h)
    inc = random.choice(range(1, mask + 1, 2))
Regardless, I'll attach the current version of the code.
-------------- next part --------------
MIN_ELTS = 100_000
M64 = (1 << 64) - 1
def phash(obj, M=M64): # hash(obj) as uint64
    return hash(obj) & M
# Probers: generate sequence of table indices to look at,
# in table of size 2**nbits, for object with uint64 hash code h.
def linear(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (i + 1) & mask
# offsets of 0, 1, 3, 6, 10, 15, ...
# this permutes the index range when the table size is a power of 2
def quadratic(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    inc = 1
    while True:
        yield i
        i = (i + inc) & mask
        inc += 1
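# Like `current` below, but the full hash is added before it's shifted, so
# the first perturbed probe reuses the low bits that already picked the
# initial index.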
def pre28201(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (5*i + h + 1) & mask
        h >>= 5
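# The current scheme: shift `h` first, so each perturbed probe folds in
# higher-order hash bits that haven't been used yet.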
def current(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        h >>= 5
        i = (5*i + h + 1) & mask
# One version of "double hashing". The increment between probes is
# fixed, but varies across objects. This does very well! Note that the
# increment needs to be relatively prime to the table size so that all
# possible indices are generated. Because our tables have power-of-2
# sizes, we merely need to ensure the increment is odd.
# Using `h % mask` is akin to "casting out 9's" in decimal: it's as if
# we broke the hash code into nbits-wide chunks from the right, then
# added them, then repeated that procedure until only one "digit"
# remains. All bits in the hash code affect the result.
# While mod is expensive, a successful search usually gets out on the
# first try, & then the lookup can return before the mod completes.
def double(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (h % mask) | 1 # force it odd
    while True:
        i = (i + inc) & mask
        yield i
# Double hashing using Fibonacci multiplication for the increment. This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2). So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product. In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance: I don't think Python dicts store `nbits`, just 2**nbits.
def dfib(h, nbits, M=M64):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
    while True:
        i = (i + inc) & mask
        yield i
# The theoretical "gold standard": generate a random permutation of the
# table indices for each object. We can't actually do that, but
# Python's PRNG gets close enough that there's no practical difference.
def uniform(h, nbits):
    from random import seed, randrange
    seed(h)
    n = 1 << nbits
    seen = set()
    while True:
        assert len(seen) < n
        while True:
            i = randrange(n)
            if i not in seen:
                break
        seen.add(i)
        yield i
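# Simulate probing a table of 2**nbits slots.  For each object, count the
# probes needed to reach an empty slot and tally that count in `cs`.  With
# `used=None` the table is built as we go, so the tallies are the probe
# counts later successful searches will need; passing an already-filled
# `used` list instead simulates failing searches.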
def spray(nbits, objs, cs, prober, *, used=None, shift=5):
    building = used is None
    nslots = 1 << nbits
    if building:
        used = [0] * nslots
    assert len(used) == nslots
    for o in objs:
        n = 1
        for i in prober(phash(o), nbits):
            if used[i]:
                n += 1
            else:
                break
        if building:
            used[i] = 1
        cs[n] += 1
    return used
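# Generate hashable test keys.  Swap in one of the commented-out lines to
# try the pathological int patterns discussed in the message above.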
def objgen(i=1):
    while True:
        yield str(i)
        #yield i << 12
        # i*1023 gives a unique first probe, but is deadly
        # on failing searches for `double` (especially) and
        # `dfib`.
        #yield i * 1023
        i += 1
# Average probes for a failing search; e.g.,
# 100 slots; 3 occupied
# 1: 97/100
# 2: 3/100 * 97/99
# 3: 3/100 * 2/99 * 97/98
# 4: 3/100 * 2/99 * 1/98 * 97/97
#
# `total` slots, `filled` occupied
# probability `p` probes will be needed, 1 <= p <= filled+1
# p-1 collisions followed by success:
# ff(filled, p-1) / ff(total, p-1) * (total - filled) / (total - (p-1))
# where `ff` is the falling factorial.
def avgf(total, filled):
    assert 0 <= filled < total
    ffn = float(filled)
    ffd = float(total)
    tmf = ffd - ffn
    result = 0.0
    ffpartial = 1.0
    ppartial = 0.0
    for p in range(1, filled + 2):
        thisp = ffpartial * tmf / (total - (p-1))
        ppartial += thisp
        result += thisp * p
        ffpartial *= ffn / ffd
        ffn -= 1.0
        ffd -= 1.0
    assert abs(ppartial - 1.0) < 1e-14, ppartial
    return result
# Average probes for a successful search. Alas, this takes time
# quadratic in `filled`.
def avgs(total, filled):
    assert 0 < filled < total
    return sum(avgf(total, f) for f in range(filled)) / filled
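# Report stats from a {number_of_probes: occurrences} mapping: the smallest
# probe count and how often it occurred, the largest, and the mean.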
def pstats(ns):
    total = sum(ns.values())
    small = min(ns)
    print(f"min {small}:{ns[small]/total:.2%} "
          f"max {max(ns)} "
          f"mean {sum(i * j for i, j in ns.items())/total:.2f} ")
def drive(nbits):
    from collections import defaultdict
    from itertools import islice
    import math
    import sys
    nslots = 1 << nbits
    dlen = nslots * 2 // 3
    assert (sys.getsizeof({i: i for i in range(dlen)}) <
            sys.getsizeof({i: i for i in range(dlen + 1)}))
    alpha = dlen / nslots # actual load factor of max dict
    ntodo = (MIN_ELTS + dlen - 1) // dlen
    print()
    print("bits", nbits,
          f"nslots {nslots:,} dlen {dlen:,} alpha {alpha:.2f} "
          f"# built {ntodo:,}")
    print(f"theoretical avg probes for uniform hashing "
          f"when found {math.log(1/(1-alpha))/alpha:.2f} "
          f"not found {1/(1-alpha):.2f}")
    print(" crisp ", end="")
    if nbits > 12:
        print("... skipping (slow!)")
    else:
        print(f"when found {avgs(nslots, dlen):.2f} "
              f"not found {avgf(nslots, dlen):.2f}")
    for prober in (
            linear,
            quadratic,
            pre28201,
            current,
            double,
            dfib,
            uniform,
            ):
        print(" prober", prober.__name__)
        objs = objgen()
        good = defaultdict(int)
        bad = defaultdict(int)
        for _ in range(ntodo):
            used = spray(nbits, islice(objs, dlen), good, prober)
            assert sum(used) == dlen
            spray(nbits, islice(objs, nslots), bad, prober, used=used)
        print(" " * 8 + "found ", end="")
        pstats(good)
        print(" " * 8 + "fail ", end="")
        pstats(bad)
for bits in range(3, 23):
    drive(bits)