[Python-ideas] Reducing collisions in small dicts/sets

Tim Peters tim.peters at gmail.com
Mon Jun 26 14:21:03 EDT 2017


Some explanations and cautions.  An advantage of sticking with pure
Python instead of C is that spare moments are devoted to investigating
the problem instead of thrashing with micro-optimizations ;-)

Why does the current scheme suffer for small tables?  With hindsight
it's pretty obvious:  it can visit table slots more than once (the
other schemes cannot), and while the odds of that happening are tiny
in large tables, they're high for small tables.

For example, in a 3-bit table (8 slots), suppose the initial index is
1, and there's a collision.  The 5*i+1 recurrence _would_ visit slot 6
next, but we also shift in 5 new bits and add them.  If those bits are
"random", there's a 1-in-8 chance that we'll visit index 1 _again_.
If we don't, and have another collision, shifting in the next 5 bits
has a 2-in-8 chance of repeating one of the two slots we already
visited.  And so on.  Here using the `str(i)` generator (which yields
random-ish hash codes):

"""
bits 3 nslots 8 dlen 5 alpha 0.62 # built 20,000
    ...
    prober current
        found min 1:74.80% max 15 mean 1.42
        fail  min 1:37.62% max 18 mean 2.66
"""

So even though only 5 slots are filled, in at least one case it took
15 probes to find an existing key, and 18 probes to realize a key was
missing.  In the other schemes, it takes at most 5 probes for success
and 6 probes for failure.
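
Here's a tiny self-contained sketch of that effect (it copies the
`current` prober from the attached code; the counting loop is my own).
For random-ish 64-bit hashes, almost every key revisits at least one
slot within its first 8 probes of an 8-slot table:

    from collections import Counter
    from itertools import islice

    M64 = (1 << 64) - 1

    def current(h, nbits):          # same recurrence as in the attachment
        mask = (1 << nbits) - 1
        i = h & mask
        while True:
            yield i
            h >>= 5
            i = (5*i + h + 1) & mask

    revisits = Counter()
    for k in range(1, 20_001):
        seq = list(islice(current(hash(str(k)) & M64, 3), 8))
        revisits[8 - len(set(seq))] += 1   # probes that landed on an already-seen slot
    print(sorted(revisits.items()))        # 0 revisits is the rare case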

This is worse for 64-bit hash codes than for 32-bit ones, because we
can loop around twice as often before `perturb` permanently becomes 0
(at which point the pure 5*i+1 recurrence visits each slot at most
once).

The larger the table, the less likely a repeat due to `perturb`
becomes.  For example, suppose we have an 8-bit table and again visit
index 1 first.  We may visit any of 32 different indices next (6 plus
8 times the value of the 5 fresh bits shifted in; the low 3 bits of the
shifted `perturb` are pinned to 0 by the first index), but repeating 1
is _not_ a possibility.
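
A quick check of that claim (my enumeration, using the same recurrence
as the attached `current` prober):

    # With an 8-bit table and first index 1, only bits 8-12 of h are free to
    # vary in the second probe; bits 5-7 are pinned to 0 by the first index.
    mask = (1 << 8) - 1
    second = set()
    for fresh in range(32):            # the 5 fresh bits shifted in
        h = (fresh << 8) | 1           # the only bits of h that matter here
        second.add((5*1 + (h >> 5) + 1) & mask)
    print(len(second), 1 in second)    # -> 32 False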

Why do the double hashing methods do better for the `i << 12`
generator?  In any case where the hash codes have "a lot" of trailing
bits in common, they all map to the same table index at first, and the
probe sequence remains the same for all until the loop shifts
`perturb` far enough right that the rightmost differing bits finally
show up in the addition.

In the double hashing methods, _all_ the bits of the hash code affect
the value of `inc` computed before the loop starts, so they have a
good chance of differing already on the second probe.
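
A concrete illustration (my numbers, 10-bit table):

    # h = i << 12 lands on slot 0 at the first probe for every i, but the
    # `double` increment (h % mask) | 1 already differs from key to key.
    nbits = 10
    mask = (1 << nbits) - 1
    for i in (1, 2, 3, 4):
        h = i << 12
        print(h & mask, (h % mask) | 1)
    # first column: always 0; second column: 5, 9, 13, 17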

This is again potentially worse for the current scheme with 64-bit
hash codes, since the sheer number of common trailing bits _can_ be up
to 63.

However, the double-hashing schemes have pathological cases of their
own, ones the current scheme avoids.  The first I tried was a `yield i * 1023`
generator.  These are spectacularly _good_ values for all schemes
except `uniform` for successful searches, because

    i*1023 = j*1023 (mod 2**k)

implies (because 1023 is odd, hence invertible mod 2**k)

    i=j (mod 2**k)

That is, there are no first-probe collisions at all across a
contiguous range of `i` with no more than 2**k values.  But the
`double` scheme for a 10-bit table degenerates to linear probing in
this case, because

    inc = (h % mask) | 1    # force it odd

is always 1 when h is divisible by 1023 (== mask for a 10-bit table).
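
A quick check of that degeneration (my code, not from the attachment):

    mask = (1 << 10) - 1     # 1023
    # every key of the form i*1023 gets increment 1 - plain linear probing
    print({((i * 1023) % mask) | 1 for i in range(1, 10_000)})   # -> {1}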
This is terrible for a failing search; e.g., for a 20-bit table (note
that 2**20-1 is divisible by 1023):

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
    prober current
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 34 mean 3.04
    prober double
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 699049 mean 1867.51
    prober dfib
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 427625 mean 8.09
    prober uniform
        found min 1:66.65% max 24 mean 1.65
        fail  min 1:33.35% max 35 mean 3.00
"""

While that's a failing-search disaster for `double`, it's also bad for
`dfib` (& I don't have a slick explanation for that).
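
For `double` at 20 bits, at least part of what's going on shows up
directly in the increments (my arithmetic, not from the runs above):
mask = 2**20 - 1 = 1023*1025, so h = i*1023 gives
inc = (1023*(i % 1025)) | 1.  Only 1025 distinct increments ever occur,
and every 1025th key degenerates all the way to inc == 1:

    mask = (1 << 20) - 1      # 1023 * 1025
    incs = {((i * 1023) % mask) | 1 for i in range(1, 1_000_000)}
    print(len(incs), 1 in incs)    # -> 1025 True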

So where does that leave us?  I don't know.  All schemes have good &
bad points ;-)  I haven't yet thought of a cheap way to compute an
`inc` for double-hashing that isn't vulnerable to bad behavior for
_some_ easily constructed set of int keys.  If you forget "cheap",
it's easy; e.g.,

    random.seed(h)
    inc = random.choice(range(1, mask + 1, 2))

Regardless, I'll attach the current version of the code.
-------------- next part --------------
MIN_ELTS = 100_000

M64 = (1 << 64) - 1
def phash(obj, M=M64):  # hash(obj) as uint64
    return hash(obj) & M

# Probers:  generate sequence of table indices to look at,
# in table of size 2**nbits, for object with uint64 hash code h.

def linear(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (i + 1) & mask

# offsets of 0, 1, 3, 6, 10, 15, ...
# this permutes the index range when the table size is a power of 2
def quadratic(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    inc = 1
    while True:
        yield i
        i = (i + inc) & mask
        inc += 1
    
def pre28201(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (5*i + h + 1) & mask
        h >>= 5

def current(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        h >>= 5
        i = (5*i + h + 1) & mask

# One version of "double hashing".  The increment between probes is
# fixed, but varies across objects.  This does very well!  Note that the
# increment needs to be relatively prime to the table size so that all
# possible indices are generated.  Because our tables have power-of-2
# sizes, we merely need to ensure the increment is odd.
# Using `h % mask` is akin to "casting out 9's" in decimal:  it's as if
# we broke the hash code into nbits-wide chunks from the right, then
# added them, then repeated that procedure until only one "digit"
# remains.  All bits in the hash code affect the result.
# While mod is expensive, a successful search usually gets out on the
# first try, & then the lookup can return before the mod completes.
def double(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (h % mask) | 1    # force it odd
    while True:
        i = (i + inc) & mask
        yield i

# Double hashing using Fibonacci multiplication for the increment.  This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2).  So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product.  In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance:  I don't think Python dicts store `nbits`, just 2**nbits.
def dfib(h, nbits, M=M64):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
    while True:
        i = (i + inc) & mask
        yield i

# The theoretical "gold standard":  generate a random permutation of the
# table indices for each object.  We can't actually do that, but
# Python's PRNG gets close enough that there's no practical difference.
def uniform(h, nbits):
    from random import seed, randrange
    seed(h)
    n = 1 << nbits
    seen = set()
    while True:
        assert len(seen) < n
        while True:
            i = randrange(n)
            if i not in seen:
                break
        seen.add(i)
        yield i

def spray(nbits, objs, cs, prober, *, used=None, shift=5):
    building = used is None
    nslots = 1 << nbits
    if building:
        used = [0] * nslots
    assert len(used) == nslots
    for o in objs:
        n = 1
        for i in prober(phash(o), nbits):
            if used[i]:
                n += 1
            else:
                break
        if building:
            used[i] = 1
        cs[n] += 1
    return used

def objgen(i=1):
    while True:
        yield str(i)

        #yield i << 12

        # i*1023 gives a unique first probe, but is deadly
        # on failing searches for `double` (especially) and
        # `dfib`.
        #yield i * 1023

        i += 1

# Average probes for a failing search; e.g.,
# 100 slots; 3 occupied
# 1:                        97/100
# 2:  3/100 *               97/99
# 3:  3/100 * 2/99 *        97/98
# 4:  3/100 * 2/99 * 1/98 * 97/97
#
# `total` slots, `filled` occupied
# probability `p` probes will be needed, 1 <= p <= filled+1
# p-1 collisions followed by success:
#     ff(filled, p-1) / ff(total, p-1) * (total - filled) / (total - (p-1))
# where `ff` is the falling factorial.
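# For example, with the numbers above (my arithmetic):
#     avgf(100, 3) ~= 101/98 ~= 1.0306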
def avgf(total, filled):
    assert 0 <= filled < total
    ffn = float(filled)
    ffd = float(total)
    tmf = ffd - ffn
    result = 0.0
    ffpartial = 1.0
    ppartial = 0.0
    for p in range(1, filled + 2):
        thisp = ffpartial * tmf / (total - (p-1))
        ppartial += thisp
        result += thisp * p
        ffpartial *= ffn / ffd
        ffn -= 1.0
        ffd -= 1.0
    assert abs(ppartial - 1.0) < 1e-14, ppartial
    return result

# Average probes for a successful search.  Alas, this takes time
# quadratic in `filled`.
def avgs(total, filled):
    assert 0 < filled < total
    return sum(avgf(total, f) for f in range(filled)) / filled

def pstats(ns):
    total = sum(ns.values())
    small = min(ns)
    print(f"min {small}:{ns[small]/total:.2%} "
          f"max {max(ns)} "
          f"mean {sum(i * j for i, j in ns.items())/total:.2f} ")

def drive(nbits):
    from collections import defaultdict
    from itertools import islice
    import math
    import sys

    nslots = 1 << nbits
    dlen = nslots * 2 // 3
    assert (sys.getsizeof({i: i for i in range(dlen)}) <
            sys.getsizeof({i: i for i in range(dlen + 1)}))
    alpha = dlen / nslots # actual load factor of max dict
    ntodo = (MIN_ELTS + dlen - 1) // dlen
    print()
    print("bits", nbits,
          f"nslots {nslots:,} dlen {dlen:,} alpha {alpha:.2f} "
          f"# built {ntodo:,}")
    print(f"theoretical avg probes for uniform hashing "
          f"when found {math.log(1/(1-alpha))/alpha:.2f} "
          f"not found {1/(1-alpha):.2f}")
    print("                                     crisp ", end="")
    if nbits > 12:
        print("... skipping (slow!)")
    else:
        print(f"when found {avgs(nslots, dlen):.2f} "
              f"not found {avgf(nslots, dlen):.2f}")

    for prober in (
          linear,
          quadratic,
          pre28201,
          current,
          double,
          dfib,
          uniform,
        ):
        print("    prober", prober.__name__)
        objs = objgen()
        good = defaultdict(int)
        bad = defaultdict(int)
        for _ in range(ntodo):
            used = spray(nbits, islice(objs, dlen), good, prober)
            assert sum(used) == dlen
            spray(nbits, islice(objs, nslots), bad, prober, used=used)
        print(" " * 8 + "found ", end="")
        pstats(good)
        print(" " * 8 + "fail  ", end="")
        pstats(bad)

for bits in range(3, 23):
    drive(bits)

