[Python-ideas] Reducing collisions in small dicts/sets

Tim Peters tim.peters at gmail.com
Sat Jun 24 17:29:28 EDT 2017


Short course:  the average number of probes needed when searching
small dicts/sets can be reduced, in both successful ("found") and
failing ("not found") cases.

But I'm not going to pursue this.  This is a brain dump for someone
who's willing to endure the interminable pain of arguing about
benchmarks ;-)

Background:

http://bugs.python.org/issue30671

raised some questions about how dict collisions are handled.  While
the analysis there didn't make sense to me, I wrote enough code to dig
into it.  As detailed in that bug report, the current implementation
appeared to meet the theoretical performance of "uniform hashing",
meaning there was no room left for improvement.

However, that missed something:  the simple expressions for expected
probes under uniform hashing are upper bounds, and while they're
excellent approximations for modest load factors in sizable tables,
for small tables they're significantly overstated.  For example, for a
table with 5 items in 8 slots, the load factor is a = 5/8 = 0.625, and

    avg probes when found = log(1/(1-a))/a = 1.57
           when not found = 1/(1-a)        = 2.67

However, exact analysis gives 1.34 and 2.25 instead.  The current code
achieves the upper bounds, but not the exact values.  As a sanity
check, a painfully slow implementation of uniform hashing does achieve
the exact values.
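
If you want to see where 1.34 and 2.25 come from without running the
attached benchmark, here's a tiny exact-fraction sketch of the same
analysis the attached avgf()/avgs() functions do:

    from fractions import Fraction as F

    def exact_fail(total, filled):
        # expected probes for a failing search:  p-1 collisions with
        # occupied slots, then an empty slot, probing distinct slots
        result, p_collide = F(0), F(1)
        for p in range(1, filled + 2):
            result += p * p_collide * F(total - filled, total - (p - 1))
            p_collide *= F(filled - (p - 1), total - (p - 1))
        return result

    def exact_found(total, filled):
        # a successful search averages the failing-search cost over the
        # partial tables seen as the keys were inserted
        return sum(exact_fail(total, f) for f in range(filled)) / filled

    print(float(exact_found(8, 5)), float(exact_fail(8, 5)))  # 1.342..., 2.25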

Code for all this is attached, in a framework that allows you to
easily plug in any probe sequence strategy.  The current strategy is
implemented by generator "current".  There are also implementations of
"linear" probing, "quadratic" probing, "pre28201" probing (the current
strategy before bug 28201 repaired an error in its coding), "uniform"
probing, and ... "double".  The last is one form of "double hashing"
that gets very close to "uniform".

Its loop guts are significantly cheaper than the current scheme's:
just 1 addition and 1 mask per probe.  However, it requires a
non-trivial modulus to get started, and that's expensive.
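
For comparison, the per-probe loop bodies from the attached generators
are:

    # current:  a shift, a multiply, two additions, and a mask per probe
    h >>= 5
    i = (5*i + h + 1) & mask

    # double:  one addition and one mask per probe (inc is fixed per key)
    i = (i + inc) & mask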

Is there a cheaper way to get close to "uniform"?  I don't know - this
was just the best I came up with so far.

Does it matter?  See above ;-)

If you want to pursue this, take these as given:

1. The first probe must be simply the last `nbits` bits of the hash
code.  The speed of the first probe is supremely important:  that's the
fastest possible first probe, and it guarantees no collisions at all
for a dict indexed by a contiguous range of integers (an important use
case; a small example follows this list).

2. The probe sequence must (at least eventually) depend on every bit
in the hash code.  Else it's waaay too easy to stumble into
quadratic-time behavior for "bad" sets of keys, even by accident.
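
About point 1:  in CPython, hash(i) == i for small non-negative ints,
so masking off the low bits sends a contiguous range of int keys to
distinct slots with no collisions at all:

    nbits = 3
    mask = (1 << nbits) - 1
    print([hash(k) & mask for k in range(5)])   # [0, 1, 2, 3, 4]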

Have fun :-)
-------------- next part --------------
MIN_ELTS = 100_000

M64 = (1 << 64) - 1
def phash(obj, M=M64):  # hash(obj) as uint64
    return hash(obj) & M

# Probers:  generate sequence of table indices to look at,
# in table of size 2**nbits, for object with uint64 hash code h.

def linear(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (i + 1) & mask

# offsets of 0, 1, 3, 6, 10, 15, ...
# this permutes the index range when the table size is a power of 2
def quadratic(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    inc = 1
    while True:
        yield i
        i = (i + inc) & mask
        inc += 1
    
# The strategy before bug 28201 repaired an error in its coding:  the
# leftover hash bits are folded into the index update *before* being
# shifted.
def pre28201(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        i = (5*i + h + 1) & mask
        h >>= 5

# The strategy CPython currently uses:  shift the leftover hash bits
# first, then fold them into the index update.
def current(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    while True:
        yield i
        h >>= 5
        i = (5*i + h + 1) & mask

# One version of "double hashing".  The increment between probes is
# fixed, but varies across objects.  This does very well!  Note that the
# increment needs to be relatively prime to the table size so that all
# possible indices are generated.  Because our tables have power-of-2
# sizes, we merely need to ensure the increment is odd.
# Using `h % mask` is akin to "casting out 9's" in decimal:  it's as if
# we broke the hash code into nbits-wide chunks from the right, then
# added them, then repeated that procedure until only one "digit"
# remains. All bits in the hash code affect the result.
# While mod is expensive, a successful search usually gets out on the
# first try, & then the lookup can return before the mod completes.
def double(h, nbits):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (h % mask) | 1    # force it odd
    while True:
        i = (i + inc) & mask
        yield i
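
# Quick check of the claim above:  with an odd increment and a power-of-2
# table, the fixed stride visits every slot exactly once before repeating.
from itertools import islice
assert sorted(islice(double(12345, 4), 16)) == list(range(16))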

# The theoretical "gold standard":  generate a random permutation of the
# table indices for each object.  We can't actually do that, but
# Python's PRNG gets close enough that there's no practical difference.
def uniform(h, nbits):
    from random import seed, randrange
    seed(h)
    n = 1 << nbits
    seen = set()
    while True:
        assert len(seen) < n
        while True:
            i = randrange(n)
            if i not in seen:
                break
        seen.add(i)
        yield i
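
# Optional helper, not used by the driver below:  print the first `count`
# probe indices each strategy generates for hash code `h` in a table of
# 2**nbits slots (keep count <= 2**nbits so `uniform` doesn't exhaust).
# Try, e.g., show_probes(phash("42")).
def show_probes(h, nbits=3, count=8):
    from itertools import islice
    for prober in (linear, quadratic, pre28201, current, double, uniform):
        print(f"{prober.__name__:>9}:", list(islice(prober(h, nbits), count)))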

# Simulate searching a table of 2**nbits slots for each object in `objs`,
# recording in counter `cs` how many probes each search needs (cs[n] += 1
# for an n-probe search).  If `used` is None, the table is built as we go,
# so the counts are for successful searches (an insert probes the same
# slots a later lookup of that key will).  If a `used` vector from a
# previous call is passed in, nothing is inserted, so the counts are for
# failing searches against that table.
def spray(nbits, objs, cs, prober, *, used=None, shift=5):
    building = used is None
    nslots = 1 << nbits
    mask = nslots - 1
    if building:
        used = [0] * nslots
    assert len(used) == nslots
    for o in objs:
        n = 1   # number of probes
        for i in prober(phash(o), nbits):
            if used[i]:
                n += 1
            else:
                break
        if building:
            used[i] = 1
        cs[n] += 1
    return used

# Endless stream of distinct string keys:  "1", "2", "3", ...
def objgen(i=1):
    while True:
        yield str(i)
        i += 1

# Average probes for a failing search; e.g.,
# 100 slots; 3 occupied
# 1:                        97/100
# 2:  3/100 *               97/99
# 3:  3/100 * 2/99 *        97/98
# 4:  3/100 * 2/99 * 1/98 * 97/97
#
# `total` slots, `filled` occupied
# probability `p` probes will be needed, 1 <= p <= filled+1
# p-1 collisions followed by success:
#     ff(filled, p-1) / ff(total, p-1) * (total - filled) / (total - (p-1))
# where `ff` is the falling factorial.
def avgf(total, filled):
    assert 0 <= filled < total
    ffn = float(filled)
    ffd = float(total)
    tmf = ffd - ffn
    result = 0.0
    ffpartial = 1.0
    ppartial = 0.0
    for p in range(1, filled + 2):
        thisp = ffpartial * tmf / (total - (p-1))
        ppartial += thisp
        result += thisp * p
        ffpartial *= ffn / ffd
        ffn -= 1.0
        ffd -= 1.0
    assert abs(ppartial - 1.0) < 1e-14, ppartial
    return result

# Average probes for a successful search.  Alas, this takes time
# quadratic in `filled`.
def avgs(total, filled):
    assert 0 < filled < total
    return sum(avgf(total, f) for f in range(filled)) / filled
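
# The exact figures quoted in the message body for 5 items in 8 slots:
assert round(avgs(8, 5), 2) == 1.34
assert round(avgf(8, 5), 2) == 2.25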

# Print summary stats for a probe-count counter `ns` (as filled in by
# spray):  the minimum probe count and its relative frequency, the
# maximum, and the mean.
def pstats(ns):
    total = sum(ns.values())
    small = min(ns)
    print(f"min {small}:{ns[small]/total:.2%} "
          f"max {max(ns)} "
          f"mean {sum(i * j for i, j in ns.items())/total:.2f} ")

def drive(nbits):
    from collections import defaultdict
    from itertools import islice
    import math
    import sys

    nslots = 1 << nbits
    dlen = nslots * 2 // 3   # the most items a dict can hold in nslots slots
    # sanity check:  one more item than dlen forces the dict to grow
    assert (sys.getsizeof({i: i for i in range(dlen)}) <
            sys.getsizeof({i: i for i in range(dlen + 1)}))
    alpha = dlen / nslots  # actual load factor of max dict
    ntodo = (MIN_ELTS + dlen - 1) // dlen  # dicts to build for >= MIN_ELTS elements
    print()
    print("bits", nbits,
          f"nslots {nslots:,} dlen {dlen:,} alpha {alpha:.2f} "
          f"# built {ntodo:,}")
    print(f"theoretical avg probes for uniform hashing "
          f"when found {math.log(1/(1-alpha))/alpha:.2f} "
          f"not found {1/(1-alpha):.2f}")
    print("                                     crisp ", end="")
    if nbits > 12:
        print("... skipping (slow!)")
    else:
        print(f"when found {avgs(nslots, dlen):.2f} "
              f"not found {avgf(nslots, dlen):.2f}")

    for prober in (linear, quadratic, pre28201, current,
                   double, uniform):
        print("    prober", prober.__name__)
        objs = objgen()
        good = defaultdict(int)   # probe counts for successful searches
        bad = defaultdict(int)    # probe counts for failing searches
        for _ in range(ntodo):
            used = spray(nbits, islice(objs, dlen), good, prober)
            assert sum(used) == dlen
            # the next dlen objects aren't in the table -> failing searches
            spray(nbits, islice(objs, dlen), bad, prober, used=used)
        print(" " * 8 + "found ", end="")
        pstats(good)
        print(" " * 8 + "fail  ", end="")
        pstats(bad)

for bits in range(3, 23):
    drive(bits)

