Reducing collisions in small dicts/sets

Short course: the average number of probes needed when searching small dicts/sets can be reduced, in both successful ("found") and failing ("not found") cases. But I'm not going to pursue this. This is a brain dump for someone who's willing to endure the interminable pain of arguing about benchmarks ;-)

Background: http://bugs.python.org/issue30671 raised some questions about how dict collisions are handled. While the analysis there didn't make sense to me, I wrote enough code to dig into it.

As detailed in that bug report, the current implementation appeared to meet the theoretical performance of "uniform hashing", meaning there was no room left for improvement. However, that missed something: the simple expressions for expected probes under uniform hashing are upper bounds, and while they're excellent approximations for modest load factors in sizable tables, for small tables they're significantly overstated.

For example, for a table with 5 items in 8 slots, the load factor is

    a = 5/8 = 0.625

and

    avg probes when found     = log(1/(1-a))/a = 1.57
               when not found = 1/(1-a)        = 2.67

However, exact analysis gives 1.34 and 2.25 instead. The current code achieves the upper bounds, but not the exact values. As a sanity check, a painfully slow implementation of uniform hashing does achieve the exact values.

Code for all this is attached, in a framework that allows you to easily plug in any probe sequence strategy. The current strategy is implemented by generator "current". There are also implementations of "linear" probing, "quadratic" probing, "pre28201" probing (the current strategy before bug 28201 repaired an error in its coding), "uniform" probing, and ... "double". The last is one form of "double hashing" that gets very close to "uniform". Its loop guts are significantly cheaper than the current scheme, just 1 addition and 1 mask. However, it requires a non-trivial modulus to get started, and that's expensive. Is there a cheaper way to get close to "uniform"? I don't know - this was just the best I came up with so far.

Does it matter? See above ;-) If you want to pursue this, take these as given:

1. The first probe must be simply the last `nbits` bits of the hash code. The speed of the first probe is supremely important, that's the fastest possible first probe, and it guarantees no collisions at all for a dict indexed by a contiguous range of integers (an important use case).

2. The probe sequence must (at least eventually) depend on every bit in the hash code. Else it's waaay too easy to stumble into quadratic-time behavior for "bad" sets of keys, even by accident.

Have fun :-)
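As a rough sketch (mine, not the attached code) of what a prober generator in that framework looks like, here is the `current` scheme as the rest of this thread describes it - first probe from the low `nbits` bits, then repeatedly shift 5 fresh bits out of `perturb` and compute 5*i + perturb + 1 - together with a quick check of the upper-bound formulas above:

    from math import log

    def current(h, nbits):
        # Sketch only: follows the recurrence discussed in this thread,
        # not necessarily the exact C code in dictobject.c.
        mask = (1 << nbits) - 1
        i = h & mask
        yield i                  # given #1: first probe is the low `nbits` bits
        perturb = h              # given #2: every hash bit eventually feeds in
        while True:
            perturb >>= 5
            i = (5 * i + perturb + 1) & mask
            yield i

    # Upper-bound formulas for 5 items in 8 slots (a = 0.625):
    a = 5 / 8
    print(log(1 / (1 - a)) / a)  # ~1.57 probes, successful ("found") search
    print(1 / (1 - a))           # ~2.67 probes, failing ("not found") search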

Two bits of new info: first, it's possible to get the performance of "double" without division, at least via this way:

"""
# Double hashing using Fibonacci multiplication for the increment.  This
# does about as well as `double`, but doesn't require division.
#
# The multiplicative constant depends on the word size W, and is the
# nearest odd integer to 2**W/((1 + sqrt(5))/2).  So for a 64-bit box:
#
# >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2)
# Decimal('11400714819323198485.95161059')
#
# For a 32-bit box, it's 2654435769.
#
# The result is the high-order `nbits` bits of the low-order W bits of
# the product.  In C, the "& M" part isn't needed (unsigned * in C
# returns only the low-order bits to begin with).
#
# Annoyance:  I don't think Python dicts store `nbits`, just 2**nbits.

def dfib(h, nbits, M=M64):
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1
    while True:
        i = (i + inc) & mask
        yield i
"""

Second, the program I posted uses string objects as test cases. The current string hash acts like a good-quality pseudo-random number generator: change a character in the string, and the hash typically changes "a lot".

It's also important to look at hashes with common patterns, because, e.g., all "sufficiently small" integers are their _own_ hash codes (e.g., hash(3) == 3). Indeed, long ago Python changed its dict implementation to avoid quadratic-time behavior for unfortunate sets of real-life integer keys.

Example: change `objgen()` to `yield i << 12` instead of `yield str(i)`. Then we generate integers all of whose final 12 bits are zeroes. For all sufficiently small tables, then, these ints _all_ map to the same initial table slot (since the initial probe is taken from the last `nbits` bits). The collision resolution scheme is needed to prevent disaster.

Here's a chunk of output from that, for dicts of size 2,730:

"""
bits 12 nslots 4,096 dlen 2,730 alpha 0.67 # built 37
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
crisp when found 1.65 not found 3.00
prober linear
    found min 1:0.04% max 2730 mean 1365.50
    fail min 2731:100.00% max 2731 mean 2731.00
prober quadratic
    found min 1:0.04% max 2730 mean 1365.50
    fail min 2731:100.00% max 2731 mean 2731.00
prober pre28201
    found min 1:0.04% max 61 mean 6.17
    fail min 5:28.94% max 68 mean 9.78
prober current
    found min 1:0.04% max 58 mean 5.17
    fail min 4:29.30% max 70 mean 8.62
prober double
    found min 1:0.04% max 5 mean 2.73
    fail min 2:41.47% max 9 mean 3.94
prober dfib
    found min 1:0.04% max 9 mean 2.53
    fail min 2:10.87% max 17 mean 4.48
prober uniform
    found min 1:66.52% max 21 mean 1.65
    fail min 1:33.41% max 30 mean 2.99
"""

It's a worst-case disaster for linear and quadratic probing. The `current` scheme does fine, but `double` and `dfib` do significantly better on all measures shown. `uniform` is oblivious, still achieving its theoretical average-case performance.

That last isn't surprising "in theory", since it theoretically picks a probe sequence uniformly at random from the set of all table slot permutations. It's perhaps a bit surprising that this approximate _implementation_ also achieves it. But `random.seed(h)` does a relatively enormous amount of work to spray the bits of `h` all over the Twister's internal state, and then shuffle them around.

In any case, the "double hashing" variants ("double" and "dfib") look very promising, doing better than the current scheme for both small tables and pathological key sets.
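For comparison with `dfib`, the division-based `double` prober can be sketched in the same generator style (again my sketch, not the attached code); the increment form shown, `(h % mask) | 1`, is the one quoted later in this thread, and the modulus is exactly the startup cost `dfib` avoids:

    def double(h, nbits):
        # Sketch of the `double` prober, assuming the increment quoted
        # later in the thread: a non-trivial modulus up front, forced odd
        # so the probe sequence eventually visits every slot.
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        inc = (h % mask) | 1
        while True:
            i = (i + inc) & mask
            yield i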

Some explanations and cautions. An advantage of sticking with pure Python instead of C is that spare moments are devoted to investigating the problem instead of thrashing with micro-optimizations ;-)

Why does the current scheme suffer for small tables? With hindsight it's pretty obvious: it can visit table slots more than once (the other schemes cannot), and while the odds of that happening are tiny in large tables, they're high for small tables.

For example, in a 3-bit table (8 slots), suppose the initial index is 1, and there's a collision. The 5*i+1 recurrence _would_ visit slot 6 next, but we also shift in 5 new bits and add them. If those bits are "random", there's a 1-in-8 chance that we'll visit index 1 _again_. If we don't, and have another collision, shifting in the next 5 bits has a 2-in-8 chance of repeating one of the two slots we already visited. And so on.

Here using the `str(i)` generator (which yields random-ish hash codes):

"""
bits 3 nslots 8 dlen 5 alpha 0.62 # built 20,000
...
prober current
    found min 1:74.80% max 15 mean 1.42
    fail min 1:37.62% max 18 mean 2.66
"""

So even though only 5 slots are filled, in at least one case it took 15 probes to find an existing key, and 18 probes to realize a key was missing. In the other schemes, it takes at most 5 probes for success and 6 probes for failure.

This is worse for 64-bit hash codes than for 32-bit ones, because we can loop around twice as often before `perturb` permanently becomes 0 (at which point the pure 5*i+1 recurrence visits each slot at most once).

The larger the table, the less likely a repeat due to `perturb` becomes. For example, suppose we have an 8-bit table and again visit index 1 first. We may visit any index in range(6, 6+32) next (depending on the 5 fresh bits shifted in), but repeating 1 is _not_ a possibility.

Why do the double hashing methods do better for the `i << 12` generator? In any case where the hash codes have "a lot" of trailing bits in common, they all map to the same table index at first, and the probe sequence remains the same for all until the loop shifts `perturb` far enough right that the rightmost differing bits finally show up in the addition. In the double hashing methods, _all_ the bits of the hash code affect the value of `inc` computed before the loop starts, so they have a good chance of differing already on the second probe. This is again potentially worse for the current scheme with 64-bit hash codes, since the sheer number of common trailing bits _can_ be up to 63.

However, the double-hashing schemes have pathological cases too, cases that the current scheme avoids. The first I tried was a `yield i * 1023` generator. These are spectacularly _good_ values for all schemes except `uniform` for successful searches, because

    i*1023 = j*1023 (mod 2**k)  implies  i = j (mod 2**k)

That is, there are no first-probe collisions at all across a contiguous range of `i` with no more than 2**k values.

But the `double` scheme for a 10-bit table degenerates to linear probing in this case, because

    inc = (h % mask) | 1    # force it odd

is always 1 when h is divisible by 1023 (== mask for a 10-bit table). This is terrible for a failing search; e.g., for a 20-bit table (note that 2**20-1 is divisible by 1023):

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.04
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 699049 mean 1867.51
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 427625 mean 8.09
prober uniform
    found min 1:66.65% max 24 mean 1.65
    fail min 1:33.35% max 35 mean 3.00
"""

While that's a failing-search disaster for `double`, it's also bad for `dfib` (& I don't have a slick explanation for that).

So where does that leave us? I don't know. All schemes have good & bad points ;-) I haven't yet thought of a cheap way to compute an `inc` for double-hashing that isn't vulnerable to bad behavior for _some_ easily constructed set of int keys. If you forget "cheap", it's easy; e.g.,

    random.seed(h)
    inc = random.choice(range(1, mask + 1, 2))

Regardless, I'll attach the current version of the code.
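A quick way to see the degenerate increment (my snippet, not part of the attached program): for a 10-bit table, mask is 1023, so every hash that's a multiple of 1023 produces inc == 1, i.e. stride-1 (linear) probing; and since 2**20 - 1 is divisible by 1023, the 20-bit table keeps feeding multiples of 1023 into that `| 1` as well:

    nbits = 10
    mask = (1 << nbits) - 1                       # 1023
    incs = {((i * 1023) % mask) | 1 for i in range(1, 10_000)}
    print(incs)                                   # {1}: every such key probes with stride 1

    print((2 ** 20 - 1) % 1023)                   # 0: so at 20 bits, h % mask is
                                                  # still always a multiple of 1023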

[MRAB <python@mrabarnett.plus.com>]
If the current scheme suffers only for small tables, couldn't you use an alternative scheme only for small tables?
Sure. But whether that's desirable partly depends on timing actual C code. Try it ;-) For maintenance sanity, it's obviously better to have only one scheme to wrestle with.

Note that "may visit a slot more than once" isn't the only thing in play, just one of the seemingly important things. For example, the current scheme requires 3 adds, 2 shifts, and a mask on each loop iteration (or 1 add and 1 shift of those could be replaced by 1 multiplication). The double-hashing schemes only require 1 add and 1 mask per iteration. In cases of collision, that difference is probably swamped by waiting for cache misses.

But, as I said in the first msg:

"""
This is a brain dump for someone who's willing to endure the
interminable pain of arguing about benchmarks ;-)
"""
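To make those operation counts concrete, here are the two inner-loop updates written out as tiny Python helpers (an illustration only - the names are stand-ins, and the real comparison is at the C level):

    def current_step(i, perturb, mask):
        # 3 adds, 2 shifts, 1 mask (5*i spelled as (i << 2) + i)
        perturb >>= 5
        i = ((i << 2) + i + perturb + 1) & mask
        return i, perturb

    def double_step(i, inc, mask):
        # 1 add, 1 mask
        return (i + inc) & mask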

[Tim]
Heh. I always tell people not to believe what they think: run code to make sure! So I did ;-)

    def duni(h, nbits):
        from random import seed, choice
        mask = (1 << nbits) - 1
        i = h & mask
        yield i
        seed(h)
        inc = choice(range(1, mask + 1, 2))
        while True:
            i = (i + inc) & mask
            yield i

On the terrible-for-good-reasons-for-failing-`double`-1023*i-searches case, it turned out `duni` did just as bad as `dfib`:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
crisp ... skipping (slow!)
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.04
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 699049 mean 1867.51
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 427625 mean 8.09
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 433846 mean 9.84
prober uniform
    found min 1:66.65% max 24 mean 1.65
    fail min 1:33.35% max 35 mean 3.00
"""

Yikes! That really surprised me. So - are integers of the form 1023*i so magical they also manage to provoke random.seed() into amazingly regular behavior? Or is there something about the whole idea of "double hashing" that leaves it vulnerable no matter how "random" an odd increment it picks?

It's not the former. More code showed that a great number of distinct `inc` values are in fact used. And double hashing is widely & successfully used in other contexts, so it's not a problem with the idea _in general_.

What does appear to be the case: it doesn't always play well with taking the last `nbits` bits as the initial table index. In other double-hashing contexts, they strive to pick a pseudo-random initial table index too.

Why this matters: for any odd integer N,

    N*i = N*j (mod 2**k)  if and only if  i = j (mod 2**k)

So, when integers are their own hash codes, for any `i` all the values in range(i*N, i*N + N*2**k, N) hash to unique slots in a table with 2**k slots (actually any table with 2**m slots for any m >= k). In particular, mod 2**k they map to the slots

    i*N + 0*N
    i*N + 1*N
    i*N + 2*N
    i*N + 3*N
    ...

So, on a failing search for an integer of the form j*N, whenever double-hashing picks an increment that happens to be a multiple of N, it can bump into a huge chain of collisions, because double-hashing also uses a fixed increment between probes. And this is so for any odd N. For example, using a "yield i * 11" generator:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
...
prober current
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 34 mean 3.00
prober double
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 667274 mean 34.10
prober dfib
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 539865 mean 9.30
prober duni
    found min 1:100.00% max 1 mean 1.00
    fail min 1:33.33% max 418562 mean 8.65
prober uniform
    found min 1:66.63% max 25 mean 1.65
    fail min 1:33.31% max 32 mean 3.00
"""

All three double-hashing variants have horrible max collision chains, for the reasons just explained.

In "normal" double-hashing contexts, the initial table indices are scrambled; it's my fault they follow a (mod 2**k) arithmetic progression in Python. Which fault I gladly accept ;-) It's valuable in practice. But, so long as that stays, it kills the double-hashing idea for me: while it's great for random-ish hash codes, the worst cases are horribly bad and very easily provoked (even by accident).
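To make the N*i argument concrete, here's a small simulation (mine, not the attached program) of a failing search against keys of the form i*N: the occupied slots form exactly the arithmetic progression described above, so an increment that's a multiple of N can crawl through an enormous collision chain, while a typical other increment escapes after a few probes:

    N, k = 11, 16
    nslots = 1 << k
    mask = nslots - 1
    dlen = 2 * nslots // 3                           # ~2/3 load, as in the runs above
    occupied = {(i * N) & mask for i in range(dlen)} # keys i*N hash to themselves

    def probes_until_empty(start, inc):
        i, count = start, 1
        while i in occupied:
            i = (i + inc) & mask
            count += 1
        return count

    # Simulate failing searches for a key like (5 + 2**k) * N: its hash maps
    # to the same initial slot as the stored key 5*N.
    start = (5 * N) & mask
    print(probes_until_empty(start, 3 * N))          # increment a multiple of N: huge chain
    print(probes_until_empty(start, 3 * N + 2))      # typical other increment: a few probes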
