[Python-ideas] Reducing collisions in small dicts/sets

Tim Peters tim.peters at gmail.com
Mon Jun 26 23:18:04 EDT 2017


[Tim]
>...  I haven't yet thought of a cheap way to compute an
> `inc` for double-hashing that isn't vulnerable to bad behavior for
> _some_ easily constructed set of int keys.  If you forget "cheap",
> it's easy; e.g.,
>
>     random.seed(h)
>     inc = random.choice(range(1, mask + 1, 2))

Heh.  I always tell people not to believe what they think:  run code
to make sure!  So I did ;-)

def duni(h, nbits):
    # First probe: the last `nbits` bits of the hash, same as the other probers.
    from random import seed, choice
    mask = (1 << nbits) - 1
    i = h & mask
    yield i
    # Then pick a pseudo-random odd increment (seeded by the hash, so it's
    # reproducible for a given key) and add it forever; an odd step is
    # coprime to the table size, so all 2**nbits slots get visited.
    seed(h)
    inc = choice(range(1, mask + 1, 2))
    while True:
        i = (i + inc) & mask
        yield i
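
For reference, a prober like that is just consumed as a stream of slot
indices; e.g. (the arguments here are made up):

    from itertools import islice
    g = duni(1023 * 12345, 10)     # hypothetical hash value and table-size bits
    print(list(islice(g, 5)))      # initial slot, then 4 more, spaced by one fixed odd inc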

On the 1023*i case - the one that's terrible, for good reasons, for
failing `double` searches - it turned out `duni` did just as badly as `dfib`:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
    ...
                                 crisp ... skipping (slow!)
    prober current
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 34 mean 3.04
    prober double
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 699049 mean 1867.51
    prober dfib
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 427625 mean 8.09
    prober duni
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 433846 mean 9.84
    prober uniform
        found min 1:66.65% max 24 mean 1.65
        fail  min 1:33.35% max 35 mean 3.00
"""

Yikes!  That really surprised me.  So - are integers of the form
1023*i so magical they also manage to provoke random.seed() into
amazingly regular behavior?  Or is there something about the whole
idea of "double hashing" that leaves it vulnerable no matter how
"random" an odd increment it picks?

It's not the former.  More code showed that a great number of distinct
`inc` values are in fact used.

And double hashing is widely & successfully used in other contexts, so
it's not a problem with the idea _in general_.

What does appear to be the case:  double hashing doesn't always play
well with taking the last `nbits` bits of the hash as the initial table
index.  In other double-hashing contexts, implementations strive to pick
a pseudo-random initial table index too.
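
For example, one way to scramble the initial index (emphatically not
what CPython does, and not any of the probers above - just a sketch of
the idea) is to multiply by a big odd constant and take the high bits:

def scrambled_first_probe(h, nbits):
    # Multiply by a 64-bit odd constant (about 2**64 / golden ratio) and
    # keep the top `nbits` bits, so high and low bits of `h` both end up
    # affecting the initial slot.
    M = 0x9E3779B97F4A7C15
    return ((h * M) & ((1 << 64) - 1)) >> (64 - nbits)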

Why this matters:  for any odd integer N,

    N*i = N*j (mod 2**k)

if and only if

    i = j (mod 2**k)

So, when integers are their own hash codes, for any `i` all the values in

    range(i*N, i*N + N*2**k, N)

hash to unique slots in a table with 2**k slots (actually any table
with 2**m slots for any m >= k).  In particular, mod 2**k they map to
the slots

    i*N + 0*N
    i*N + 1*N
    i*N + 2*N
    i*N + 3*N
    ...
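
A quick check of that claim (the specific N, k, and i below are
arbitrary picks of mine - any odd N works):

    N, k, i = 1023, 10, 7
    mask = (1 << k) - 1
    slots = {v & mask for v in range(i*N, i*N + N * 2**k, N)}
    assert len(slots) == 2**k      # all 2**k slots hit, each exactly once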

So, on a failing search for an integer of the form j*N, whenever
double-hashing picks an increment that happens to be a multiple of N,
it can bump into a huge chain of collisions, because double-hashing
also uses a fixed increment between probes.  And this is so for any
odd N.  For example, using a "yield i * 11" generator:

"""
bits 20 nslots 1,048,576 dlen 699,050 alpha 0.67 # built 1
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
    ...
    prober current
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 34 mean 3.00
    prober double
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 667274 mean 34.10
    prober dfib
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 539865 mean 9.30
    prober duni
        found min 1:100.00% max 1 mean 1.00
        fail  min 1:33.33% max 418562 mean 8.65
    prober uniform
        found min 1:66.63% max 25 mean 1.65
        fail  min 1:33.31% max 32 mean 3.00
"""

All three double-hashing variants have horrible max collision chains,
for the reasons just explained.  In "normal" double-hashing contexts,
the initial table indices are scrambled; it's my fault they follow a
(mod 2**k) arithmetic progression in Python.

Which fault I gladly accept ;-)  It's valuable in practice.

But, so long as that stays, it kills the double-hashing idea for me:
while it's great for random-ish hash codes, the worst cases are
horribly bad and very easily provoked (even by accident).

