[Python-Dev] Hash collision security issue (now public)

Sun Jan 1 17:20:34 CET 2012

Am 01.01.2012 06:57, schrieb Paul McMillan:
> I agree that doing anything is better than doing nothing. If we use
> the earlier suggestion and prepend everything with a fixed-length
> seed, we need quite a bit of entropy (and so a fairly long string) to
> make a lasting difference.

Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2
MB (2**21 - 1) data from urandom. I'm worried that this is going to
exhaust the OS's random pool and suck it dry. We shouldn't forget that
Python is used for long running processes as well as short scripts. Your
suggestion also increases the process size by 2 MB which is going to be
an issue for mobile and embedded platforms.

How about this:

r = [ord(i) for i in os.urandom(256)]
rs = os.urandom(4) # or 8 ?
seed = rs[-1] + (rs[-2] << 8) + (rs[-3] << 16) + (rs[-4] << 24)

def _hash_string(s):
    """The algorithm behind compute_hash() for a string or a unicode."""
    from pypy.rlib.rarithmetic import intmask
    length = len(s)
    if length == 0:
        return -1
    x = intmask(seed + (ord(s[0]) << 7))
    i = 0
    while i < length:
        o = ord(s[i])
        x = intmask((1000003*x) ^ o ^ r[o % 0xff]
        i += 1
    x ^= length
    return intmask(x)

This combines a random seed for the hash with your suggestion.

We also need to special case short strings. The above routine hands over
the seed to attackers, if he is able to retrieve lots of single
character hashes. The randomization shouldn't be used if we can prove
that it's not possible to create hash collisions for strings shorter
than X. For example 64bit FNV-1 has no collisions for 8 chars or less,
32bit FNV has no collisions for 4 or less cars.

Christian

Christian