# [Python-Dev] Hash collision security issue (now public)

Christian Heimes lists at cheimes.de
Sun Jan 1 17:20:34 CET 2012

```Am 01.01.2012 06:57, schrieb Paul McMillan:
> I agree that doing anything is better than doing nothing. If we use
> the earlier suggestion and prepend everything with a fixed-length
> seed, we need quite a bit of entropy (and so a fairly long string) to
> make a lasting difference.

MB (2**21 - 1) data from urandom. I'm worried that this is going to
exhaust the OS's random pool and suck it dry. We shouldn't forget that
Python is used for long running processes as well as short scripts. Your
suggestion also increases the process size by 2 MB which is going to be
an issue for mobile and embedded platforms.

r = [ord(i) for i in os.urandom(256)]
rs = os.urandom(4) # or 8 ?
seed = rs[-1] + (rs[-2] << 8) + (rs[-3] << 16) + (rs[-4] << 24)

def _hash_string(s):
"""The algorithm behind compute_hash() for a string or a unicode."""
length = len(s)
if length == 0:
return -1
x = intmask(seed + (ord(s[0]) << 7))
i = 0
while i < length:
o = ord(s[i])
x = intmask((1000003*x) ^ o ^ r[o % 0xff]
i += 1
x ^= length

This combines a random seed for the hash with your suggestion.

We also need to special case short strings. The above routine hands over
the seed to attackers, if he is able to retrieve lots of single
character hashes. The randomization shouldn't be used if we can prove
that it's not possible to create hash collisions for strings shorter
than X. For example 64bit FNV-1 has no collisions for 8 chars or less,
32bit FNV has no collisions for 4 or less cars.

Christian

Christian
```