[Python-Dev] Hash collision security issue (now public)

Sat Jan 7 12:02:04 CET 2012

Christian Heimes, 31.12.2011 04:59:
> Am 31.12.2011 03:22, schrieb Victor Stinner:
> The unique structure of CPython's dict implementation makes it harder to
> get the number of values with equal hash. The academic hash map (the one
> I learnt about at university) uses a bucket to store all elements with
> equal hash (more precise hash: mod mask). However Python's dict however
> perturbs the hash until it finds a free slot its array. The second,
> third ... collision can be caused by a legit and completely different
> (!) hash.
> 
>> The last choice is to change the hash algorithm. The *idea* is the same 
>> than adding salt to hashed password (in practice it will be a little bit 
>> different): if a pseudo-random salt is added, the attacker cannot 
>> prepare a single dataset, he/she will have to regenerate a new dataset 
>> for each possible salt value. If the salt is big enough (size in bits), 
>> the attacker will need too much CPU to generate the dataset (compute N 
>> keys with the same hash value). Basically, it slows down the attack by 
>> 2^(size of the salt).
> 
> That's the idea of randomized hashing functions as implemented by Ruby
> 1.8, Perl and others. The random seed is used as IV. Multiple rounds of
> multiply, XOR and MOD (integer overflows) cause a deviation. In your
> other posting you were worried about the performance implication. A
> randomized hash function just adds a single ADD operation, that's all.
> 
> Downside: With randomization all hashes are unpredictable and change
> after every restart of the interpreter. This has some subtle side
> effects like a different outcome of {a:1, b:1, c:1}.keys() after a
> restart of the interpreter.
> 
>> Another possibility would be to replace our fast hash function by a 
>> better hash function like MD5 or SHA1 (so the creation of the dataset 
>> would be too slow in practice = too expensive), but cryptographic hash 
>> functions are much slower (and so would slow down Python too much).
> 
> I agree with your analysis. Cryptographic hash functions are far too
> slow for our use case. During my research I found another hash function
> that claims to be fast and that may not be vulnerable to this kind of
> attack: http://isthe.com/chongo/tech/comp/fnv/

Wouldn't Bob Jenkins' "lookup3" hash function fit in here? After all, it's
portable, known to provide a very good distribution for different string
values and is generally fast on both 32 and 64 bit architectures.

http://burtleburtle.net/bob/c/lookup3.c

The analysis is here:

http://burtleburtle.net/bob/hash/doobs.html

It seems that there's also support for generating 64bit hash values
(actually 2x32bits) efficiently.

Admittedly, this may require some adaptation for the PEP393 unicode memory
layout in order to produce identical hashes for all three representations
if they represent the same content. So it's not a drop-in replacement.

Stefan