[Python-Dev] Hash collision security issue (now public)

Sat Dec 31 04:59:41 CET 2011

Am 31.12.2011 03:22, schrieb Victor Stinner:
> The creation of a Python dictionary has a complexity of O(n) in most 
> cases, but O(n^2) in the *worst* case. The attack tries to go into the 
> worst case. It requires to compute a set of N keys having the same hash 
> value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to 
> compute these keys once. It looks like it is now cheap enough in 
> practice to compute this dataset for Python (and other languages).

Correct. The meet-in-the-middle attack and the unique properties of
algorithms that are similar to DJBX33A and DJBX33A make the attack easy
on platforms with 32bit hash.

> A countermeasure would be to check that we don't have more than X keys 
> with the same hash value. But in practice, we don't know in advance how 
> data are processed, and there are too many input vectors in various formats.
> 
> If we want to fix something, it should be done in the implementation of 
> the dict type or in the hash algorithm. We can implement dict 
> differently to avoid this issue, using a binary tree for example. 
> Because dict is a fundamental type in Python, I don't think that we can 
> change its implementation (without breaking backward compatibility and 
> so applications in production). A possibility would be to add a *new* 
> type, but all libraries and applications would need to be changed to fix 
> the vulnerability.

A BTree is too slow for common operations, it's O(log n) instead of O(1)
in average. We can't replace our dict with a btree type. A new btree
type is a lot of work, too.

The unique structure of CPython's dict implementation makes it harder to
get the number of values with equal hash. The academic hash map (the one
I learnt about at university) uses a bucket to store all elements with
equal hash (more precise hash: mod mask). However Python's dict however
perturbs the hash until it finds a free slot its array. The second,
third ... collision can be caused by a legit and completely different
(!) hash.

> The last choice is to change the hash algorithm. The *idea* is the same 
> than adding salt to hashed password (in practice it will be a little bit 
> different): if a pseudo-random salt is added, the attacker cannot 
> prepare a single dataset, he/she will have to regenerate a new dataset 
> for each possible salt value. If the salt is big enough (size in bits), 
> the attacker will need too much CPU to generate the dataset (compute N 
> keys with the same hash value). Basically, it slows down the attack by 
> 2^(size of the salt).

That's the idea of randomized hashing functions as implemented by Ruby
1.8, Perl and others. The random seed is used as IV. Multiple rounds of
multiply, XOR and MOD (integer overflows) cause a deviation. In your
other posting you were worried about the performance implication. A
randomized hash function just adds a single ADD operation, that's all.

Downside: With randomization all hashes are unpredictable and change
after every restart of the interpreter. This has some subtle side
effects like a different outcome of {a:1, b:1, c:1}.keys() after a
restart of the interpreter.

> Another possibility would be to replace our fast hash function by a 
> better hash function like MD5 or SHA1 (so the creation of the dataset 
> would be too slow in practice = too expensive), but cryptographic hash 
> functions are much slower (and so would slow down Python too much).

I agree with your analysis. Cryptographic hash functions are far too
slow for our use case. During my research I found another hash function
that claims to be fast and that may not be vulnerable to this kind of
attack: http://isthe.com/chongo/tech/comp/fnv/

Christian