[Python-Dev] Hash collision security issue (now public)
victor.stinner at haypocalc.com
Sat Dec 31 03:22:24 CET 2011
Le 29/12/2011 02:28, Michael Foord a écrit :
> A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages Python included:
This PDF doesn't explain exactly the problem and how it can be solved.
Let's try to summarize this "vulnerability".
The creation of a Python dictionary has a complexity of O(n) in most
cases, but O(n^2) in the *worst* case. The attack tries to go into the
worst case. It requires to compute a set of N keys having the same hash
value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to
compute these keys once. It looks like it is now cheap enough in
practice to compute this dataset for Python (and other languages).
A countermeasure would be to check that we don't have more than X keys
with the same hash value. But in practice, we don't know in advance how
data are processed, and there are too many input vectors in various formats.
If we want to fix something, it should be done in the implementation of
the dict type or in the hash algorithm. We can implement dict
differently to avoid this issue, using a binary tree for example.
Because dict is a fundamental type in Python, I don't think that we can
change its implementation (without breaking backward compatibility and
so applications in production). A possibility would be to add a *new*
type, but all libraries and applications would need to be changed to fix
The last choice is to change the hash algorithm. The *idea* is the same
than adding salt to hashed password (in practice it will be a little bit
different): if a pseudo-random salt is added, the attacker cannot
prepare a single dataset, he/she will have to regenerate a new dataset
for each possible salt value. If the salt is big enough (size in bits),
the attacker will need too much CPU to generate the dataset (compute N
keys with the same hash value). Basically, it slows down the attack by
2^(size of the salt).
Another possibility would be to replace our fast hash function by a
better hash function like MD5 or SHA1 (so the creation of the dataset
would be too slow in practice = too expensive), but cryptographic hash
functions are much slower (and so would slow down Python too much).
Limiting the size of the POST data doesn't solve the problem because
there are many other input vectors and data formats. It may block the
most simple attacks because the attacker cannot inject enough keys to
slow down your CPU.
More information about the Python-Dev