Le 29/12/2011 02:28, Michael Foord a écrit :
A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages Python included:
This PDF doesn't explain exactly the problem and how it can be solved. Let's try to summarize this "vulnerability".
The creation of a Python dictionary has a complexity of O(n) in most cases, but O(n^2) in the *worst* case. The attack tries to go into the worst case. It requires to compute a set of N keys having the same hash value (hash(key1) == hash(key2) == ... hash(keyN)). It only has to compute these keys once. It looks like it is now cheap enough in practice to compute this dataset for Python (and other languages).
A countermeasure would be to check that we don't have more than X keys with the same hash value. But in practice, we don't know in advance how data are processed, and there are too many input vectors in various formats.
If we want to fix something, it should be done in the implementation of the dict type or in the hash algorithm. We can implement dict differently to avoid this issue, using a binary tree for example. Because dict is a fundamental type in Python, I don't think that we can change its implementation (without breaking backward compatibility and so applications in production). A possibility would be to add a *new* type, but all libraries and applications would need to be changed to fix the vulnerability.
The last choice is to change the hash algorithm. The *idea* is the same than adding salt to hashed password (in practice it will be a little bit different): if a pseudo-random salt is added, the attacker cannot prepare a single dataset, he/she will have to regenerate a new dataset for each possible salt value. If the salt is big enough (size in bits), the attacker will need too much CPU to generate the dataset (compute N keys with the same hash value). Basically, it slows down the attack by 2^(size of the salt).
Another possibility would be to replace our fast hash function by a better hash function like MD5 or SHA1 (so the creation of the dataset would be too slow in practice = too expensive), but cryptographic hash functions are much slower (and so would slow down Python too much).
Limiting the size of the POST data doesn't solve the problem because there are many other input vectors and data formats. It may block the most simple attacks because the attacker cannot inject enough keys to slow down your CPU.