[Python-Dev] Hash collision security issue (now public)

Victor Stinner victor.stinner at haypocalc.com
Sun Jan 1 18:32:51 CET 2012

Le 01/01/2012 04:29, Paul McMillan a écrit :
> This is incorrect. Once an attacker has guessed the random seed, any
> operation which reveals the ordering of hashed objects can be used to
> verify the answer. JSON responses would be ideal. In fact, an attacker
> can do a brute-force attack of the random seed offline. Once they have
> the seed, generating collisions is a fast process.

If we want to protect a website against this attack for example, we must 
suppose that the attacker can inject arbitrary data and can get 
(indirectly) the result of hash(str) (e.g. with the representation of a 
dict in a traceback, with a JSON output, etc.).

> The goal isn't perfection, but we need to do better than a simple
> salt.

I disagree. I don't want to break backward compatibility and have a 
hash() function different for each process, if the change is not an 
effective protection against the "hash vulnerability".

It's really hard to write a good (secure) hash function: see for example 
the recent NIST competition (started in 2008, will end this year). Even 
good security researcher are unable to write a strong and fast hash 
function. It's easy to add a weakness in the function if you don't have 
a good background in cryptography. The NIST competition gives 4 years to 
analyze new hash functions. We should not rush to add a quick "hack" if 
it doesn't solve correctly the problem (protect against a collision 
attack and preimage attack).


Runtime performance does matter, I'm not completly sure that changing 
Python is the best place to add a countermeasure against a 
vulnerability. I don't want to slow down numpy for a web vulnerability. 
Because there are different use cases, a better compromise is maybe to 
add a runtime option to use a secure hash function, and keep the unsafe 
but fast hash function by default.

> I propose we modify the string hash function like this:
> https://gist.github.com/0a91e52efa74f61858b5

Always allocate 2**21 bytes just to workaround one specific kind of 
attack is not acceptable. I suppose that the maximum acceptable is 4096 
bytes (or better 256 bytes).

Crytographic hash functions don't need random data, why would Python 
need 2 MB (!) for its hash function?


More information about the Python-Dev mailing list