[Python-Dev] Counting collisions for the win

Fri Jan 20 01:48:53 CET 2012

Hi,

I'm working on the hash collision issue since 2 or 3 weeks. I
evaluated all solutions and I think that I have now a good knowledge
of the problem and how it should be solved. The major issue is to have
a minor or no impact on applications (don't break backward
compatibility). I saw three major solutions:

 - use a randomized hash
 - use two hashes, a randomized hash and the actual hash kept for
backward compatibility
 - count collisions on dictionary lookup

Using a randomized hash does break a lot of tests (e.g. tests relying
on the representation of a dictionary). The patch is huge, too big to
backport it directly on stable versions. Using a randomized hash may
also break (indirectly) real applications because the application
output is also somehow "randomized". For example, in the Django test
suite, the HTML output is different at each run. Web browsers may
render the web page differently, or crash, or ... I don't think that
Django would like to sort attributes of each HTML tag, just because we
wanted to fix a vulnerability.

Randomized hash has also a major issue: if the attacker is able to
compute the secret, (s)he can easily compute collisions and exploit
the hash collision vulnerability again. I don't know exactly how
complex it is to compute the secret, but our hash function is weak (it
is far from being cryptographic, it is really simple to run it
backward). If someone writes a fast function to compute the secret, we
will go back to the same point.

IMO using two hashes has the same disavantages of the randomized hash
solution, whereas it is more complex to implement.

The last solution is very simple: count collision and raise an
exception if it hits a limit. The path is something like 10 lines
whereas the randomized hash is more close to 500 lines, add a new
file, change Visual Studio project file, etc. First I thaught that it
would break more applications than the randomized hash, but I tried on
Django: the test suite fails with a limit of 20 collisions, but not
with a limit of 50 collisions, whereas the patch uses a limit of 1000
collisions. According to my basic tests, a limit of 35 collisions
requires a dictionary with more than 10,000,000 integer keys to raise
an error. I am not talking about the attack, but valid data.

More details about my tests on the Django test suite:
http://bugs.python.org/issue13703#msg151620

--

I propose to solve the hash collision vulnerability by counting
collisions because it does fix the vulnerability with a minor or no
impact on applications or backward compatibility. I don't see why we
should use a different fix for Python 3.3. If counting collisons
solves the issue for stable versions, it is also enough for Python
3.3. We now know all issues of the randomized hash solution, and I
think that there are more drawbacks than advantages. IMO the
randomized hash is overkill to fix the hash collision issue.

I just have some requests on Marc Andre Lemburg patch:

 - the limit should be configurable: a new function in the sys module
should be enough. It may be private (or replaced by an environment
variable?) in stable versions
 - the set type should also be patched (I didn't check if it is
vulnerable or not using the patch)
 - the patch has no test! (a class with a fixed hash should be enough
to write a test)
 - the limit must be documented somwhere
 - the exception type should be different than KeyError

Victor