<div class="gmail_quote">On Thu, Jan 19, 2012 at 4:48 PM, Victor Stinner <span dir="ltr">&lt;<a href="mailto:victor.stinner@haypocalc.com">victor.stinner@haypocalc.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Hi,<br>

<br>

I&#39;m working on the hash collision issue since 2 or 3 weeks. I<br>

evaluated all solutions and I think that I have now a good knowledge<br>

of the problem and how it should be solved. The major issue is to have<br>

a minor or no impact on applications (don&#39;t break backward<br>

compatibility). I saw three major solutions:<br>

<br>

 - use a randomized hash<br>

 - use two hashes, a randomized hash and the actual hash kept for<br>

backward compatibility<br>

 - count collisions on dictionary lookup<br>

<br>

Using a randomized hash does break a lot of tests (e.g. tests relying<br>

on the representation of a dictionary). The patch is huge, too big to<br>

backport it directly on stable versions. Using a randomized hash may<br>

also break (indirectly) real applications because the application<br>

output is also somehow &quot;randomized&quot;. For example, in the Django test<br>

suite, the HTML output is different at each run. Web browsers may<br>

render the web page differently, or crash, or ... I don&#39;t think that<br>

Django would like to sort attributes of each HTML tag, just because we<br>

wanted to fix a vulnerability.<br>

<br>

Randomized hash has also a major issue: if the attacker is able to<br>

compute the secret, (s)he can easily compute collisions and exploit<br>

the hash collision vulnerability again. I don&#39;t know exactly how<br>

complex it is to compute the secret, but our hash function is weak (it<br>

is far from being cryptographic, it is really simple to run it<br>

backward). If someone writes a fast function to compute the secret, we<br>

will go back to the same point.<br>

<br>

IMO using two hashes has the same disavantages of the randomized hash<br>

solution, whereas it is more complex to implement.<br>

<br>

The last solution is very simple: count collision and raise an<br>

exception if it hits a limit. The path is something like 10 lines<br>

whereas the randomized hash is more close to 500 lines, add a new<br>

file, change Visual Studio project file, etc. First I thaught that it<br>

would break more applications than the randomized hash, but I tried on<br>

Django: the test suite fails with a limit of 20 collisions, but not<br>

with a limit of 50 collisions, whereas the patch uses a limit of 1000<br>

collisions. According to my basic tests, a limit of 35 collisions<br>

requires a dictionary with more than 10,000,000 integer keys to raise<br>

an error. I am not talking about the attack, but valid data.<br>

<br>

More details about my tests on the Django test suite:<br>

<a href="http://bugs.python.org/issue13703#msg151620" target="_blank">http://bugs.python.org/issue13703#msg151620</a><br>

<br>

--<br>

<br>

I propose to solve the hash collision vulnerability by counting<br>

collisions because it does fix the vulnerability with a minor or no<br>

impact on applications or backward compatibility. I don&#39;t see why we<br>

should use a different fix for Python 3.3. If counting collisons<br>

solves the issue for stable versions, it is also enough for Python<br>

3.3. We now know all issues of the randomized hash solution, and I<br>

think that there are more drawbacks than advantages. IMO the<br>

randomized hash is overkill to fix the hash collision issue.<br></blockquote><div><br></div><div>+1</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I just have some requests on Marc Andre Lemburg patch:<br>

<br>

 - the limit should be configurable: a new function in the sys module<br>

should be enough. It may be private (or replaced by an environment<br>

variable?) in stable versions<br>

 - the set type should also be patched (I didn&#39;t check if it is<br>

vulnerable or not using the patch)<br>

 - the patch has no test! (a class with a fixed hash should be enough<br>

to write a test)<br>

 - the limit must be documented somwhere<br>

 - the exception type should be different than KeyError<br>

<br>

Victor<br>

_______________________________________________<br>

Python-Dev mailing list<br>

<a href="mailto:Python-Dev@python.org">Python-Dev@python.org</a><br>

<a href="http://mail.python.org/mailman/listinfo/python-dev" target="_blank">http://mail.python.org/mailman/listinfo/python-dev</a><br>

Unsubscribe: <a href="http://mail.python.org/mailman/options/python-dev/guido%40python.org" target="_blank">http://mail.python.org/mailman/options/python-dev/guido%40python.org</a><br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>