Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app having &gt;1000 collisions are way smaller than the chances that somebody&#39;s code will break due to the variable hashing. In fact we know for a fact that the latter will break code, since it changes the order of items in a dict. This affects many  tests written without this in mind, and I assume there will be some poor sap out there who uses Python&#39;s hash() function to address some external persistent hash table or some other external datastructure. How pathological the data needs to be before the collision counter triggers? I&#39;d expect *very* pathological.<br>


<br>This is depending on how the counting is done (I didn&#39;t look at MAL&#39;s patch), and assuming that increasing the hash table size will generally reduce collisions if items collide but their hashes are different.<br>


<br>That said, even with collision counting I&#39;d like a way to disable it without changing the code, e.g. a flag or environment variable.<br><br>--Guido<br><br><div class="gmail_quote">On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner <span dir="ltr">&lt;<a href="mailto:victor.stinner@haypocalc.com">victor.stinner@haypocalc.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Many people proposed their own idea to fix the vulnerability, but only<br>

3 wrote a patch:<br>

<br>

- Glenn Linderman proposes to fix the vulnerability by adding a new<br>

&quot;safe&quot; dict type (only accepting string keys). His proof-of-concept<br>

(SafeDict.py) uses a secret of 64 random bits and uses it to compute<br>

the hash of a key.<br>

- Marc Andre Lemburg proposes to fix the vulnerability directly in<br>

dict (for any key type). The patch raises an exception if a lookup<br>

causes more than 1000 collisions.<br>

- I propose to fix the vulnerability only in the Unicode hash (not for<br>

other types). My patch adds a random secret initialized at startup (it<br>

can be disabled or fixed using an environment variable).<br>

<br>

--<br>

<br>

I consider that Glenn&#39;s proposition is not applicable in practice<br>

because all applications and all libraries have to be patched to use<br>

the new &quot;safe&quot; dict type.<br>

<br>

Some people are concerned by possible regression introduced by Marc&#39;s<br>

proposition: his patch may raise an exception for legitimate data.<br>

<br>

My proposition tries to be &quot;just enough&quot; secure with a low (runtime<br>

performance) overhead. My patch becomes huge (and so backporting is<br>

more complex), whereas Marc&#39;s patch is very simple and so trivial to<br>

backport.<br>

<br>

--<br>

<br>

It is still unclear to me if the fix should be enabled by default for<br>

Python &lt; 3.3. Because the overhead (of my patch) is low, I would<br>

prefer to enable the fix by default, to protect everyone with a simple<br>

Python upgrade.<br>

<br>

I prefer to explain how to disable explicitly the randomized hash<br>

(PYTHONHASHSEED=0) (or how to fix application bugs) to people having<br>

troubles with randomized hash, instead of leaving the hole open by<br>

default.<br>

<br>

--<br>

<br>

We might change hash() for types other than str, but it looks like web<br>

servers are only concerned by dict with string keys.<br>

<br>

We may use Paul&#39;s hash function if mine is not enough secure.<br>

<br>

My patch doesn&#39;t fix the DoS, it just make the attack more complex.<br>

The attacker cannot pregenerate data for an attack: (s)he has first to<br>

compute the hash secret, and then compute hash collisions using the<br>

secret. The hash secret is a least 64 bits long (128 bits on a 64 bit<br>

system). So I hope that computing collisions requires a lot of CPU<br>

time (is slow) to make the attack ineffective with today computers.<br>

<br>

--<br>

<br>

I plan to write a nice patch for Python 3.3, then write a simpler<br>

patch for 3.1 and 3.2 (duplicate os.urandom code to keep it unchanged,<br>

maybe don&#39;t create a new random.c file, maybe don&#39;t touch the test<br>

suite while the patch breaks many tests), and finally write patches<br>

for Python 2.6 and 2.7.<br>

<br>

Details about my patch:<br>

<br>

- I tested it on Linux (32 and 64 bits) and Windows (Seven 64 bits)<br>

- a new PYTHONSEED environment variable allow to control the<br>

randomized hash: PYTHONSEED=0 disables completly the randomized hash<br>

(restore the previous behaviour), PYTHONSEED=value uses a fixed seed<br>

for processes sharing data and needind same hash values<br>

(multiprocessing users?)<br>

- no overhead on hash(str)<br>

- no startup overhead on Linux<br>

- startup overhead is 10% on Windows (see the issue, I propose another<br>

solution with a startup overhead of 1%)<br>

<br>

The patch is not done, some tests are still failing because of the<br>

randomized hash.<br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

<br>

FYI, PHP released a version 5.3.9 adding &quot;max_input_vars directive to<br>

prevent attacks based on hash collisions (CVE-2011-4885)&quot;.<br>

<br>

Victor<br>

_______________________________________________<br>

Python-Dev mailing list<br>

<a href="mailto:Python-Dev@python.org">Python-Dev@python.org</a><br>

<a href="http://mail.python.org/mailman/listinfo/python-dev" target="_blank">http://mail.python.org/mailman/listinfo/python-dev</a><br>

Unsubscribe: <a href="http://mail.python.org/mailman/options/python-dev/guido%40python.org" target="_blank">http://mail.python.org/mailman/options/python-dev/guido%40python.org</a><br>

</font></span></blockquote></div><br><br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>