<div class="gmail_quote">On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <span dir="ltr">&lt;<a href="mailto:paul@mcmillan.ws">paul@mcmillan.ws</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div class="im">&gt; Still, it would represent an effort for the attacker of a much greater<br>

&gt; magnitude than the current attack. It&#39;s all a trade-off -- at some point<br>

&gt; it&#39;ll just be easier for the attacker to use some other vulnerability. Also<br>

&gt; the class of vulnerable servers would be greatly reduced.<br>

<br>

</div>I agree that doing anything is better than doing nothing. If we use<br>

the earlier suggestion and prepend everything with a fixed-length<br>

seed, we need quite a bit of entropy (and so a fairly long string) to<br>

make a lasting difference.<br><div class="im"></div></blockquote><div><br>Ah, but the effect of that long string is summarized in a single (32- or 64-bit) integer.<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


<div class="im">

&gt; I&#39;m not sure I understand this. What&#39;s the worry about &quot;a different class of<br>

&gt; hash functions&quot;? (It may be clear that I do not have a deep mathematical<br>

&gt; understanding of hash functions.)<br>

<br>

</div>This was mostly in reference to earlier suggestions of switching to<br>

cityhash, or using btrees, or other more invasive changes. Python 2.X<br>

is pretty stable and making large changes like that to the codebase<br>

can have unpredictable effects.</blockquote><div><br>Agreed.<br> <br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We know that the current hash function<br>


works well (except for this specific problem), so it seems like the<br>

best fix will be as minimal a modification as possible, to avoid<br>

introducing bugs.<br><div class="im"></div></blockquote><div><br>Yup.<br> <br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im">


&gt; I forget -- what do we do on systems without urandom()? (E.g. Windows?)<br>

</div>Windows has CryptGenRandom which is approximately equivalent.<br>

<div class="im"><br>

&gt; Let&#39;s worry about timing attacks another time okay?<br>

</div>Agreed. As long as there isn&#39;t a gaping hole, I&#39;m fine with that.<br>

<div class="im"><br>

&gt; Hm. I&#39;m not sure I like the idea of extra arithmetic for every character<br>

&gt; being hashed.<br>

<br>

</div>From a performance standpoint, this may still be better than adding 8<br>

or 10 characters to every single hash operation, since most hashes are<br>

over short strings.</blockquote><div><br>But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.<br><br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


 It is important that this function touches every<br>

character - if it only interacts with a subset of them, an attacker<br>

can fix that subset and vary the rest.<br><div class="im"></div></blockquote><div><br>I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings.<br>


 </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im">

&gt; But I like the idea of a bigger random seed from which we<br>

&gt; deterministically pick some part.<br>

<br>

</div>Yeah. This makes it much harder to attack, since it very solidly<br>

places the attacker outside the realm of &quot;just brute force the key&quot;.<br>

<div class="im"><br>

&gt; How about just initializing x to some<br>

&gt; subsequence of the seed determined by e.g. the length of the hashed string<br>

&gt; plus a few characters from it?<br>

<br>

</div>We did consider this, and if performance is absolutely the prime<br>

directive, this (or a variant) may be the best option. Unfortunately,<br>

the collision generator doesn&#39;t necessarily vary the length of the<br>

string. Additionally, if we don&#39;t vary based on all the letters in the<br>

string, an attacker can fix the characters that we do use and generate<br>

colliding strings around them.<br></blockquote><div><br>Still, much more work for the attacker.<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


Another option to consider would be to apply this change to some but<br>

not all of the rounds. If we added the seed lookup xor operation for<br>

only the first and last 5 values of x, we would still retain much of<br>

the benefit without adding much computational overhead for very long<br>

strings.<br></blockquote><div><br>I like that. <br></div><div> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

We could also consider a less computationally expensive operation than<br>

the modulo for calculating the lookup index, like simply truncating to<br>

the correct number of bits.<span class="HOEnZb"></span><br></blockquote></div><br>Sure. Thanks for thinking about all the details here!!<br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>