<div class="gmail_quote">On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <span dir="ltr"><<a href="mailto:paul@mcmillan.ws">paul@mcmillan.ws</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">> Still, it would represent an effort for the attacker of a much greater<br>
> magnitude than the current attack. It's all a trade-off -- at some point<br>
> it'll just be easier for the attacker to use some other vulnerability. Also<br>
> the class of vulnerable servers would be greatly reduced.<br>
<br>
</div>I agree that doing anything is better than doing nothing. If we use<br>
the earlier suggestion and prepend everything with a fixed-length<br>
seed, we need quite a bit of entropy (and so a fairly long string) to<br>
make a lasting difference.<br><div class="im"></div></blockquote><div><br>Ah, but the effect of that long string is summarized in a single (32- or 64-bit) integer.<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
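</blockquote><div><br>To make that concrete, here is a toy model of the hash loop (not CPython's actual function, which also folds in the first character and the string length): because each round only does x = f(x, c), hashing a long secret prefix is equivalent to starting the loop from a single precomputed integer.<br>

```python
MULT = 1000003        # the multiplier CPython's string hash uses
MASK = (1 << 64) - 1  # model 64-bit wraparound

def step(x, c):
    # one round of the (simplified) hash loop: x = f(x, c)
    return ((x * MULT) ^ c) & MASK

def hash_from(s, x0):
    x = x0
    for c in s.encode():
        x = step(x, c)
    return x

# A long random prefix only changes the starting value of x, so its whole
# effect can be captured once, at interpreter start-up:
prefix = "long-secret-prefix-with-lots-of-entropy"
x0 = hash_from(prefix, 0)  # computed once

s = "user-supplied key"
assert hash_from(prefix + s, 0) == hash_from(s, x0)
```

<br>So however long the seed string is, its entire contribution collapses into the initial x; prepending it need not cost anything per hash call.<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">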
<div class="im">
> I'm not sure I understand this. What's the worry about "a different class of<br>
> hash functions"? (It may be clear that I do not have a deep mathematical<br>
> understanding of hash functions.)<br>
<br>
</div>This was mostly in reference to earlier suggestions of switching to<br>
cityhash, or using btrees, or other more invasive changes. Python 2.X<br>
is pretty stable and making large changes like that to the codebase<br>
can have unpredictable effects.</blockquote><div><br>Agreed.<br> <br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We know that the current hash function<br>
works well (except for this specific problem), so it seems like the<br>
best fix will be as minimal a modification as possible, to avoid<br>
introducing bugs.<br><div class="im"></div></blockquote><div><br>Yup.<br> <br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im">
> I forget -- what do we do on systems without urandom()? (E.g. Windows?)<br>
</div>Windows has CryptGenRandom which is approximately equivalent.<br>
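</blockquote><div><br>For reference, os.urandom already papers over that difference — it reads /dev/urandom on POSIX and goes through CryptGenRandom on Windows — so obtaining the secret seed portably could be as simple as this sketch (int.from_bytes is the modern spelling; 2.x would unpack with struct):<br>

```python
import os

# os.urandom reads /dev/urandom on POSIX and uses CryptGenRandom on
# Windows, so one call gives a portable secret seed at start-up.
SEED_BYTES = os.urandom(16)
SEED = int.from_bytes(SEED_BYTES, "little")  # fold into one integer if desired
```

<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">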
<div class="im"><br>
> Let's worry about timing attacks another time okay?<br>
</div>Agreed. As long as there isn't a gaping hole, I'm fine with that.<br>
<div class="im"><br>
> Hm. I'm not sure I like the idea of extra arithmetic for every character<br>
> being hashed.<br>
<br>
</div>From a performance standpoint, this may still be better than adding 8<br>
or 10 characters to every single hash operation, since most hashes are<br>
over short strings.</blockquote><div><br>But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.<br><br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
It is important that this function touches every<br>
character - if it only interacts with a subset of them, an attacker<br>
can fix that subset and vary the rest.<br><div class="im"></div></blockquote><div><br>I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings.<br>
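<br>For concreteness, a hypothetical sketch of the every-character variant being weighed here — each round XORs in a byte from a random table, so no subset of positions can be fixed independently of the seed:<br>

```python
import os

MULT = 1000003
MASK = (1 << 64) - 1
SEED = os.urandom(256)  # random lookup table, generated once at start-up

def seeded_hash(s):
    # hypothetical sketch: every round mixes in a seed byte chosen by
    # position, so the seed touches every character of the input
    x = 0
    b = s.encode()
    for i, c in enumerate(b):
        x = ((x * MULT) ^ c ^ SEED[i % 256]) & MASK
    return x ^ len(b)
```

<br>This is the expensive end of the design space: one extra XOR and table lookup per character, which is exactly the per-character overhead in question.<br>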
</div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im">
> But I like the idea of a bigger random seed from which we<br>
> deterministically pick some part.<br>
<br>
</div>Yeah. This makes it much harder to attack, since it very solidly<br>
places the attacker outside the realm of "just brute force the key".<br>
<div class="im"><br>
> How about just initializing x to some<br>
> subsequence of the seed determined by e.g. the length of the hashed string<br>
> plus a few characters from it?<br>
<br>
</div>We did consider this, and if performance is absolutely the prime<br>
directive, this (or a variant) may be the best option. Unfortunately,<br>
the collision generator doesn't necessarily vary the length of the<br>
string. Additionally, if we don't vary based on all the letters in the<br>
string, an attacker can fix the characters that we do use and generate<br>
colliding strings around them.<br></blockquote><div><br>Still, much more work for the attacker.<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
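</blockquote><div><br>A sketch of that variant and of the weakness Paul describes (all names hypothetical): the initial x is picked from a large random pool using only the length and a couple of characters, so any two strings agreeing on those get the same starting value.<br>

```python
import os

POOL = os.urandom(1024)  # large random pool, generated once at start-up

def pick_x0(s):
    # hypothetical sketch: choose the initial x from the pool using only
    # the length and the first/last characters of the string
    b = s.encode()
    if not b:
        return 0
    i = (len(b) + b[0] + b[-1]) % (len(POOL) - 8)
    return int.from_bytes(POOL[i:i + 8], "little")

# Fix the length and the first/last character and every such string shares
# one x0, so collisions can again be precomputed against it -- though the
# attacker still has to discover that x0 remotely.
assert pick_x0("aXXXXz") == pick_x0("aYYYYz")
```

<br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">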
Another option to consider would be to apply this change to some but<br>
not all of the rounds. If we added the seed lookup xor operation for<br>
only the first and last 5 values of x, we would still retain much of<br>
the benefit without adding much computational overhead for very long<br>
strings.<br></blockquote><div><br>I like that. <br></div><div> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
We could also consider a less computationally expensive operation than<br>
the modulo for calculating the lookup index, like simply truncating to<br>
the correct number of bits.<span class="HOEnZb"></span><br></blockquote></div><br>Sure. Thanks for thinking about all the details here!!<br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>
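<br>On the truncation point above: when the seed table length is a power of two, masking the low bits gives exactly the same index as the modulo, without the division (sketch; names hypothetical):<br>

```python
TABLE_SIZE = 256               # must be a power of two for the mask trick
x = 0x9E3779B97F4A7C15         # some intermediate hash value

i_mod = x % TABLE_SIZE         # general, but involves a division
i_mask = x & (TABLE_SIZE - 1)  # truncate to the low bits: same index, cheaper
assert i_mod == i_mask
```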