<div class="gmail_quote">On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan <span dir="ltr">&lt;<a href="mailto:paul@mcmillan.ws">paul@mcmillan.ws</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div class="im">&gt; I&#39;m not too concerned about a 3rd party being able to guess the random seed<br>

&gt; -- this would require much more effort on their part, since they would have<br>

&gt; to generate a new set of colliding keys each time they think they have<br>

&gt; guessed the hash<br>

<br>

</div>This is incorrect. Once an attacker has guessed the random seed, any<br>

operation which reveals the ordering of hashed objects can be used to<br>

verify the answer. JSON responses would be ideal. In fact, an attacker<br>

can do a brute-force attack of the random seed offline. Once they have<br>

the seed, generating collisions is a fast process.<br></blockquote><div><br>Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It&#39;s all a trade-off -- at some point it&#39;ll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced.<br>


 <br></div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

The goal isn&#39;t perfection, but we need to do better than a simple<br>

salt.</blockquote><div><br>Perhaps.<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I propose we modify the string hash function like this:<br>


<br>

<a href="https://gist.github.com/0a91e52efa74f61858b5" target="_blank">https://gist.github.com/0a91e52efa74f61858b5</a><br>

<br>

This code is based on PyPy&#39;s implementation, but the concept is<br>

universal. Rather than choosing a single short random seed per<br>

process, we generate a much larger random seed (r). As we hash, we<br>

deterministically choose a portion of that seed and incorporate it<br>

into the hash process. This modification is a minimally intrusive<br>

change to the existing hash function, and so should not introduce<br>

unexpected side effects which might come from switching to a different<br>

class of hash functions.<br></blockquote><div><br>I&#39;m not sure I understand this. What&#39;s the worry about &quot;a different class of hash functions&quot;? (It may be clear that I do not have a deep mathematical understanding of hash functions.)<br>


 </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I&#39;ve worked through this code with Alex Gaynor, Antoine Pitrou, and<br>

Victor Stinner, and have asked several mathematicians and security<br>

experts to review the concept. The reviewers who have gotten back to<br>

me thus far have agreed that if the initial random seed is not flawed,</blockquote><div><br>I forget -- what do we do on systems without urandom()? (E.g. Windows?)<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


this should not overly change the properties of the hash function, but<br>

should make it quite difficult for an attacker to deduce the necessary<br>

information to predictably cause hash collisions. This function is not<br>

designed to protect against timing attacks, but should be nontrivial<br>

to reverse even with access to timing data.<br></blockquote><div><br>Let&#39;s worry about timing attacks another time okay?<br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


Empirical testing shows that this unoptimized python implementation<br>

produces ~10% slowdown in the hashing of ~20 character strings. This<br>

is probably an acceptable trade off, and actually provides better<br>

performance in the case of short strings than a high-entropy<br>

fixed-length seed prefix.<span class="HOEnZb"></span><br></blockquote></div><br>Hm. I&#39;m not sure I like the idea of extra arithmetic for every character being hashed. But I like the idea of a bigger random seed from which we deterministically pick some part. How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it?<br clear="all">


<br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>