[Python-Dev] Hash collision security issue (now public)

Sun Jan 1 05:31:50 CET 2012

On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan <paul at mcmillan.ws> wrote:

> > I'm not too concerned about a 3rd party being able to guess the random
> seed
> > -- this would require much more effort on their part, since they would
> have
> > to generate a new set of colliding keys each time they think they have
> > guessed the hash
>
> This is incorrect. Once an attacker has guessed the random seed, any
> operation which reveals the ordering of hashed objects can be used to
> verify the answer. JSON responses would be ideal. In fact, an attacker
> can do a brute-force attack of the random seed offline. Once they have
> the seed, generating collisions is a fast process.
>

Still, it would represent an effort for the attacker of a much greater
magnitude than the current attack. It's all a trade-off -- at some point
it'll just be easier for the attacker to use some other vulnerability. Also
the class of vulnerable servers would be greatly reduced.

> The goal isn't perfection, but we need to do better than a simple
> salt.

Perhaps.

> I propose we modify the string hash function like this:
>
> https://gist.github.com/0a91e52efa74f61858b5
>
> This code is based on PyPy's implementation, but the concept is
> universal. Rather than choosing a single short random seed per
> process, we generate a much larger random seed (r). As we hash, we
> deterministically choose a portion of that seed and incorporate it
> into the hash process. This modification is a minimally intrusive
> change to the existing hash function, and so should not introduce
> unexpected side effects which might come from switching to a different
> class of hash functions.
>

I'm not sure I understand this. What's the worry about "a different class
of hash functions"? (It may be clear that I do not have a deep mathematical
understanding of hash functions.)

> I've worked through this code with Alex Gaynor, Antoine Pitrou, and
> Victor Stinner, and have asked several mathematicians and security
> experts to review the concept. The reviewers who have gotten back to
> me thus far have agreed that if the initial random seed is not flawed,

I forget -- what do we do on systems without urandom()? (E.g. Windows?)

> this should not overly change the properties of the hash function, but
> should make it quite difficult for an attacker to deduce the necessary
> information to predictably cause hash collisions. This function is not
> designed to protect against timing attacks, but should be nontrivial
> to reverse even with access to timing data.
>

Let's worry about timing attacks another time okay?

> Empirical testing shows that this unoptimized python implementation
> produces ~10% slowdown in the hashing of ~20 character strings. This
> is probably an acceptable trade off, and actually provides better
> performance in the case of short strings than a high-entropy
> fixed-length seed prefix.
>

Hm. I'm not sure I like the idea of extra arithmetic for every character
being hashed. But I like the idea of a bigger random seed from which we
deterministically pick some part. How about just initializing x to some
subsequence of the seed determined by e.g. the length of the hashed string
plus a few characters from it?

-- 
--Guido van Rossum (python.org/~guido)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20111231/2b2e346d/attachment.html>