On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <paul@mcmillan.ws> wrote:
>> Still, it would represent an effort for the attacker of a much greater
>> magnitude than the current attack. It's all a trade-off -- at some point
>> it'll just be easier for the attacker to use some other vulnerability. Also
>> the class of vulnerable servers would be greatly reduced.

> I agree that doing anything is better than doing nothing. If we use
> the earlier suggestion and prepend everything with a fixed-length
> seed, we need quite a bit of entropy (and so a fairly long string) to
> make a lasting difference.

Ah, but the effect of that long string is summarized in a single (32-
or 64-bit) integer.

>> I'm not sure I understand this. What's the worry about "a different class of
>> hash functions"? (It may be clear that I do not have a deep mathematical
>> understanding of hash functions.)

> This was mostly in reference to earlier suggestions of switching to
> cityhash, or using btrees, or other more invasive changes. Python 2.X
> is pretty stable and making large changes like that to the codebase
> can have unpredictable effects.
>
> We know that the current hash function works well (except for this
> specific problem), so it seems like the best fix will be as minimal a
> modification as possible, to avoid introducing bugs.

>> I forget -- what do we do on systems without urandom()? (E.g. Windows?)
>
> Windows has CryptGenRandom, which is approximately equivalent.

>> Let's worry about timing attacks another time, okay?
>
> Agreed. As long as there isn't a gaping hole, I'm fine with that.

>> Hm. I'm not sure I like the idea of extra arithmetic for every character
>> being hashed.

> From a performance standpoint, this may still be better than adding 8
> or 10 characters to every single hash operation, since most hashes are
> over short strings.

But how about precomputing the intermediate value (x)? The hash is
(mostly) doing x = f(x, c) for each c in the input.
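[To make the precomputation idea concrete, here is a rough Python model of the 2.x string hash, ignoring the hash caching and the -1 special case; all function names here are invented. Hashing seed + s is equivalent to resuming the per-character loop from a state computed once at startup, so the long seed costs nothing per hash.]

```python
MASK = (1 << 64) - 1  # model a 64-bit build

def py2_hash(s):
    # simplified model of CPython 2.x string_hash (non-empty s)
    x = ord(s[0]) << 7
    for c in s:
        x = ((1000003 * x) ^ ord(c)) & MASK
    return x ^ len(s)

def precompute_state(seed):
    # run the loop over the secret seed once, at interpreter startup
    x = ord(seed[0]) << 7
    for c in seed:
        x = ((1000003 * x) ^ ord(c)) & MASK
    return x

def resume_hash(state, s, seed_len):
    # continue the same loop over the user-supplied string
    x = state
    for c in s:
        x = ((1000003 * x) ^ ord(c)) & MASK
    return x ^ (seed_len + len(s))
```

[For any non-empty seed, resume_hash(precompute_state(seed), s, len(seed)) equals py2_hash(seed + s), so the seeded state is indeed a single integer.]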

> It is important that this function touches every character - if it
> only interacts with a subset of them, an attacker can fix that subset
> and vary the rest.

I sort of see your point, but I still think that if we could add as
little per-character overhead as possible it would be best -- sometimes
people *do* hash very long strings.
>> But I like the idea of a bigger random seed from which we
>> deterministically pick some part.

> Yeah. This makes it much harder to attack, since it very solidly
> places the attacker outside the realm of "just brute force the key".

>> How about just initializing x to some subsequence of the seed
>> determined by e.g. the length of the hashed string plus a few
>> characters from it?

> We did consider this, and if performance is absolutely the prime
> directive, this (or a variant) may be the best option. Unfortunately,
> the collision generator doesn't necessarily vary the length of the
> string. Additionally, if we don't vary based on all the letters in the
> string, an attacker can fix the characters that we do use and generate
> colliding strings around them.

Still, much more work for the attacker.
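[The objection can be shown concretely with a hypothetical sketch; init_from_subsequence and SECRET are invented names. If the initial state depends only on the length plus the first and last characters, any two strings that agree on those start from the identical seeded state, so the seed drops out of the attacker's collision search.]

```python
import os

SECRET = os.urandom(64)  # stand-in for the per-process random seed

def init_from_subsequence(s):
    # hypothetical: derive the initial hash state from the length
    # plus the first and last characters only
    i = (len(s) + ord(s[0]) + ord(s[-1])) % len(SECRET)
    return SECRET[i]

# An attacker fixes the length and first/last characters and varies
# only the middle: both strings get the same (unknown but identical)
# initial state, so a collision without the seed stays a collision.
assert init_from_subsequence("aXXXXz") == init_from_subsequence("aYYYYz")
```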

> Another option to consider would be to apply this change to some but
> not all of the rounds. If we added the seed lookup xor operation for
> only the first and last 5 values of x, we would still retain much of
> the benefit without adding much computational overhead for very long
> strings.
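[A sketch of that variant, with everything hypothetical: the table size, the indexing, and where the xor sits in the round are guesses. The seed lookup runs only while i < 5 or i >= n - 5, so the middle of a long string costs the same as the current hash.]

```python
import os

MASK = (1 << 64) - 1
SEED_TABLE = os.urandom(256)  # hypothetical table of random seed bytes

def partial_seeded_hash(s):
    # 2.x-style multiplicative hash, with the seed mixed in for only
    # the first and last 5 rounds (non-empty s)
    x = ord(s[0]) << 7
    n = len(s)
    for i, c in enumerate(s):
        x = (1000003 * x) & MASK
        if i < 5 or i >= n - 5:
            # the seed lookup xor, skipped for the cheap middle rounds
            x ^= SEED_TABLE[x % len(SEED_TABLE)]
        x = (x ^ ord(c)) & MASK
    return x ^ n
```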

I like that.

> We could also consider a less computationally expensive operation than
> the modulo for calculating the lookup index, like simply truncating to
> the correct number of bits.
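[For what it's worth, when the table length is a power of two the two indexing schemes agree exactly for non-negative x, and the mask avoids a division; names below are mine.]

```python
TABLE_SIZE = 256  # must be a power of two for the mask trick to work

def index_mod(x):
    # lookup index via modulo (a division on most hardware)
    return x % TABLE_SIZE

def index_mask(x):
    # truncate to the low 8 bits instead of dividing
    return x & (TABLE_SIZE - 1)
```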

Sure. Thanks for thinking about all the details here!!

--Guido van Rossum (python.org/~guido)