Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced.
I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference.
I'm not sure I understand this. What's the worry about "a different class of hash functions"? (It may be clear that I do not have a deep mathematical understanding of hash functions.)
This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs.
I forget -- what do we do on systems without urandom()? (E.g. Windows?)
Windows has CryptGenRandom which is approximately equivalent.
Let's worry about timing attacks another time okay?
Agreed. As long as there isn't a gaping hole, I'm fine with that.
Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed.
From a performance standpoint, this may still be better than adding 8
or 10 characters to every single hash operation, since most hashes are over short strings. It is important that this function touches every character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest.
But I like the idea of a bigger random seed from which we deterministically pick some part.
Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of "just brute force the key".
How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it?
We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them.
Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings.
We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits.