But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.
That's a fair point. If we go down that avenue, I think simply choosing a random fixed starting value for x is the correct choice, rather than computing an intermediate value.
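Something like this, as a rough Python sketch -- the FNV-1a-style constants and f are just stand-ins for whatever our real mixing function is, with the published offset basis swapped for a per-process random starting value:

```python
import os

# Illustrative FNV-1a-style constants (assumption: our real f differs).
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

# Random fixed starting value, chosen once at startup instead of the
# published offset basis, so precomputed collisions stop working.
RANDOM_BASIS = int.from_bytes(os.urandom(8), "little")

def hash_bytes(data: bytes, basis: int = RANDOM_BASIS) -> int:
    x = basis
    for c in data:  # x = f(x, c) for each byte of the input
        x = ((x ^ c) * FNV_PRIME) & MASK64
    return x
```

Same per-character cost as before; only the initial value changes.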
I sort of see your point, but I still think we should keep the per-character overhead as low as possible -- sometimes people *do* hash very long strings.
Yep, agreed. My original proposal did not adequately address this.
Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings.
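Roughly like this, as a sketch -- the seed table, its size, and the indexing by the low bits of x are all placeholders for whatever we settle on:

```python
import os

# Illustrative FNV-1a-style constants (assumption: our real f differs).
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

# Hypothetical secret lookup table, filled with random 64-bit values
# at startup.
SEED_TABLE = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]

def hash_bytes(data: bytes) -> int:
    x = 0xcbf29ce484222325  # standard FNV-1a offset basis
    n = len(data)
    for i, c in enumerate(data):
        x = ((x ^ c) * FNV_PRIME) & MASK64
        # Seed lookup xor, applied only for the first and last 5
        # positions; the middle of a long string costs nothing extra.
        if i < 5 or i >= n - 5:
            x ^= SEED_TABLE[x & 0xFF]
    return x
```

For inputs of 10 bytes or fewer every round gets the extra xor; beyond that the added cost is a constant 10 lookups regardless of length.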
I like that.
I believe this is a reasonable solution. An attacker could still manipulate the internal state of long strings, but the additional information at both ends should make that difficult to exploit. I'll talk it over with the reviewers.
We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits.
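For a power-of-two table size the two give identical indices, and the bitwise-and avoids a division; a quick sketch (table size assumed, not decided):

```python
TABLE_SIZE = 256  # must be a power of two for the equivalence to hold

def index_mod(x: int) -> int:
    return x % TABLE_SIZE

def index_trunc(x: int) -> int:
    # Truncate to the low 8 bits; same result, no division.
    return x & (TABLE_SIZE - 1)
```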
Sure. Thanks for thinking about all the details here!!
Again, I'll talk to the reviewers (and run the randomness test battery) to double-check that this doesn't affect the distribution in some unexpected way, but I think it should be fine.
PS. Is the collision-generator used in the attack code open source?
Not in working form, and they've turned down requests for it from other projects that want to check their work. If it's imperative that we have one, I can write one, but I'd rather not spend the effort if we don't need it.