[Python-Dev] Counting collisions for the win

Sun Jan 22 01:19:46 CET 2012

On 20 Jan 2012, at 10:49, Brett Cannon wrote:
> Why can't we have our cake and eat it too?
> 
> Can we do hash randomization in 3.3 and use the hash count solution for bugfix releases? That way we get a basic fix into the bugfix releases that won't break people's code (hopefully) but we go with a more thorough (and IMO correct) solution of hash randomization starting with 3.3 and moving forward. We aren't breaking compatibility in any way by doing this since it's a feature release anyway where we change tactics. And it can't be that much work since we seem to have patches for both solutions. At worst it will make merging commits for those files affected by the patches, but that will most likely be isolated and not a common collision (and less of any issue once 3.3 is released later this year).
> 
> I understand the desire to keep backwards-compatibility, but collision counting could cause an error in some random input that someone didn't expect to cause issues whether they were under a DoS attack or just had some unfortunate input from private data. The hash randomization, though, is only weak if someone is attacked, not if they are just using Python with their own private data.

I agree; it sounds really odd to throw an exception when nothing is actually wrong and there's nothing the caller could do to recover anyway. Rather than throwing an exception, maybe you just reseed the random value for the hash:
 * this would address the security issue someone mentioned about being able to deduce the hash, because if they keep up the attack it'll change anyway
 * for bugfix releases, start off without randomization (seed==0) and start to use it only when the collision count hits the threshold
 * for feature releases, reseeding when you hit a certain threshold still seems like a good idea, as it will make lookups/insertions better in the long run
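
To make the idea above concrete, here is a rough sketch using a toy chained hash table. It is not CPython's dict and not either of the patches under discussion; the class name ReseedingTable, the threshold of 4, the fixed bucket count, and the use of hash((seed, key)) as the seeded hash are all assumptions picked for illustration.

```python
import secrets


class ReseedingTable:
    """Toy chained hash table that reseeds instead of raising on long chains."""

    def __init__(self, nbuckets=8, threshold=4):
        self.nbuckets = nbuckets
        self.threshold = threshold  # hypothetical limit on chain length
        self.seed = 0               # seed == 0 means "no randomization yet"
        self.reseeds = 0
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        if self.seed == 0:
            # Unrandomized: behave exactly like the plain hash.
            return hash(key) % self.nbuckets
        # Assumed mixing step: fold the seed in via tuple hashing.
        return hash((self.seed, key)) % self.nbuckets

    def _reseed(self):
        # Pick a fresh nonzero seed and redistribute every stored item.
        self.seed = secrets.randbits(64) | 1
        self.reseeds += 1
        items = [item for chain in self.buckets for item in chain]
        self.buckets = [[] for _ in range(self.nbuckets)]
        for key, value in items:
            self.buckets[self._index(key)].append((key, value))

    def __setitem__(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        if len(chain) >= self.threshold:
            # Too many collisions: reseed rather than raise.
            self._reseed()
            chain = self.buckets[self._index(key)]
        chain.append((key, value))

    def __getitem__(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)


table = ReseedingTable()
for k in range(0, 40, 8):  # 0, 8, 16, 24, 32 share bucket 0 while seed == 0
    table[k] = str(k)
```

The caller never sees an error: the fifth colliding insert triggers a reseed, and every key is still retrievable afterwards. A real implementation would also have to deal with resizing and with inputs that keep colliding after a reseed, which this sketch ignores.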

AFAIUI, Python already doesn't guarantee order stability when you insert something into a dictionary, as in the worst case the dictionary has to resize its hash table, and then the order is freshly jumbled again.
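
As a small illustration of that point, here is a toy model in which iteration order is simply slot order. The helper slot_order and the linear-probing scheme are assumptions for the example; CPython's dict uses a different probe sequence, so this only shows the general effect of a resize on slot-based ordering.

```python
def slot_order(keys, size):
    """Return keys in the order a linear-probing table with `size` slots holds them."""
    slots = [None] * size
    for key in keys:
        i = hash(key) % size
        while slots[i] is not None:  # probe forward on collision
            i = (i + 1) % size
        slots[i] = key
    return [k for k in slots if k is not None]


keys = [8, 1, 17]
small = slot_order(keys, 8)   # 8 -> slot 0, 1 -> slot 1, 17 collides and lands in slot 2
large = slot_order(keys, 32)  # 8 -> slot 8, 1 -> slot 1, 17 -> slot 17
print(small)  # [8, 1, 17]
print(large)  # [1, 8, 17]
```

The same three keys come back in a different order once the table has grown, which is the same kind of reshuffle a reseed would cause.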

Just my two cents.

Jared