[Python-Dev] Hash collision security issue (now public)

Fri Jan 6 21:06:54 CET 2012

In http://mail.python.org/pipermail/python-dev/2012-January/115350.html,
Mark Shannon wrote:

> The minimal proposed change of seeding the hash from a global value (a
> single memory read and an addition) will have such a minimal performance
> effect that it will be undetectable even on the most noise-free testing
> environment.

(1)  Is it established that this (a single initial add, with no
per-loop operations) would be sufficient?

I thought that was in the gray area of "We don't yet have a known
attack, but there are clearly safer options."

(2)  Even if the direct cost (fetch and add) were free, it might be
expensive in practice.  The current hash function is designed to send
"similar" strings (and similar numbers) to similar hashes.

(2a)  That guarantees they won't (initially) collide, even in very small dicts.
(2b)  It keeps them nearby, which has an effect on cache hits.   The
exact effect (and even direction) would of course depend on the
workload, which makes me distrust micro-benchmarks.

If this were a problem in practice, I could understand accepting a
little slowdown as the price of safety, but ... it isn't.  Even in
theory, the only way to trigger this is to take unreasonable amounts
of user input and turn it directly into an unreasonable number of keys
(as opposed to values, or list elements) placed in the same dict (as
opposed to a series of smaller dicts).

-jJ