[Python-Dev] Hash collision security issue (now public)

Thu Dec 29 03:55:22 CET 2011

Am 29.12.2011 03:09, schrieb Raymond Hettinger:
> FWIW, Uncle Timmy considers the non-randomized hashes to be a virtue.
> It is believed that they give us better-than-random results for commonly
> encountered datasets.  A change to randomized hashes would have a
> negative performance impact on those cases.
> 
> Also, randomizing the hash wreaks havoc on doctests, book examples
> not matching actual dict reprs, and on efforts by users to optimize
> the insertion order into dicts with frequent lookups.

My five cents on the topic:

I totally concur with Raymound. He, Tim and all the others did a
fantastic job with the dict implementation and optimization. We
shouldn't overreact and mess with the current hashing and dict code just
because a well-known and old attack vector pops up again. The dict code
is far too crucial for Python's overall performance. However the issue
should be documented in our docs.

I've been dealing with web stuff and security for almost a decade. I've
seen far worse attack vectors. This one can easily be solved with a
couple of lines of Python code. For example Application developers can
limit the maximum amount of POST parameters to a sensible amount and
limit the length of each key, too. The issue less severe on platforms
with 64bit hashes, so it won't affect most people. I think only 32bit
Unix and Windows in general (32bit long) are in trouble.

CPython could aid developers with a special subclass of dict. The
crucial lookup function is already overwrite-able per dict instance and
on subclasses of dict through PyDictObj's struct member PyDictEntry
*(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash). For example
specialized subclass could limit the seach for a free slot to n
recursions or choose to ignore the hash argument and calculate its own
hash of the key.

Christian