Am 29.12.2011 03:09, schrieb Raymond Hettinger:
FWIW, Uncle Timmy considers the non-randomized hashes to be a virtue. It is believed that they give us better-than-random results for commonly encountered datasets. A change to randomized hashes would have a negative performance impact on those cases.
Also, randomizing the hash wreaks havoc on doctests, book examples not matching actual dict reprs, and on efforts by users to optimize the insertion order into dicts with frequent lookups.
My five cents on the topic:
I totally concur with Raymound. He, Tim and all the others did a fantastic job with the dict implementation and optimization. We shouldn't overreact and mess with the current hashing and dict code just because a well-known and old attack vector pops up again. The dict code is far too crucial for Python's overall performance. However the issue should be documented in our docs.
I've been dealing with web stuff and security for almost a decade. I've seen far worse attack vectors. This one can easily be solved with a couple of lines of Python code. For example Application developers can limit the maximum amount of POST parameters to a sensible amount and limit the length of each key, too. The issue less severe on platforms with 64bit hashes, so it won't affect most people. I think only 32bit Unix and Windows in general (32bit long) are in trouble.
CPython could aid developers with a special subclass of dict. The crucial lookup function is already overwrite-able per dict instance and on subclasses of dict through PyDictObj's struct member PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash). For example specialized subclass could limit the seach for a free slot to n recursions or choose to ignore the hash argument and calculate its own hash of the key.