[issue13703] Hash collision security issue

Marc-Andre Lemburg report at bugs.python.org
Thu Jan 5 10:01:15 CET 2012


Marc-Andre Lemburg <mal at egenix.com> added the comment:

Paul McMillan wrote:
> 
> This is not something that can be fixed by limiting the size of POST/GET. 
> 
> Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format shifting tool) can't convert, but are otherwise perfectly valid for non-python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.
> 
> Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.
> 
> Any non-trivial python application which parses data needs the fix. The entire standard library needs the fix if is to be relied upon by applications which accept data. It makes sense to fix Python.

Agreed: Limiting the size of POST requests only applies to *web* applications.
Other applications will need other fixes.
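
To see the effect Paul is describing in isolation (a toy sketch with
a deliberately constant __hash__, not the actual attack on the string
hash), compare dict insertion times for well-distributed vs. fully
colliding keys:

    # Toy illustration: keys whose __hash__ always collides turn dict
    # insertion from amortized O(1) into O(n) per key, i.e. O(n**2)
    # overall -- the effect an attacker gets with colliding str/int keys.
    import time

    class CollidingKey:
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 42        # every instance lands in the same bucket
        def __eq__(self, other):
            return isinstance(other, CollidingKey) and self.value == other.value

    def time_inserts(n, make_key):
        keys = [make_key(i) for i in range(n)]
        start = time.time()
        d = {}
        for k in keys:
            d[k] = None
        return time.time() - start

    for n in (1000, 2000, 4000):
        print("n=%d: normal %.4fs, colliding %.4fs"
              % (n, time_inserts(n, int), time_inserts(n, CollidingKey)))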

Trying to fix the problem in general by tweaking the hash function so
that it is (apparently) hard for an attacker to guess a good set of
colliding strings/integers/etc. is not really a good solution. You'd
only be making it harder for script kiddies, but as soon as someone
cryptanalyzes the hash algorithm in use, you're lost again.

You'd need to use crypto hash functions or universal hash functions
if you want to achieve good security, but that's not an option for
Python objects, since the hash functions need to be as fast as possible
(which rules out crypto hash functions) and cannot easily drop the invariant
"a=b => hash(a)=hash(b)" (which rules out universal hash functions, AFAICT).

IMO, the strategy of simply capping the number of allowed collisions
is a better way to achieve protection against this particular resource
exhaustion attack. The probability of valid data ever reaching such a
limit is low and, if the limit is made configurable, can be brought
down to 0.
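
To make the idea concrete, here's a toy open-addressing table that
enforces such a cap -- just a sketch of the concept in pure Python;
the class name and the limit of 1000 are made up, and the real change
would of course have to live in dictobject.c:

    class TooManyCollisions(RuntimeError):
        pass

    class CappedDict:
        """Toy open-addressing table that refuses to probe more than
        MAX_COLLISIONS slots for a single key -- a sketch of the idea,
        not CPython's dict implementation."""

        MAX_COLLISIONS = 1000          # would be configurable in practice

        def __init__(self):
            self._slots = [None] * 8   # size kept at a power of two

        def _probe(self, key):
            mask = len(self._slots) - 1
            i = hash(key) & mask
            collisions = 0
            while True:
                slot = self._slots[i]
                if slot is None or slot[0] == key:
                    return i
                collisions += 1
                if collisions > self.MAX_COLLISIONS:
                    raise TooManyCollisions("key needed more than %d probes"
                                            % self.MAX_COLLISIONS)
                i = (i + 1) & mask     # simple linear probing

        def __setitem__(self, key, value):
            if sum(s is not None for s in self._slots) * 3 >= 2 * len(self._slots):
                self._resize()
            self._slots[self._probe(key)] = (key, value)

        def __getitem__(self, key):
            slot = self._slots[self._probe(key)]
            if slot is None:
                raise KeyError(key)
            return slot[1]

        def _resize(self):
            items = [s for s in self._slots if s is not None]
            self._slots = [None] * (2 * len(self._slots))
            for key, value in items:
                self._slots[self._probe(key)] = (key, value)

Ordinary data never gets anywhere near such a limit, while an attacker
feeding thousands of colliding keys trips it immediately and gets an
exception instead of burning CPU.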

> Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many. 

... not in Python itself, but if you consider all the types in Python
extensions and classes implementing __hash__ in user code, the number
of hash functions to fix quickly becomes unmanageable.
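
For example, the common (and usually recommended) pattern of deriving
__hash__ from the built-in hashes means such classes silently inherit
whatever weaknesses the str/int hash has -- and they can't all be
audited or patched individually (made-up Tag class for illustration):

    class Tag:
        """Typical user code: equality and hash derived from str fields."""
        def __init__(self, name, value):
            self.name = name
            self.value = value
        def __eq__(self, other):
            return (isinstance(other, Tag) and
                    (self.name, self.value) == (other.name, other.value))
        def __hash__(self):
            # Delegates to the built-in tuple/str hash: a set of colliding
            # strings produces a set of colliding Tag instances as well.
            return hash((self.name, self.value))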

> Marc-Andre:
> If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create colliding hash functions without knowing the key.

Could you post it on the ticket?

BTW: I wonder how long it's going to take before someone figures out
that our merge-sort-based list.sort() is vulnerable as well... its
worst-case performance is O(n log n), which at least makes attacks
somewhat harder. The popular quicksort, which Python used for a long
time, has O(n²) worst-case behavior, making it much easier to attack,
but fortunately we replaced it with merge sort in Python 2.3, before
anyone noticed ;-)
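
(Toy numbers to back that up -- a naive quicksort with a fixed
first-element pivot, fed already sorted, i.e. attacker-friendly,
input; nothing to do with the actual list.sort() implementation:)

    import random

    def naive_quicksort_comparisons(data):
        """Comparison count for a naive quicksort with first-element
        pivot (iterative, to stay clear of Python's recursion limit)."""
        comparisons = 0
        stack = [list(data)]
        while stack:
            chunk = stack.pop()
            if len(chunk) <= 1:
                continue
            pivot, rest = chunk[0], chunk[1:]
            comparisons += len(rest)
            stack.append([x for x in rest if x < pivot])
            stack.append([x for x in rest if x >= pivot])
        return comparisons

    n = 4000
    print("sorted input:   %d comparisons"
          % naive_quicksort_comparisons(range(n)))
    print("shuffled input: %d comparisons"
          % naive_quicksort_comparisons(random.sample(range(n), n)))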

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________

