[Python-Dev] Counting collisions for the win

Fri Jan 20 12:46:39 CET 2012

2012/1/20 Ivan Kozik <ivan at ludios.org>:
> I'd like to point out that an attacker is not limited to sending just
> one dict full of colliding keys.  Given a 22ms stall ...

The presented attack produces a stall of at least 30 seconds (5
minutes or more if there is no time limit in the application), not
0.022 second. You have to send a lot of requests to produce a DoS if a
single requests just eat 22 ms. I suppose that there are a lot of
other kinds of request than takes much longer than 22 ms, even valid
requests.

> Another issue is that even with a configurable limit, different
> modules can't have their own limits.  One module might want a
> relatively safe raise-at-100, and another module creating massive
> dicts might want raise-at-1000.  How does a developer know whether
> they can raise or lower the limit, given that they use a bunch of
> different modules?

Python becomes really slow when you have more than N collisions
(O(n^2) problem). If an application hits this limit with valid data,
it is time to use another data structure or use a different hash
function. We have to do more tests to choose correctly N, but
according my first tests, it looks like N=1000 is a safe limit.

Marc Andre's patch doesn't count all "collisions", but only
"collisions" requiring to compare objects. When two objects have the
same hash value, the open addressing algorithm searchs a free bucket.
If a bucket is not free but has a different hash value, the objects
are not compared and the collision counter is not incremented. The
limit is only reached when you have N objects having the same hash
value modulo the size of the bucket (hash(str) & DICT_MASK).

When there are not enough empty buckets (it comes before all buckets
are full), Python resizes the dictionary (it does something like size
= size * 2) and so it uses at least one more bit each time than the
dictionary is resized. Collisions are very likely with a small
dictioanry, but becomes more rare each time than the dictionary is
resized. It means that the number of potential collisions (with valid
data) decreases when the dictionary grows. Tell me if I am wrong.