Hash collision security issue (now public)

In http://mail.python.org/pipermail/python-dev/2011-December/115138.html, Christian Heimes pointed out that
... we don't have to alter the outcome of hash ... We just need to reduce the chance that an attacker can produce collisions in the dict (and set?)
I'll state it more strongly. hash probably should not change (at least for this), but we may want to consider a different conflict resolution strategy when the first slot is already filled. Remember that there was a fair amount of thought and timing effort put into selecting the current strategy; it is deliberately sub-optimal for random input, in order to do better with typical input. < http://hg.python.org/cpython/file/7010fa9bd190/Objects/dictnotes.txt > If there is a change, it would currently be needed in three places for each of set and dict (the lookdict functions and insertdict_clean). It may be worth adding some macros just to keep those six in sync. Once those macros are in place, that allows a compile-time switch. My personal opinion is that accepting *and parsing* enough data for this to be a problem is enough of an edge case that I don't want normal dicts slowed down at all for this; I would therefore prefer that the change be restricted to such a compile-time switch, with current behavior the default. http://hg.python.org/cpython/file/7010fa9bd190/Objects/dictobject.c#l571 583 for (perturb = hash; ep->me_key != NULL; perturb >>= PERTURB_SHIFT) { 584 i = (i << 2) + i + perturb + 1; PERTURB_SHIFT is already a private #define to 5; per dictnotes, 4 and 6 perform almost as well. Someone worried can easily make that change today, and be protected from "generic" anti-python attacks. I believe the salt suggestions have equivalent to replacing perturb = hash; with something like perturb = hash + salt; Changing i = (i << 2) + i + perturb + 1; would allow effectively replacing the initial hash, but risks spoiling performance in the non-adversary case. Would there be objections to replacing those two lines with something like: for (perturb = FIRST_PERTURB(hash, key); ep->me_key != NULL; perturb = NEXT_PERTURB(hash, key, perturb)) { i = NEXT_SLOT(i, perturb); The default macro definitions should keep things as they are #define FIRST_PERTURB(hash, key) hash #define NEXT_PERTURB(hash, key, perturb) perturb >> PERTURB_SHIFT #define NEXT_SLOT(i, perturb) (i << 2) + i + perturb + 1 while allowing #ifdefs for (slower but) safer things like adding a salt, or even using alternative hashes. -jJ

Jim Jewett wrote:
By compile-time, do you mean when the byte-code is compilated, i.e. just before runtime, rather than a switch when compiling the Python executable from source? I will assume so. I'm not a big fan of compile-time (runtime) switches. It makes it too hard to compare before-and-after behaviour within a single session, and impossible to have fine control over which objects have which behaviour. I don't like all-or-nothing settings. (E.g. I'd love to be able to turn -O optimization on and off on a per-function basis, but can't.) How about using a similar strategy to the current dict behaviour with __missing__ and defaultdict? Here's my suggestion: - If a dict subclass defines __salt__, then it is called to salt the hash value before lookups. If __salt__ is undefined or None, the current behaviour remains unchanged. - Add a dict subclass (saltdict, for lack of a better name) that defines __salt__ appropriately to the collections module. In this case, I don't know enough to suggest what is an appropriate salt. I leave that to the security experts to argue about. - Update the relevant standard library modules to use saltdict where needed. This allows a single application or framework to use saltdict where necessary, without slowing down all dict accesses. Dicts which never see user-generated input (e.g. globals) can remain full-speed. If there is no consensus about the best salting strategy, then apps can choose their own by subclassing dict. Responsibility for doing the right thing falls onto the library author, rather than Python itself. Some people may consider that a minus. -- Steven

Am 31.12.2011 03:19, schrieb Steven D'Aprano:
This was my initial proposal, too. It took me a while to figure out that it won't work. Post-salting won't fix the issue. The random seed must be used as IV inside hashing algorithm. My brain was still in holiday mode and it took me a while to figure out the math. Sorry for any confusion! Christian

On 12/30/2011 8:04 PM, Jim Jewett wrote:
I'll state it more strongly. hash probably should not change (at least for this),
I agree, especially since the vulnerability can be avoided by using 64 bit servers and will generally abate as more switch anyway.
It would be good to have a set of attack strings to see how vulernerable Py dicts actually are (Python may not have been actually tested with data) and the affect of any change. I gave the project email of the 2 presenters in my first post. They apparently want to work with language developers to improve defenses against attack. -- Terry Jan Reedy

Jim Jewett wrote:
By compile-time, do you mean when the byte-code is compilated, i.e. just before runtime, rather than a switch when compiling the Python executable from source? I will assume so. I'm not a big fan of compile-time (runtime) switches. It makes it too hard to compare before-and-after behaviour within a single session, and impossible to have fine control over which objects have which behaviour. I don't like all-or-nothing settings. (E.g. I'd love to be able to turn -O optimization on and off on a per-function basis, but can't.) How about using a similar strategy to the current dict behaviour with __missing__ and defaultdict? Here's my suggestion: - If a dict subclass defines __salt__, then it is called to salt the hash value before lookups. If __salt__ is undefined or None, the current behaviour remains unchanged. - Add a dict subclass (saltdict, for lack of a better name) that defines __salt__ appropriately to the collections module. In this case, I don't know enough to suggest what is an appropriate salt. I leave that to the security experts to argue about. - Update the relevant standard library modules to use saltdict where needed. This allows a single application or framework to use saltdict where necessary, without slowing down all dict accesses. Dicts which never see user-generated input (e.g. globals) can remain full-speed. If there is no consensus about the best salting strategy, then apps can choose their own by subclassing dict. Responsibility for doing the right thing falls onto the library author, rather than Python itself. Some people may consider that a minus. -- Steven

Am 31.12.2011 03:19, schrieb Steven D'Aprano:
This was my initial proposal, too. It took me a while to figure out that it won't work. Post-salting won't fix the issue. The random seed must be used as IV inside hashing algorithm. My brain was still in holiday mode and it took me a while to figure out the math. Sorry for any confusion! Christian

On 12/30/2011 8:04 PM, Jim Jewett wrote:
I'll state it more strongly. hash probably should not change (at least for this),
I agree, especially since the vulnerability can be avoided by using 64 bit servers and will generally abate as more switch anyway.
It would be good to have a set of attack strings to see how vulernerable Py dicts actually are (Python may not have been actually tested with data) and the affect of any change. I gave the project email of the 2 presenters in my first post. They apparently want to work with language developers to improve defenses against attack. -- Terry Jan Reedy
participants (4)
-
Christian Heimes
-
Jim Jewett
-
Steven D'Aprano
-
Terry Reedy