Status of the fix for the hash collision vulnerability

Many people proposed their own ideas to fix the vulnerability, but only three wrote a patch:

- Glenn Linderman proposes to fix the vulnerability by adding a new "safe" dict type (only accepting string keys). His proof-of-concept (SafeDict.py) uses a secret of 64 random bits and uses it to compute the hash of a key.
- Marc-Andre Lemburg proposes to fix the vulnerability directly in dict (for any key type). The patch raises an exception if a lookup causes more than 1000 collisions.
- I propose to fix the vulnerability only in the Unicode hash (not for other types). My patch adds a random secret initialized at startup (it can be disabled or fixed using an environment variable).

I consider that Glenn's proposition is not applicable in practice because all applications and all libraries would have to be patched to use the new "safe" dict type. Some people are concerned about possible regressions introduced by Marc's proposition: his patch may raise an exception for legitimate data. My proposition tries to be "just secure enough" with a low runtime overhead. My patch has become huge (and so backporting is more complex), whereas Marc's patch is very simple and so trivial to backport.

It is still unclear to me whether the fix should be enabled by default for Python < 3.3. Because the overhead (of my patch) is low, I would prefer to enable the fix by default, to protect everyone with a simple Python upgrade. I prefer to explain how to explicitly disable randomized hashing (PYTHONHASHSEED=0) (or how to fix application bugs) to people having trouble with randomized hashing, instead of leaving the hole open by default.

We might change hash() for types other than str, but it looks like web servers are only concerned by dicts with string keys. We may use Paul's hash function if mine is not secure enough. My patch doesn't fix the DoS, it just makes the attack more complex. The attacker cannot pregenerate data for an attack: (s)he first has to compute the hash secret, and then compute hash collisions using the secret. The hash secret is at least 64 bits long (128 bits on a 64-bit system). So I hope that computing collisions requires a lot of CPU time (is slow), to make the attack ineffective with today's computers.

I plan to write a nice patch for Python 3.3, then write a simpler patch for 3.1 and 3.2 (duplicate os.urandom code to keep it unchanged, maybe don't create a new random.c file, maybe don't touch the test suite even though the patch breaks many tests), and finally write patches for Python 2.6 and 2.7.

Details about my patch:
- I tested it on Linux (32 and 64 bits) and Windows (Seven 64 bits)
- a new PYTHONHASHSEED environment variable allows controlling the randomized hash: PYTHONHASHSEED=0 disables the randomized hash completely (restoring the previous behaviour), PYTHONHASHSEED=value uses a fixed seed for processes sharing data and needing the same hash values (multiprocessing users?)
- no overhead on hash(str)
- no startup overhead on Linux
- startup overhead is 10% on Windows (see the issue, I propose another solution with a startup overhead of 1%)

The patch is not done; some tests are still failing because of the randomized hash.

FYI, PHP released version 5.3.9 adding a "max_input_vars directive to prevent attacks based on hash collisions (CVE-2011-4885)". Victor
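For illustration, a minimal Python model of the randomized hash Victor describes (a sketch only, assuming the secret is split into a multiplicative prefix and a XOR suffix, as in the discussion further down; the real patch changes the C implementation of the str hash, and details may differ):

import os

MASK = 2**64 - 1  # assuming a 64-bit build

def randomized_hash(s, prefix, suffix):
    # Toy model: same multiplicative core as CPython's historical string
    # hash (multiplier 1000003), with the random secret mixed in at both ends.
    if not s:
        return 0
    x = prefix ^ ((ord(s[0]) << 7) & MASK)
    for c in s:
        x = ((1000003 * x) & MASK) ^ ord(c)
    return x ^ len(s) ^ suffix

# Each interpreter run would draw a fresh secret at startup:
prefix = int.from_bytes(os.urandom(8), 'little')
suffix = int.from_bytes(os.urandom(8), 'little')
print(randomized_hash("spam", prefix, suffix))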

Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app running into the collision limit are tiny. How pathological does the data need to be before the collision counter triggers? I'd expect *very* pathological. This depends on how the counting is done (I didn't look at MAL's patch), and assuming that increasing the hash table size will generally reduce collisions if items collide but their hashes are different. That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable. --Guido

On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <guido@python.org> wrote:
How pathological do you consider the set {1 << n for n in range(2000)} to be? What about the set: ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)} ? The > 2000 elements of the latter set have only 61 distinct hash values on a 64-bit machine, so there will be over 2000 total collisions involved in creating this set (though admittedly only around 30 collisions per hash value). -- Mark

On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum <guido@python.org> wrote:
Ah, my bad. It looks like the ieee754_powers_of_two is safe---IIUC, it's the number of collisions involved in a single key-set operation that's limited. So a dictionary with keys {1<<n for n in range(2000)} is fine, but a dictionary with keys {1<<(61*n) for n in range(2000)} is not:
I'd still not consider this particularly pathological, though. -- Mark
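Mark's numbers can be checked directly on a 64-bit CPython (3.2 and later), where numeric hashes are reduced modulo the Mersenne prime 2**61 - 1:

ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}
print(len(ieee754_powers_of_two))                     # 2098 elements
print(len({hash(x) for x in ieee754_powers_of_two}))  # 61 distinct hash values

# All 2000 of these keys share a single hash value, because
# 2**61 % (2**61 - 1) == 1, so every insertion probes the same slots:
print({hash(x) for x in {1 << (61 * n) for n in range(2000)}})  # {1}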

Am 14.01.2012 01:37, schrieb Benjamin Peterson:
There are two concerns here:
- is it possible to come up with an example of constructed values that show many collisions in a way that poses a threat? To this, the answer is apparently "yes", and the proposed reaction is to hard-limit the number of collisions accepted by the implementation.
- then, *assuming* such a limitation is in place: is it possible to come up with a realistic application that would break under this limitation?
Mark's example is no such realistic application; instead, it is yet another example demonstrating collisions using constructed values (although the specific example would continue to work fine even under the limitation). A valid counterexample would have to come from a real application, or at least from a scenario that is plausible for a real application. Regards, Martin

Am 13.01.2012 18:08, schrieb Mark Dickinson:
I think this is not a counter-example for the proposed algorithm (at least not in the way I think it should be implemented). Those values may collide on the slot in the set, but they don't collide on the actual hash value. So in order to determine whether the collision limit is exceeded, we shouldn't count colliding slots, but colliding hash values (all of which we will encounter during an insert anyway).
though admittedly only around 30 collisions per hash value.
I do consider the case of hashing integers with only one bit set pathological. However, this can be overcome by factoring the magnitude of the number into the hash as well. Regards, Martin
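A rough sketch of that counting rule, layered on CPython's open-addressing probe sequence (hypothetical code, not MAL's actual patch; the limit and exception type are placeholders):

COLLISION_LIMIT = 1000  # placeholder value

def insert(table, mask, key):
    # table: list of (hash, key) pairs or None; mask == len(table) - 1.
    h = hash(key)
    i = h & mask
    perturb = h
    collisions = 0
    while table[i] is not None:
        stored_hash, stored_key = table[i]
        if stored_hash == h:
            if stored_key == key:
                return  # key already present
            # A true hash collision, as opposed to a mere slot collision.
            collisions += 1
            if collisions > COLLISION_LIMIT:
                raise RuntimeError("too many hash collisions")
        i = (5 * i + 1 + perturb) & mask
        perturb >>= 5
    table[i] = (h, key)

# Under this rule, Mark's sets never trigger the limit: their keys
# collide on slots, but their hash values are all distinct.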

On Thu, 12 Jan 2012 18:57:42 -0800 Guido van Rossum <guido@python.org> wrote:
Breaking due to variable hashing is deterministic: you notice it as soon as you upgrade (and then you use PYTHONHASHSEED to disable variable hashing). That seems better than unpredictable breaking when some legitimate collision chain happens. Regards Antoine.

On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those. FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change. -- --Guido van Rossum (python.org/~guido)

On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum <guido@python.org> wrote:
Agreed. Of the three options Victor listed only one is good.

I don't like *SafeDict*. *-1*. It puts the onus on the coder to always get everything right with regards to data that came from outside the process never ending up hashed in a non-safe dict or set *anywhere*. "Safe" needs to be the default option for all hash tables.

I don't like the "*too many hash collisions*" exception. *-1*. It provides non-deterministic application behavior for data driven applications with no way for them to predict when it'll happen or where and prepare for it. It may work in practice for many applications but is simply odd behavior.

I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be backported to any Python version. It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We *will* provide a flag and/or environment variable that can be set to turn the feature off at their own peril, which they can use in their test harnesses that are stupid enough to use doctests with order dependencies. This approach worked fine for Perl 9 years ago. https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371 -gps

On 14/01/12 12:58, Gregory P. Smith wrote:
For the record:

steve@runes:~$ python -c "print(hash('spam ham'))"
-376510515
steve@runes:~$ jython -c "print(hash('spam ham'))"
2054637885

So it is already the case that Python code that assumes stable hashing is broken.

For what it's worth, I'm not convinced that we should be overly-concerned by "poor saps" (Guido's words) who rely on accidents of implementation regarding hash. We shouldn't break their code unless we have a good reason, but this strikes me as a good reason. The documentation for hash certainly makes no promise about stability, and relying on it strikes me as about as sensible as relying on the stability of error messages.

I'm also not convinced that the option to raise an exception after 1000 collisions actually solves the problem. That relies on the application being re-written to catch the exception and recover from it (how?). Otherwise, all it does is change the attack vector from "cause an indefinite number of hash collisions" to "cause 999 hash collisions followed by crashing the application with an exception", which doesn't strike me as much of an improvement.

+1 on random seeding. Default to on in 3.3+ and default to off in older versions, which allows people to avoid breaking their code until they're ready for it to be broken. -- Steven

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg@krypto.org> wrote:
What an implementation looks like: http://pastebin.com/9ydETTag. Some stuff is still to be filled in, but this is all that is really required: add logic to allow a particular seed to be specified or forced to 0 from the command line or environment; add the logic to grab random bytes; add the autoconf glue to disable it. Done. -gps

I think this statement (and the patch) is wrong. You also need to change the byte string hashing, at least for 2.x. This I consider the biggest flaw in that approach - other people may have written string-like objects which continue to compare equal to a string but now hash differently. Regards, Martin

On Sat, 14 Jan 2012 04:45:57 +0100 martin@v.loewis.de wrote:
They're unlikely to have rewritten the hash algorithm by hand - especially given the caveats wrt. differences between Python integers and C integers. Rather, they would have returned the hash() of the equivalent str or unicode object. Regards Antoine.

See the CHAR_HASH macro in http://hg.python.org/cpython/file/e78f00dbd7ae/Modules/expat/xmlparse.c It's not *that* unlikely that more copies of that algorithm exist. Regards, Martin

FWIW the quick change I pastebin'ed is basically covered by the change already under review in http://bugs.python.org/review/13704/show. I've made my comments and suggestions there.

I looked into Modules/expat/xmlparse.c and it has an odd copy of the old string hash algorithm entirely for its own internal use and its own internal hash table implementations. That module is likely vulnerable to creatively crafted documents for the same reason. With 13704 and the public API it provides to get the random hash seed, that module could simply be updated to use that in its own hash implementation.

As for when to enable it or not, I unfortunately have to agree, despite my wild desires, that we can't turn on the hash randomization change by default in anything prior to 3.3. -gps

On Sat, Jan 14, 2012 at 11:17 AM, Gregory P. Smith <greg@krypto.org> wrote:

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg@krypto.org> wrote:
No, that is not how we usually treat compatibility between bugfix releases. "Your code is already broken" is not an argument to break forcefully what worked (even if by happenstance) before. The difference between CPython and Jython (or between different CPython feature releases) also isn't relevant -- historically we have often bent over backwards to avoid changing behavior that was technically undefined, if we believed it would affect a significant fraction of users. I don't think anyone doubts that this will break lots of code (at least, the arguments I've heard have been "their code is broken", not "nobody does that").

This approach worked fine for Perl 9 years ago.
I don't know what the Perl attitude about breaking undefined behavior between micro versions was at the time. But ours is pretty clear -- don't do it. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum wrote:
I don't know about "lots" of code, but it will break at least one library (or so I'm told): http://mail.python.org/pipermail/python-list/2012-January/1286535.html -- Steven

I don't think that it would be hard to patch this library to use another hash function. It can implement its own hash function, use MD5, SHA1, or anything else. hash() is not stable across Python versions and 32/64 bit systems. Victor 2012/1/15 Hynek Schlawack <hs@ox.cx>:

Am 15.01.2012 15:27, schrieb Victor Stinner:
As I wrote in a reply further down: no, it isn't hard to change this behaviour (and I find the current caching system, which uses hash() on an URL to choose the cache index, braindead to begin with), but, as with all other considerations: the current version of the library, with the default options, depends on hash() being stable for the cache to make any sense at all (and especially with "generic" schemas such as the referenced xml.dtd, caching makes a lot of sense, and not being able to cache _breaks_ applications, as it did mine). This is just something to bear in mind. -- --- Heiko.

On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken.
Given that the doc says "Return the hash value of the object", I do not think we should be so hard-nosed. The above clearly implies that there is such a thing as *the* Python hash value for an object. And indeed, that has been true across many versions. If we had written "Return a hash value for the object, which can vary from run to run", the case would be different. -- Terry Jan Reedy

Terry Reedy, 14.01.2012 06:43:
Just a side note, but I don't think hash() is the right place to document this. Hashing is a protocol in Python, just like indexing or iteration. Nothing keeps an object from changing its hash value due to modification, and that would even be valid in the face of the usual dict lookup invariants if changes are only applied while the object is not referenced by any dict. So the guarantees do not depend on the function hash() and may be even weaker than your above statement. Stefan

On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
You mean we shouldn't document that the hash() of a string will vary per run?
Hashing is a protocol in Python, just like indexing or iteration. Nothing keeps an object from changing its hash value due to modification,
Eh? There's a huge body of cultural awareness that only immutable objects should define a hash, implying that the hash remains constant during the object's lifetime.
And how would you know it isn't?
So the guarantees do not depend on the function hash() and may be even weaker than your above statement.
There are no actual guarantees for hash(), but lots of rules for well-behaved hashes. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum, 15.01.2012 17:10:
No, I mean that the hash() builtin function is not the right place to document the behaviour of a string hash. That should go into the string object documentation. Although, arguably, it may be worth mentioning in the docs of hash() that, in general, hash values of builtin types are bound to the lifetime of the interpreter instance (or entire runtime?) and may change after restarts. I think that's a reasonable restriction to document prominently, even if it will only apply to str for the time being.
Well, if it's an object with a mutable hash then it's up to the application defining that object to make sure it's used in a sensible way. Immutability just makes your life easier. I can imagine that an object gets removed from a dict (say, a cache), modified and then reinserted, and I think it's valid to allow the modification to have an impact on the hash in this case, in order to accommodate any changes to equality comparisons due to the modification.

That being said, it seems that the Python docs actually consider constant hashes a requirement rather than a virtue. http://docs.python.org/glossary.html#term-hashable

"""
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
"""

It also seems to me that the wording "has a hash value which never changes during its lifetime" makes it pretty clear that the lifetime of the hash value is not guaranteed to supersede the lifetime of the object (although that's a rather muddy definition - memory lifetime? or pickle-unpickle as well?). However, this entry in the glossary only seems to have appeared with Py2.6, likely as a result of the abc changes. So it won't help in defending a change to the hash function.
Absolutely. Stefan

On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Lifetime to me means of that specific instance of the object. I would not expect that to survive pickle-unpickle.
Ugh, I really hope there is no code out there depending on the hash function being the same across a pickle and unpickle boundary. Unfortunately the hash function was last changed in 1996 in http://hg.python.org/cpython/rev/839f72610ae1 so it is possible someone somewhere has written code blindly assuming that non-guarantee is true. -gps

On Sun, 15 Jan 2012 17:46:36 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
No, but we can document that *any* hash() value can vary between runs without being specific about which builtin types randomize their hashes right now. Regards Antoine.

On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Actually it will apply to a lot more than str, because the hash of (immutable) compound objects is often derived from the hash of the constituents, e.g. hash of a tuple.
That could be considered valid only in a very abstract, theoretical, non-constructive way, since there is no protocol to detect removal from a dict (and you cannot assume an object is used in only one dict at a time).
Across pickle-unpickle it's not considered the same object. Pickling at best preserves values. However, this entry in the glossary only seems to have appeared with Py2.6,
-- --Guido van Rossum (python.org/~guido)
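Guido's point about compound objects is easy to demonstrate; on an interpreter with str hash randomization enabled, this prints a different value on each run:

# The tuple hash is derived from its elements' hashes, so randomizing
# str hashes also randomizes any tuple (or frozenset) containing them:
print(hash(("spam", "ham")))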

On Sun, Jan 15, 2012 at 9:44 AM, Guido van Rossum <guido@python.org> wrote:
Updating the docs to explicitly clarify this sounds like a good idea. How does this wording sound, to be added to the glossary.rst hashing section? """Hash values may not be stable across Python processes and must not be used for storage or otherwise communicated outside of a single Python session.""" -gps

On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum <guido@python.org> wrote:
Python's dicts are designed to avoid hash conflicts by resizing and keeping the available slots bountiful. 1000 conflicts sounds like a number that couldn't be hit accidentally unless you had a single dict using a terabyte of RAM (i.e. if Titus Brown doesn't object, we're good). The hashes also look to exploit cache locality but that is very unlikely to get one thousand conflicts by chance. If you get that many there is an attack.
The patch counts conflicts on an individual insert and not lifetime conflicts. Looks sane to me.
That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable.
Agreed. Paranoid people can turn the behavior off and if it ever were to become a problem in practice we could point people to a solution. -Jack

On Sat, Jan 14, 2012 at 4:24 PM, Jack Diederich <jackdied@gmail.com> wrote:
Having a hard limit on the worst-case behaviour certainly sounds like an attractive prospect. And there's nothing to worry about in terms of secrecy or sufficient randomness - by default, attackers cannot generate more than 1000 hash collisions in one lookup, period.
Does MAL's patch allow the limit to be set on a per-dict basis (including setting it to None to disable collision limiting completely)? If people have data sets that need to tolerate that kind of collision level (and haven't already decided to move to a data structure other than the builtin dict), then it may make sense to allow them to remove the limit when using trusted input. For maintenance versions though, it would definitely need to be possible to switch it off without touching the code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Jack Diederich writes:
I may be missing something, but AIUI, with the resize, the search for an unused slot after collision will be looking in a different series of slots, so the N counter for the N^2 behavior resets on resize. If not, you can delete this message now.

If so, since (a) in the error-on-many-collisions approach we're adding a test here for collision count anyway and (b) we think this is almost never gonna happen, can't we defuse the exploit by just resizing the dict after 1000 collisions, with strictly better performance than the error approach, and almost current performance for "normal" input?

In order to prevent attackers from exploiting every 1000th collision to force out-of-memory, the expansion factor for collision-induced resizing could be "very small". (I don't know if that's possible in the Python dict implementation; if the algorithm requires something like doubling the dict size on every resize this is right out, of course.) Or, since this is an error/rare path anyway, offer the user a choice of an error or a resize on hitting 1000 collisions?

Am 13.01.2012 02:24, schrieb Victor Stinner:
Unfortunately it requires only a few seconds to compute enough 32bit collisions on one core with no precomputed data. I'm sure it's possible to make this less than a second.
In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^ suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is possible. So the question is: How difficult is it to guess the seed? Frank

Unfortunately it requires only a few seconds to compute enough 32bit collisions on one core with no precomputed data.
Are you running the hash function "backward" to generate strings with the same value, or are you trying something more like brute force? And how do you get the hash secret? You need it to run an attack.
My change also adds a prefix (i.e. both a prefix and a suffix). I don't know if that changes anything for generating collisions.
So the question is: How difficult is it to guess the seed?
I wrote some remarks about that in the issue. For example: (hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 1000003) & HASH_MASK) ^ ((prefix * 1000003**2) & HASH_MASK). In practice, I suppose that you don't directly get the full output of hash(str), but hash(str) & DICT_MASK, where DICT_MASK is the size of the internal dict array minus 1. For example, for a dictionary of 65,536 items, the mask is 0x1ffff and so cannot give you more than 17 bits of hash(str) output. I still don't know how difficult it is to retrieve hash(str) bits from repr(dict). Victor
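Victor's identity can be checked against a toy model of the randomized hash (a sketch only, assuming a 64-bit mask and a secret split into a multiplicative prefix and a XOR suffix):

import os

MASK = 2**64 - 1

def randomized_hash(s, prefix, suffix):
    # Toy model of the randomized string hash, not the actual C code.
    if not s:
        return 0
    x = prefix ^ ((ord(s[0]) << 7) & MASK)
    for c in s:
        x = ((1000003 * x) & MASK) ^ ord(c)
    return x ^ len(s) ^ suffix

prefix = int.from_bytes(os.urandom(8), 'little')
suffix = int.from_bytes(os.urandom(8), 'little')

lhs = (randomized_hash("\0", prefix, suffix) ^ 1) ^ (randomized_hash("\0\0", prefix, suffix) ^ 2)
rhs = ((prefix * 1000003) & MASK) ^ ((prefix * 1000003**2) & MASK)
assert lhs == rhs  # the suffix cancels out; only prefix information leaks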

If you try it brute force to hit a specific target, you'll find only one good string every 4 billion tries. That's why you first blow up your target: you start backward from an arbitrary target value. You brute-force 3 characters, for example; this will give you 16 million intermediate values from which you know that they'll end up in your target value. Those 16 million values are now a huge target for brute-forcing forward: every 256 tries you'll hit one of these values.
And how do you get the hash secret? You need it to run an attack.
I don't know. This was meant as an answer to the quoted text "So I hope that computing collisions requires a lot of CPU time (is slow) to make the attack ineffective with today computers.". What I wanted to say is: The security relies on the fact that the attacker can't guess the prefix, not that he can't precompute the values and it takes hours or days to compute the collisions. If the prefix leaks out of the application, then the rest is trivial and done in a few seconds. The suffix is not important for the collision-prevention, but it will probably make it much harder to guess the prefix. I don't know an effective way to get the prefix either, (if the application doesn't leak full hash(X) values). Frank
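The backward computation Frank describes works because the per-character step of the unseeded hash is invertible; a sketch of that step with the 64-bit variant of the multiplier (illustrative only; pow(x, -1, m) needs Python 3.8+):

MASK = 2**64 - 1
INV = pow(1000003, -1, 2**64)  # modular inverse of the (odd) multiplier

def step_forward(x, c):
    return ((1000003 * x) & MASK) ^ c

def step_backward(x_next, c):
    # Undo one character: recover the state before c was mixed in.
    return ((x_next ^ c) * INV) & MASK

x = 0x123456789ABCDEF0
assert step_backward(step_forward(x, 65), 65) == x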

On Fri, Jan 13, 2012 at 02:24, Victor Stinner <victor.stinner@haypocalc.com> wrote:
This is my preferred solution. The vulnerability is basically only in the dictionary where you keep the form data you get from a request. This solves it easily and nicely. It can also be a separate module installable for Python 2, which many web frameworks still use, so it is practically implementable now, and not in a couple of years. Then again, nothing prevents us from having both this, *and* one of the other solutions. :-) //Lennart

On 2012-01-13 11:20, Lennart Regebro wrote:
The vulnerability is basically only in the dictionary where you keep the form data you get from a request.
I'd have to disagree with this statement. The vulnerability is anywhere that creates a dictionary (or set) from attacker-provided keys. That would include HTTP headers, RFC822-family subheaders and parameters, the environ, input taken from JSON or XML, and so on - and indeed hash collision attacks are not at all web-specific.

The problem with having two dict implementations is that a caller would have to tell libraries that use dictionaries which implementation to use. So for example an argument would have to be passed to json.load[s] to specify whether the input was known-sane or potentially hostile. Any library that could ever use dictionaries to process untrusted input, *or any library that used another library that did*, would have to pass such a flag through, which would quickly get very unwieldy indeed... or else they'd have to just always use safedict, in which case we're in pretty much the same position as we are with changing dict anyway. -- And Clover mailto:and@doxdesk.com http://www.doxdesk.com/ gtalk:chat?jid=bobince@gmail.com

We could mix Marc's collision counter with the SafeDict idea (being able to use a different secret for each dict): use hash(key, secret) (simple example: hash(secret+key)) instead of hash(key) in dict (and set), and change the secret if we have more than N collisions. But it would slow down all dict lookups (dict creation, get, set, del, ...). And getting new random data can also be slow. SafeDict and hash(secret+key) lose the benefit of the cached hash result. Because the hash result depends on an argument, we cannot cache the result anymore, and we have to recompute the hash for each lookup (even if you look up the same key twice or more). Victor
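A sketch of the hash(secret+key) idea at the Python level (the wrapper class is hypothetical, purely for illustration); note the cost Victor describes: the salted hash is recomputed on every call, where str normally caches its own hash:

import os

class SaltedKey:
    # Hypothetical wrapper: the table's secret is mixed into the key's
    # hash, so the hash can no longer be cached on the str object.
    __slots__ = ('key', 'secret')
    def __init__(self, key, secret):
        self.key = key
        self.secret = secret
    def __hash__(self):
        return hash(self.secret + self.key)  # recomputed every lookup
    def __eq__(self, other):
        return isinstance(other, SaltedKey) and self.key == other.key

secret = os.urandom(16).hex()
d = {SaltedKey('user', secret): 'value'}
print(d[SaltedKey('user', secret)])  # 'value'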

On 1/13/2012 5:35 PM, Victor Stinner wrote:
So integrating SafeDict into dict so it could be automatically converted would mean changing the data structures underneath dict. Given that, a technique for hash caching could be created that isn't quite as good as the one in place, but may be less expensive than not caching the hashes. It would also take more space: a second dict, internally, as well as the secret. So once the collision counter reaches some threshold (since there would be a functional fallback, it could be much lower than 1000), the secret is obtained, and the keys are rehashed using hash(secret+key). Now when lookups occur, the object id of the key and the hash of the key are used as the index, and hash(secret+key) is stored as a cached value. This would only benefit lookups by the same object; other objects with the same key value would be recalculated (at least the first time). Some limit on the number of cached values would probably be appropriate. This would add complexity, of course, in trying to save time.

An alternate solution would be to convert a dict to a tree once the number of collisions produces poor performance. Converting to a tree would result in O(log N) instead of O(1) lookup performance, but that is better than the degenerate case of O(N) which is produced by the excessive number of collisions resulting from an attack. This would require new tree code to be included in the core, of course, probably a red-black tree, which stays balanced.

In either of these cases, the conversion is expensive, because a collision threshold must first be reached to determine the need for conversion, so the hash could already contain lots of data. If it were too expensive, the attack could still be effective.

Another solution would be to change the collision code, so that colliding keys don't produce O(N) behavior, but some other behavior. Each colliding entry could convert that entry to a tree of entries, perhaps. This would require no conversion of "bad dicts", and an attack could at worst convert O(1) performance to O(log N).

Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack, when data about the randomness is leaked.

Which will not normally happen. I'm firmly in the camp that believes the random seed can be probed and determined by creatively injecting values and measuring timing of things. But doing that is difficult and time and bandwidth intensive so the per process random hash seed is good enough. There's another elephant in the room here, if you want to avoid this attack use a 64-bit Python build as it uses 64-bit hash values that are significantly more difficult to force a collision on. -gps

btw, Tim's commit message on this one is amusingly relevant. :) http://hg.python.org/cpython/diff/8d2bbbbf2cb9/Objects/dictobject.c On Fri, Jan 13, 2012 at 6:25 PM, Gregory P. Smith <greg@krypto.org> wrote:

Victor Stinner wrote:
Am I missing something? How does this fix the vulnerability? It seems to me that the only thing this does is turn one sort of DoS attack into another sort of DoS attack: hostile users will just cause hash collisions until an exception is raised and the application falls over. Catching these exceptions, and recovering from them (how?), would be the responsibility of the application author. Given that developers are unlikely to ever see 1000 collisions by accident, or even realise that it could happen, I don't expect that many people will do that -- until they personally get bitten. -- Steven

On Sun, Jan 15, 2012 at 2:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
As I understand it, the way the attack works is that a *single* malicious request from the attacker can DoS the server by eating CPU resources while evaluating a massive collision chain induced in a dict by attacker-supplied data. Explicitly truncating the collision chain boots them out almost immediately (likely with a 500 response for an internal server error), so they no longer affect other events, threads and processes on the same machine.

In some ways, the idea is analogous to the way we implement explicit recursion limiting in an attempt to avoid actually blowing the C stack - we take a hard-to-detect-and-hard-to-handle situation (i.e. blowing the C stack or malicious generation of long collision chains in a dict) and replace it with something that is easy to detect and can be handled by normal exception processing (i.e. a recursion depth exception or one reporting an excessive number of slot collisions in a dict lookup).

That then makes the default dict implementation safe from this kind of attack by default, and use cases that are getting that many collisions legitimately can be handled in one of two ways:
- switch to a more appropriate data type (if you're getting that many collisions with benign data, a dict is probably the wrong container to be using)
- offer a mechanism (command line switch or environment variable) to turn the collision limiting off

Now, where you can still potentially run into problems is if a single shared dict is used to store both benign and malicious data - if the malicious data makes it into the destination dict before the exception finally gets triggered, and then benign data also happens to trigger the same collision chain, then yes, the entire app may fall over. However, such an app would have been crippled by the original DoS anyway, since its performance would have been gutted - the collision chain limiting just means it will trigger exceptions for the cases that would have been insanely slow. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

This is only true in the specific attack presented at 28c3. If an attacker can insert data without triggering the attack, it's possible to produce (in the example of a web application) urls that (regardless of the request) always produce pathological behavior. For example, a collection of pathological usernames might make it impossible to list users (and so choose which ones to delete) without resorting to removing the problem data at an SQL level. This is why the "simply throw an error" solution isn't a complete fix. Making portions of an interface unusable for regular users is clearly a bad thing, and is clearly applicable to other types of poisoned data as well. We need to detect collisions and work around them transparently.
We can do better than saying "it would have been broken before, it's broken differently now". The universal hash function idea has merit, and for practical purposes hash randomization would fix this too (since colliding data is only likely to collide within a single process, persistent poisoning is far less feasible). -Paul

On 17 January 2012 09:23, Paul McMillan <paul@mcmillan.ws> wrote:
What if in a pathological collision (e.g. > 1000 collisions), we increased the size of a dict by a small but random amount? It should be transparent, have a negligible speed penalty and maximal reuse of existing code, and should be very difficult to attack, since the dictionary would change size in a (near) non-deterministic manner when being attacked (i.e. the first attack causes a non-deterministic remap, the next attack should fail). It should also have near-zero effect on existing tests and frameworks, since we would only get the non-deterministic behaviour in pathological cases, which we would presumably need new tests for. Thoughts? Tim Delaney

On 17 January 2012 10:14, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
And one thought I had immediately after hitting send is that there could be an attack of the form "build a huge dict, then hit it with something that causes it to rehash due to >1000 collisions". But that's not really going to be any worse than just building a huge dict and hitting a resize anyway. Tim Delaney

2012/1/17 Tim Delaney <timothy.c.delaney@gmail.com>:
What if in a pathological collision (e.g. > 1000 collisions), we increased the size of a dict by a small but random amount?
It doesn't change anything, you will still get collisions. Victor

On Mon, Jan 16, 2012 at 4:16 PM, Victor Stinner < victor.stinner@haypocalc.com> wrote:
That depends right? If the collision is because they all have the same hash(), yes. It might be different if it is because the secondary hashing (or whatever it's called :-) causes collisions. -- --Guido van Rossum (python.org/~guido)

But Python deals with the latter case just fine already. The open hashing approach relies on the dict resizing "enough" to prevent collisions after the dictionary has grown. Unless somebody can demonstrate a counter example, I believe this discussion is a red herring. Plus: if an attacker could craft keys that deliberately cause collisions because of the dictionary size, they could likely also craft keys in the same number that collide on actual hash values, bringing us back to the original problem. Regards, Martin

martin@v.loewis.de writes:
I thought that the original problem was that with N insertions in the dictionary, by repeatedly inserting different keys generating the same hash value an attacker could arrange that the cost of finding an open slot is O(N), and thus the cost of N insertions is O(N^2). If so, frequent resizing could make the attacker's problem much more difficult, as the distribution of secondary probes should change with each resize.

The attack creates 60,000 strings (or more) with exactly the same hash value. A dictionary uses hash(str) & DICT_MASK to compute the bucket index, where DICT_MASK is the number of buckets minus one. If all strings have the same hash value, we always start in the same bucket and the key has to be compared to all previous strings to find the next empty bucket. The attack works because a LOT of strings are compared, and comparing strings is slow. If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but hash(str1)!=hash(str2), the strings are not compared (this is a common optimization in Python), and so the attack would not be successful (it would be slow, but not as slow as comparing two strings). Victor
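The quadratic behaviour can be reproduced without crafted strings: on a 64-bit CPython, integers are hashed modulo 2**61 - 1, so all multiples of that prime share hash value 0 (a sketch for illustration):

import time

P = 2**61 - 1  # hash modulus for ints on a 64-bit CPython
for n in (1000, 2000, 4000):
    keys = [i * P for i in range(n)]  # every key hashes to 0
    start = time.perf_counter()
    dict.fromkeys(keys)
    print(n, time.perf_counter() - start)  # roughly quadruples as n doubles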

Not sure what you mean by "distribution of secondary probes". Let H be the initial hash value, and let MASK be the current size of the dictionary minus one. Then I(n), the sequence of dictionary indices being probed, is computed as

I(0) = H & MASK
PERTURB(0) = H
I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
PERTURB(n+1) = PERTURB(n) >> 5

So if two objects O1 and O2 have the same hash value H, the sequence of probed indices is the same for any MASK value. It will be a different sequence, yes, but they will still collide on each and every slot. This is the very nature of open addressing. If it *wouldn't* try all indices in the probe sequence, it may not be possible to perform the lookup for a key correctly. Regards, Martin
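A direct transcription of Martin's recurrence (illustrative Python): the probe sequence is a function of H and MASK only, so two keys with equal hash follow the same sequence at every table size; resizing reshuffles the slots, but identically for both keys:

def probe_sequence(h, mask, steps=8):
    # Depends only on the hash value and the mask, never on the key
    # itself, so equal hashes collide at every table size.
    i, perturb, seq = h & mask, h, []
    for _ in range(steps):
        seq.append(i)
        i = (5 * i + 1 + perturb) & mask
        perturb >>= 5
    return seq

for mask in (7, 127, 2047):
    print(mask, probe_sequence(12345, mask))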

On 01/17/2012 09:29 PM, "Martin v. Löwis" wrote:
Open addressing can still deploy a collision resolution mechanism without this property. For example, double hashing uses a different hash function (applied to the key) to calculate PERTURB(0). To defeat it, the attacker would have to produce keys that hash the same using both hash functions. Double hashing is not a good general solution for Python dicts because it complicates the interface of hash tables that support arbitrary keys. Still, it could be considered for dicts with known key types (built-ins could hardcode the alternative hash function) or for SafeDicts, if they are still considered. Hrvoje
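A sketch of the double-hashing variant Hrvoje describes, deriving PERTURB(0) from a second hash of the key (h2 is a placeholder; a real implementation would hard-code a second string hash in C):

def probe_sequence_double(key, mask, h2, steps=8):
    i = hash(key) & mask
    perturb = h2(key)  # the second, independent hash seeds the perturbation
    seq = []
    for _ in range(steps):
        seq.append(i)
        i = (5 * i + 1 + perturb) & mask
        perturb >>= 5
    return seq

# Toy second hash, for demonstration only:
h2 = lambda s: hash(s[::-1])
print(probe_sequence_double("example", 127, h2))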

I finished my patch transforming hash(str) into a randomized hash function, see random-8.patch attached to the issue: http://bugs.python.org/issue13703

The remaining question is which random number generator should be used on Windows to initialize the hash secret (CryptGenRandom adds an overhead of 10%, at least when the DLL is loaded dynamically); read the issue for the details.

I plan to commit my fix to Python 3.3 if it is accepted. Then write a simplified version for Python 3.2 and backport it to 3.1. Then backport the simplified fix to 2.7, and finally to 2.6. The vulnerability has been public for one month; it is maybe time to fix it before it is widely exploited. Victor

I plan to commit my fix to Python 3.3 if it is accepted. Then write a simplified version to Python 3.2 and backport it to 3.1.
I'm opposed to any change to the hash values of strings in maintenance releases, so I guess I'm opposed to your patch in principle. See my next message for an alternative proposal.
The vulnerability is public since one month, it is maybe time to fix it before it is widely exploited.
I don't think there is any urgency. The vulnerability has been known for more than five years now. From creating a release to the point where the change actually arrives at end users, many months will pass. Regards, Martin

If randomized hashing cannot be turned on by default, an alternative is to ship it switched off by default, and add an option (command line option, environment variable, etc.) to enable it.
In 2003, Python was not seen as vulnerable, maybe because its hash function is different from Perl's hash function, or because nobody tried to generate collisions. Today it is clear that Python is vulnerable (the 64-bit version is also affected), and it's really fast to generate collisions using the right algorithm. Why is it taking so long to fix the vulnerability in Python, whereas it was fixed quickly in Ruby? (They chose to use a randomized hash.) Victor

That won't really fix the problem. If people install a new release because it fixes a vulnerability, it had better actually do so.
There is the common vulnerability to the threat of confusing threats with vulnerabilities [1]. Python was vulnerable all the time, and nobody claimed otherwise. It's just that nobody saw it as a threat. I still don't see it as a practical threat, as there are many ways that people use in practice to protect against this threat already. But I understand that others feel threatened now.
Why is it so long to fix the vulnerability in Python, whereas it was fixed quickly in Ruby? (they chose to use a randomized hash)
Because the risk of breakage for Python is much higher than it is for Ruby. Regards, Martin [1] http://jps.anl.gov/Volume4_iss2/Paper3-RGJohnston.pdf

On Tue, Jan 17, 2012 at 12:52 PM, "Martin v. Löwis" <martin@v.loewis.de>wrote:
Please at least consider his patch for 3.3 onwards then. Changing the hash seed per interpreter instance / process is the right thing to do going forward. What to do on maintenance releases is a separate discussion. -gps

Am 18.01.2012 07:06, schrieb Gregory P. Smith:
For 3.3 onwards, I'm skeptical whether all this configuration support is really necessary. I think a much smaller patch which leaves no choice would be more appropriate. Regards, Martin

2012/1/18 "Martin v. Löwis" <martin@v.loewis.de>:
The configuration helps unit testing: see the changes to Lib/test/*.py in my last patch. I hesitate to say that the configuration is required for the tests. Anyway, users upgrading from Python 3.2 to 3.3 may need to keep the same hash function and may not care about security (e.g. programs running locally with trusted data). Victor

Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app having pathological the data needs to be before the collision counter triggers? I'd expect *very* pathological. This is depending on how the counting is done (I didn't look at MAL's patch), and assuming that increasing the hash table size will generally reduce collisions if items collide but their hashes are different. That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable. --Guido On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner < victor.stinner@haypocalc.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum <guido@python.org> wrote:
How pathological do you consider the set {1 << n for n in range(2000)} to be? What about the set: ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)} ? The > 2000 elements of the latter set have only 61 distinct hash values on 64-bit machine, so there will be over 2000 total collisions involved in creating this set (though admittedly only around 30 collisions per hash value). -- Mark

On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum <guido@python.org> wrote:
Ah, my bad. It looks like the ieee754_powers_of_two is safe---IIUC, it's the number of collisions involved in a single key-set operation that's limited. So a dictionary with keys {1<<n for n in range(2000)} is fine, but a dictionary with keys {1<<(61*n) for n in range(2000)} is not:
I'd still not consider this particularly pathological, though. -- Mark

Am 14.01.2012 01:37, schrieb Benjamin Peterson:
There are two concerns here: - is it possible to come up with an example of constructed values that show many collisions in a way that poses a threat? To this, the answer is apparently "yes", and the proposed reaction is to hard-limit the number of collisions accepted by the implementation. - then, *assuming* such a limitation is in place: is it possible to come up with a realistic application that would break under this limitation. Mark's example is no such realistic application, instead, it is yet another example demonstrating collisions using constructed values (although the specific example would continue to work fine even under the limitation). A valid counterexample would have to come from a real application, or at least from a scenario that is plausible for a real application. Regards, Martin

Am 13.01.2012 18:08, schrieb Mark Dickinson:
I think this is not a counter-example for the proposed algorithm (at least not in the way I think it should be implemented). Those values may collide on the slot in the set, but they don't collide on the actual hash value. So in order to determine whether the collision limit is exceeded, we shouldn't count colliding slots, but colliding hash values (which we will all encounter during an insert).
though admittedly only around 30 collisions per hash value.
I do consider the case of hashing integers with only one bit set pathological. However, this can be overcome by factoring the magnitude of the number into the hash as well. Regards, Martin

On Thu, 12 Jan 2012 18:57:42 -0800 Guido van Rossum <guido@python.org> wrote:
Breaking due to variable hashing is deterministic: you notice it as soon as you upgrade (and then you use PYTHONHASHSEED to disable variable hashing). That seems better than unpredictable breaking when some legitimate collision chain happens. Regards Antoine.

On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Fair enough. But I'm now uncomfortable with turning this on for bugfix releases. I'm fine with making this the default in 3.3, just not in 3.2, 3.1 or 2.x -- it will break too much code and organizations will have to roll back the release or do extensive testing before installing a bugfix release -- exactly what we *don't* want for those. FWIW, I don't believe in the SafeDict solution -- you never know which dicts you have to change. -- --Guido van Rossum (python.org/~guido)

On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum <guido@python.org> wrote:
Agreed. Of the three options Victor listed only one is good. I don't like *SafeDict*. *-1*. It puts the onerous on the coder to always get everything right with regards to data that came from outside the process never ending up hashed in a non-safe dict or set *anywhere*. "Safe" needs to be the default option for all hash tables. I don't like the "*too many hash collisions*" exception. *-1*. It provides non-deterministic application behavior for data driven applications with no way for them to predict when it'll happen or where and prepare for it. It may work in practice for many applications but is simply odd behavior. I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be back ported to any Python version. It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken. We *will*provide a flag and/or environment variable that can be set to turn the feature off at their own peril which they can use in their test harnesses that are stupid enough to use doctests with order dependencies. This approach worked fine for Perl 9 years ago. https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371 -gps

On 14/01/12 12:58, Gregory P. Smith wrote:
For the record: steve@runes:~$ python -c "print(hash('spam ham'))" -376510515 steve@runes:~$ jython -c "print(hash('spam ham'))" 2054637885 So it is already the case that Python code that assumes stable hashing is broken. For what it's worth, I'm not convinced that we should be overly-concerned by "poor saps" (Guido's words) who rely on accidents of implementation regarding hash. We shouldn't break their code unless we have a good reason, but this strikes me as a good reason. The documentation for hash certainly makes no promise about stability, and relying on it strikes me as about as sensible as relying on the stability of error messages. I'm also not convinced that the option to raise an exception after 1000 collisions actually solves the problem. That relies on the application being re-written to catch the exception and recover from it (how?). Otherwise, all it does is change the attack vector from "cause an indefinite number of hash collisions" to "cause 999 hash collisions followed by crashing the application with an exception", which doesn't strike me as much of an improvement. +1 on random seeding. Default to on in 3.3+ and default to off in older versions, which allows people to avoid breaking their code until they're ready for it to be broken. -- Steven

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg@krypto.org> wrote:
What an implementation looks like: http://pastebin.com/9ydETTag some stuff to be filled in, but this is all that is really required. add logic to allow a particular seed to be specified or forced to 0 from the command line or environment. add the logic to grab random bytes. add the autoconf glue to disable it. done. -gps

I think this statement (and the patch) is wrong. You also need to change the byte string hashing, at least for 2.x. This I consider the biggest flaw in that approach - other people may have written string-like objects which continue to compare equal to a string but now hash different. Regards, Martin

On Sat, 14 Jan 2012 04:45:57 +0100 martin@v.loewis.de wrote:
They're unlikely to have rewritten the hash algorithm by hand - especially given the caveats wrt. differences between Python integers and C integers. Rather, they would have returned the hash() of the equivalent str or unicode object. Regards Antoine.

See the CHAR_HASH macro in http://hg.python.org/cpython/file/e78f00dbd7ae/Modules/expat/xmlparse.c It's not *that* unlikely that more copies of that algorithm exist. Regards, Martin

FWIW the quick change i pastebin'ed is basically covered by the change already under review in http://bugs.python.org/review/13704/show. I've made my comments and suggestions there. I looked into Modules/expat/xmlparse.c and it has an odd copy of the old string hash algorithm entirely for its own internal use and its own internal hash table implementations. That module is likely vulnerable to creatively crafted documents for the same reason. With 13704 and the public API it provides to get the random hash seed, that module could simply be updated to use that in its own hash implementation. As for when to enable it or not, I unfortunately have to agree, despite my wild desires we can't turn on the hash randomization change by default in anything prior to 3.3. -gps On Sat, Jan 14, 2012 at 11:17 AM, Gregory P. Smith <greg@krypto.org> wrote:

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith <greg@krypto.org> wrote:
No, that is not how we usually take compatibility between bugfix releases. "Your code is already broken" is not an argument to break forcefully what worked (even if by happenstance) before. The difference between CPython and Jython (or between different CPython feature releases) also isn't relevant -- historically we have often bent over backwards to avoid changing behavior that was technically undefined, if we believed it would affect a significant fraction of users. I don't think anyone doubts that this will break lots of code (at least, the arguments I've heard have been "their code is broken", not "nobody does that"). This approach worked fine for Perl 9 years ago.
I don't know what the Perl attitude about breaking undefined behavior between micro versions was at the time. But ours is pretty clear -- don't do it. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum wrote:
I don't know about "lots" of code, but it will break at least one library (or so I'm told): http://mail.python.org/pipermail/python-list/2012-January/1286535.html -- Steven

I don't think that it would be hard to patch this library to use another hash function. It can implement its own hash function, use MD5, SHA1, or anything else. hash() is not stable accross Python versions and 32/64 bit systems. Victor 2012/1/15 Hynek Schlawack <hs@ox.cx>:

Am 15.01.2012 15:27, schrieb Victor Stinner:
As I wrote in a reply further down: no, it isn't hard to change this behaviour (and I find the current caching system, which uses hash() on an URL to choose the cache index, braindead to begin with), but, as with all other considerations: the current version of the library, with the default options, depends on hash() to be stable for the cache to make any sense at all (and especially with "generic" schema such as the referenced xml.dtd, caching makes a lot of sense, and not being able to cache _breaks_ applications as it did mine). This is juts something to bear in mind. -- --- Heiko.

On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
It is perfectly okay to break existing users who had anything depending on ordering of internal hash tables. Their code was already broken.
Given that the doc says "Return the hash value of the object", I do not think we should be so hard-nosed. The above clearly implies that there is such a thing as *the* Python hash value for an object. And indeed, that has been true across many versions. If we had written "Return a hash value for the object, which can vary from run to run", the case would be different. -- Terry Jan Reedy

Terry Reedy, 14.01.2012 06:43:
Just a side note, but I don't think hash() is the right place to document this. Hashing is a protocol in Python, just like indexing or iteration. Nothing keeps an object from changing its hash value due to modification, and that would even be valid in the face of the usual dict lookup invariants if changes are only applied while the object is not referenced by any dict. So the guarantees do not depend on the function hash() and may be even weaker than your above statement. Stefan

On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
You mean we shouldn't document that the hash() of a string will vary per run?
Hashing is a protocol in Python, just like indexing or iteration. Nothing keeps an object from changing its hash value due to modification,
Eh? There's a huge body of cultural awareness that only immutable objects should define a hash, implying that the hash remains constant during the object's lifetime.
And how would you know it isn't?
So the guarantees do not depend on the function hash() and may be even weaker than your above statement.
There are no actual guarantees for hash(), but lots of rules for well-behaved hashes. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum, 15.01.2012 17:10:
No, I mean that the hash() builtin function is not the right place to document the behaviour of a string hash. That should go into the string object documentation. Although, arguably, it may be worth mentioning in the docs of hash() that, in general, hash values of builtin types are bound to the lifetime of the interpreter instance (or entire runtime?) and may change after restarts. I think that's a reasonable restriction to document that prominently, even if it will only apply to str for the time being.
Well, if it's an object with a mutable hash then it's up to the application defining that object to make sure it's used in a sensible way. Immutability just makes your life easier. I can imagine that an object gets removed from a dict (say, a cache), modified and then reinserted, and I think it's valid to allow the modification to have an impact on the hash in this case, in order to accommodate for any changes to equality comparisons due to the modification. That being said, it seems that the Python docs actually consider constant hashes a requirement rather than a virtue. http://docs.python.org/glossary.html#term-hashable """ An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value. """ It also seems to me that the wording "has a hash value which never changes during its lifetime" makes it pretty clear that the lifetime of the hash value is not guaranteed to supersede the lifetime of the object (although that's a rather muddy definition - memory lifetime? or pickle-unpickle as well?). However, this entry in the glossary only seems to have appeared with Py2.6, likely as a result of the abc changes. So it won't help in defending a change to the hash function.
Absolutely. Stefan

On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Lifetime to me means of that specific instance of the object. I would not expect that to survive pickle-unpickle.
Ugh, I really hope there is no code out there depending on the hash function being the same across a pickle and unpickle boundary. Unfortunately the hash function was last changed in 1996 in http://hg.python.org/cpython/rev/839f72610ae1 so it is possible someone somewhere has written code blindly assuming that non-guarantee is true. -gps

On Sun, 15 Jan 2012 17:46:36 +0100 Stefan Behnel <stefan_ml@behnel.de> wrote:
No, but we can document that *any* hash() value can vary between runs without being specific about which builtin types randomize their hashes right now. Regards Antoine.

On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Actually it will apply to a lot more than str, because the hash of (immutable) compound objects is often derived from the hash of the constituents, e.g. hash of a tuple.
That could be considered valid only in a very abstract, theoretical, non-constructive way, since there is no protocol to detect removal from a dict (and you cannot assume an object is used in only one dict at a time).
Across pickle-unpickle it's not considered the same object. Pickling at best preserves values. However, this entry in the glossary only seems to have appeared with Py2.6,
-- --Guido van Rossum (python.org/~guido)

On Sun, Jan 15, 2012 at 9:44 AM, Guido van Rossum <guido@python.org> wrote:
Updating the docs to explicitly clarify this sounds like a good idea. How does this sound as wording to be added to the hashing section of glossary.rst? """Hash values may not be stable across Python processes and must not be used for storage or otherwise communicated outside of a single Python session.""" -gps

On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum <guido@python.org> wrote:
Python's dicts are designed to avoid hash conflicts by resizing and keeping the available slots bountiful. 1000 conflicts sounds like a number that couldn't be hit accidentally unless you had a single dict using a terabyte of RAM (i.e. if Titus Brown doesn't object, we're good). The hash probing also exploits cache locality, but it is very unlikely to produce one thousand conflicts by chance. If you get that many, there is an attack.
The patch counts conflicts on an individual insert and not lifetime conflicts. Looks sane to me.
That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable.
Agreed. Paranoid people can turn the behavior off and if it ever were to become a problem in practice we could point people to a solution. -Jack

On Sat, Jan 14, 2012 at 4:24 PM, Jack Diederich <jackdied@gmail.com> wrote:
Having a hard limit on the worst-case behaviour certainly sounds like an attractive prospect. And there's nothing to worry about in terms of secrecy or sufficient randomness - by default, attackers cannot generate more than 1000 hash collisions in one lookup, period.
Does MAL's patch allow the limit to be set on a per-dict basis (including setting it to None to disable collision limiting completely)? If people have data sets that need to tolerate that kind of collision level (and haven't already decided to move to a data structure other than the builtin dict), then it may make sense to allow them to remove the limit when using trusted input. For maintenance versions though, it would definitely need to be possible to switch it off without touching the code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Jack Diederich writes:
I may be missing something, but AIUI, with the resize, the search for an unused slot after collision will be looking in a different series of slots, so the N counter for the N^2 behavior resets on resize. If not, you can delete this message now. If so, since (a) in the error-on-many-collisions approach we're adding a test here for collision count anyway and (b) we think this is almost never gonna happen, can't we defuse the exploit by just resizing the dict after 1000 collisions, with strictly better performance than the error approach, and almost current performance for "normal" input? In order to prevent attackers from exploiting every 1000th collision to force out-of-memory, the expansion factor for collision-induced resizing could be "very small". (I don't know if that's possible in the Python dict implementation; if the algorithm requires something like doubling the dict size on every resize, this is right out, of course.) Or, since this is an error/rare path anyway, offer the user a choice of an error or a resize on hitting 1000 collisions?

On 13.01.2012 02:24, Victor Stinner wrote:
Unfortunately it requires only a few seconds to compute enough 32bit collisions on one core with no precomputed data. I'm sure it's possible to make this less than a second.
In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^ suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is possible. So the question is: How difficult is it to guess the seed? Frank

Unfortunately it requires only a few seconds to compute enough 32bit collisions on one core with no precomputed data.
Are you running the hash function "backward" to generate strings with the same value, or are you trying something more like brute force? And how do you get the hash secret? You need it to run an attack.
My change also adds a prefix (a prefix and a suffix). I don't know if it changes anything for generating collisions.
So the question is: How difficult is it to guess the seed?
I wrote some remarks about that in the issue. For example:

    (hash("\0")^1) ^ (hash("\0\0")^2)

gives

    ((prefix * 1000003) & HASH_MASK) ^ ((prefix * 1000003**2) & HASH_MASK)

I suppose that you don't directly have the full output of hash(str) in practice, but hash(str) & DICT_MASK, where DICT_MASK is the size of the internal dict array minus 1. For example, for a dictionary of 65,536 items, the mask is 0x1ffff and so cannot give you more than 17 bits of hash(str) output. I still don't know how difficult it is to retrieve hash(str) bits from repr(dict). Victor
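A small pure-Python model makes this identity concrete (editor's sketch: model_hash is a simplified stand-in for the randomized hash discussed in this thread - prefix mixed in at the start, suffix XORed in at the end - not the code from the actual patch):

    MASK = 2**32 - 1  # 32-bit build

    def model_hash(s, prefix, suffix):
        # Simplified model of the randomized string hash; assumes a
        # non-empty string.
        x = prefix ^ (ord(s[0]) << 7)
        for c in s:
            x = ((x * 1000003) & MASK) ^ ord(c)
        return x ^ len(s) ^ suffix

    import random
    prefix = random.getrandbits(32)
    suffix = random.getrandbits(32)

    leak = (model_hash("\0", prefix, suffix) ^ 1) ^ \
           (model_hash("\0\0", prefix, suffix) ^ 2)
    expected = ((prefix * 1000003) & MASK) ^ ((prefix * 1000003**2) & MASK)
    # XORing the len-corrected hashes cancels the secret suffix;
    # what remains depends only on the prefix.
    assert leak == expected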

If you try brute force to hit a specific target, you'll find only one good string every 4 billion tries. That's why you first blow up your target: You start backward from an arbitrary target-value. You brute force for 3 characters, for example; this will give you 16 million intermediate values from which you know that they'll end up in your target-value. Those 16 million values are now a huge target for brute-forcing forward: Every 256 tries you'll hit one of these values.
And how do you get the hash secret? You need it to run an attack.
I don't know. This was meant as an answer to the quoted text "So I hope that computing collisions requires a lot of CPU time (is slow) to make the attack ineffective with today computers.". What I wanted to say is: The security relies on the fact that the attacker can't guess the prefix, not that he can't precompute the values and it takes hours or days to compute the collisions. If the prefix leaks out of the application, then the rest is trivial and done in a few seconds. The suffix is not important for the collision-prevention, but it will probably make it much harder to guess the prefix. I don't know an effective way to get the prefix either, (if the application doesn't leak full hash(X) values). Frank
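The "backward" computation Frank describes works because the multiplicative hash step is invertible modulo 2**32 (editor's sketch of just that step; the pow(x, -1, m) form of the modular inverse needs a recent Python):

    MASK = 2**32 - 1
    INV = pow(1000003, -1, 2**32)   # modular inverse of the multiplier

    def step_forward(x, c):
        # one iteration of the hash loop
        return ((x * 1000003) & MASK) ^ c

    def step_back(x, c):
        # undo one iteration: recover the previous internal state
        return ((x ^ c) * INV) & MASK

    x0 = 0xDEADBEEF
    assert step_back(step_forward(x0, ord("a")), ord("a")) == x0

    # Meet-in-the-middle, as Frank outlines it: stepping back 3 characters
    # from one target value yields ~16 million intermediate states, and a
    # forward brute force then hits one of them roughly every 256 tries.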

On Fri, Jan 13, 2012 at 02:24, Victor Stinner <victor.stinner@haypocalc.com> wrote:
This is my preferred solution. The vulnerability is basically only in the dictionary you keep the form data you get from a request. This solves it easily and nicely. It can also be a separate module installable for Python 2, which many web frameworks still use, so it is practically implementable now, and not in a couple of years. Then again, nothing prevents us from having both this, *and* one of the other solutions. :-) //Lennart

On 2012-01-13 11:20, Lennart Regebro wrote:
The vulnerability is basically only in the dictionary you keep the form data you get from a request.
I'd have to disagree with this statement. The vulnerability is anywhere that creates a dictionary (or set) from attacker-provided keys. That would include HTTP headers, RFC822-family subheaders and parameters, the environ, input taken from JSON or XML, and so on - and indeed hash collision attacks are not at all web-specific.

The problem with having two dict implementations is that a caller would have to tell libraries that use dictionaries which implementation to use. So for example an argument would have to be passed to json.load[s] to specify whether the input was known-sane or potentially hostile. Any library that could ever use dictionaries to process untrusted input *or any library that used another library that did* would have to pass such a flag through, which would quickly get very unwieldy indeed... or else they'd have to just always use safedict, in which case we're in pretty much the same position as we are with changing dict anyway. -- And Clover mailto:and@doxdesk.com http://www.doxdesk.com/ gtalk:chat?jid=bobince@gmail.com

We could mix Marc's collision counter with the SafeDict idea (being able to use a different secret for each dict): use hash(key, secret) (simple example: hash(secret+key)) instead of hash(key) in dict (and set), and change the secret if we have more than N collisions. But it would slow down all dict lookups (dict creation, get, set, del, ...). And getting new random data can also be slow. SafeDict and hash(secret+key) lose the benefit of the cached hash result. Because the hash result depends on an argument, we cannot cache the result anymore, and we have to recompute the hash for each lookup (even if you look up the same key twice or more). Victor
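To make the caching point concrete, here is a toy, editor-written sketch of a mapping with a per-instance secret (nothing like the real C implementation; keys are assumed to be str). Note that _keyed() must rerun the full hash on every access, since the hash cached on the string object is useless once the result depends on the dict's secret:

    import os

    class KeyedDict:
        # Toy mapping with a per-instance hash secret.
        def __init__(self):
            self._secret = os.urandom(16)
            self._buckets = {}

        def _keyed(self, key):
            # Recomputed on *every* access - the str object's cached
            # hash cannot be reused.
            return hash(self._secret + key.encode())

        def __setitem__(self, key, value):
            bucket = self._buckets.setdefault(self._keyed(key), [])
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

        def __getitem__(self, key):
            for k, v in self._buckets.get(self._keyed(key), []):
                if k == key:
                    return v
            raise KeyError(key)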

On 1/13/2012 5:35 PM, Victor Stinner wrote:
So integrating SafeDict into dict so it could be automatically converted would mean changing the data structures underneath dict. Given that, a technique for hash caching could be created, that isn't quite as good as the one in place, but may be less expensive than not caching the hashes. It would also take more space: a second dict, internally, as well as the secret.

So once the collision counter reaches some threshold (since there would be a functional fallback, it could be much lower than 1000), the secret is obtained, and the keys are rehashed using hash(secret+key). Now when lookups occur, the object id of the key and the hash of the key are used as the index and hash(secret+key) is stored as a cached value. This would only benefit lookups by the same object; other objects with the same key value would be recalculated (at least the first time). Some limit on the number of cached values would probably be appropriate. This would add complexity, of course, in trying to save time.

An alternate solution would be to convert a dict to a tree once the number of collisions produces poor performance. Converting to a tree would result in O(log N) instead of O(1) lookup performance, but that is better than the degenerate case of O(N) which is produced by the excessive number of collisions resulting from an attack. This would require new tree code to be included in the core, of course, probably a red-black tree, which stays balanced.

In either of these cases, the conversion is expensive, because a collision threshold must first be reached to determine the need for conversion, so the hash could already contain lots of data. If it were too expensive, the attack could still be effective.

Another solution would be to change the collision code, so that colliding keys don't produce O(N) behavior, but some other behavior. Each colliding entry could convert that entry to a tree of entries, perhaps. This would require no conversion of "bad dicts", and an attack could at worst convert O(1) performance to O(log N).

Clearly these ideas are more complex than adding randomization, but adding randomization doesn't seem to produce immunity from attack, when data about the randomness is leaked.
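As a rough, editor-added illustration of the last idea, a sorted list can stand in for the per-bucket red-black tree: lookups become O(log n) binary searches. A real balanced tree would also make insertion O(log n), and it requires keys to be orderable, which arbitrary Python keys are not - one of the interface complications mentioned elsewhere in this thread:

    import bisect

    class TreeishBucket:
        # Stand-in for the balanced tree holding colliding entries:
        # lookup is O(log n) instead of the O(n) scan of a chain.
        def __init__(self):
            self._keys = []
            self._values = []

        def insert(self, key, value):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                self._values[i] = value
            else:
                self._keys.insert(i, key)      # O(n) here; a real tree is O(log n)
                self._values.insert(i, value)

        def lookup(self, key):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                return self._values[i]
            raise KeyError(key)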

Which will not normally happen. I'm firmly in the camp that believes the random seed can be probed and determined by creatively injecting values and measuring the timing of things. But doing that is difficult and time- and bandwidth-intensive, so the per-process random hash seed is good enough. There's another elephant in the room here: if you want to avoid this attack, use a 64-bit Python build, as it uses 64-bit hash values that are significantly more difficult to force a collision on. -gps

btw, Tim's commit message on this one is amusingly relevant. :) http://hg.python.org/cpython/diff/8d2bbbbf2cb9/Objects/dictobject.c On Fri, Jan 13, 2012 at 6:25 PM, Gregory P. Smith <greg@krypto.org> wrote:

Victor Stinner wrote:
Am I missing something? How does this fix the vulnerability? It seems to me that the only thing this does is turn one sort of DOS attack into another sort of DOS attack: hostile users will just cause hash collisions until an exception is raised and the application falls over. Catching these exceptions, and recovering from them (how?), would be the responsibility of the application author. Given that developers are unlikely to ever see 1000 collisions by accident, or even realise that it could happen, I don't expect that many people will do that -- until they personally get bitten. -- Steven

On Sun, Jan 15, 2012 at 2:42 PM, Steven D'Aprano <steve@pearwood.info> wrote:
As I understand it, the way the attack works is that a *single* malicious request from the attacker can DoS the server by eating CPU resources while evaluating a massive collision chain induced in a dict by attacker supplied data. Explicitly truncating the collision chain boots them out almost immediately (likely with a 500 response for an internal server error), so they no longer affect other events, threads and processes on the same machine.

In some ways, the idea is analogous to the way we implement explicit recursion limiting in an attempt to avoid actually blowing the C stack - we take a hard-to-detect-and-hard-to-handle situation (i.e. blowing the C stack or malicious generation of long collision chains in a dict) and replace it with something that is easy to detect and can be handled by normal exception processing (i.e. a recursion depth exception or one reporting an excessive number of slot collisions in a dict lookup).

That then makes the default dict implementation safe from this kind of attack by default, and use cases that are getting that many collisions legitimately can be handled in one of two ways:
- switch to a more appropriate data type (if you're getting that many collisions with benign data, a dict is probably the wrong container to be using)
- offer a mechanism (command line switch or environment variable) to turn the collision limiting off

Now, where you can still potentially run into problems is if a single shared dict is used to store both benign and malicious data - if the malicious data makes it into the destination dict before the exception finally gets triggered, and then benign data also happens to trigger the same collision chain, then yes, the entire app may fall over. However, such an app would have been crippled by the original DoS anyway, since its performance would have been gutted - the collision chain limiting just means it will trigger exceptions for the cases that would have been insanely slow.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
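In rough pseudocode, the collision limiting described above is just one extra counter in the open-addressing lookup loop (editor's model of the idea, not Marc-Andre Lemburg's actual patch):

    MAX_COLLISIONS = 1000

    def lookup(table, mask, key):
        # table: list of (hash, key, value) entries or None; mask: len(table) - 1
        h = hash(key)
        i = h & mask
        perturb = h
        collisions = 0
        while True:
            entry = table[i]
            if entry is None:
                return None                 # empty slot: key is absent
            if entry[0] == h and entry[1] == key:
                return entry[2]             # found
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise RuntimeError("too many hash collisions; possible attack")
            i = (5 * i + 1 + perturb) & mask
            perturb >>= 5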

This is only true in the specific attack presented at 28c3. If an attacker can insert data without triggering the attack, it's possible to produce (in the example of a web application) urls that (regardless of the request) always produce pathological behavior. For example, a collection of pathological usernames might make it impossible to list users (and so choose which ones to delete) without resorting to removing the problem data at an SQL level. This is why the "simply throw an error" solution isn't a complete fix. Making portions of an interface unusable for regular users is clearly a bad thing, and is clearly applicable to other types of poisoned data as well. We need to detect collisions and work around them transparently.
We can do better than saying "it would have been broken before, it's broken differently now". The universal hash function idea has merit, and for practical purposes hash randomization would fix this too (since colliding data is only likely to collide within a single process, persistent poisoning is far less feasible). -Paul

On 17 January 2012 09:23, Paul McMillan <paul@mcmillan.ws> wrote:
What if in a pathological collision (e.g. > 1000 collisions), we increased the size of a dict by a small but random amount? Should be transparent, have negligible speed penalty, maximal reuse of existing code, and should be very difficult to attack since the dictionary would change size in a (near) non-deterministic manner when being attacked (i.e. first attack causes non-deterministic remap, next attack should fail). It should also have near-zero effect on existing tests and frameworks since we would only get the non-deterministic behaviour in pathological cases, which we would presumably need new tests for. Thoughts? Tim Delaney

On 17 January 2012 10:14, Tim Delaney <timothy.c.delaney@gmail.com> wrote:
And one thought I had immediately after hitting send is that there could be an attack of the form "build a huge dict, then hit it with something that causes it to rehash due to >1000 collisions". But that's not really going to be any worse than just building a huge dict and hitting a resize anyway. Tim Delaney

2012/1/17 Tim Delaney <timothy.c.delaney@gmail.com>:
What if in a pathological collision (e.g. > 1000 collisions), we increased the size of a dict by a small but random amount?
It doesn't change anything, you will still get collisions. Victor

On Mon, Jan 16, 2012 at 4:16 PM, Victor Stinner < victor.stinner@haypocalc.com> wrote:
That depends right? If the collision is because they all have the same hash(), yes. It might be different if it is because the secondary hashing (or whatever it's called :-) causes collisions. -- --Guido van Rossum (python.org/~guido)

But Python deals with the latter case just fine already. The open hashing approach relies on the dict resizing "enough" to prevent collisions after the dictionary has grown. Unless somebody can demonstrate a counter example, I believe this discussion is a red herring. Plus: if an attacker could craft keys that deliberately cause collisions because of the dictionary size, they could likely also craft keys in the same number that collide on actual hash values, bringing us back to the original problem. Regards, Martin

martin@v.loewis.de writes:
I thought that the original problem was that with N insertions in the dictionary, by repeatedly inserting different keys generating the same hash value an attacker could arrange that the cost of finding an open slot is O(N), and thus the cost of N insertions is O(N^2). If so, frequent resizing could make the attacker's problem much more difficult, as the distribution of secondary probes should change with each resize.

The attack creates 60,000 strings (or more) with exactly the same hash value. A dictionary uses hash(str) & DICT_MASK to compute the bucket index, where DICT_MASK is the number of buckets minus one. If all strings have the same hash value, we always start in the same bucket and the key has to be compared to all previous strings to find the next empty bucket. The attack works because a LOT of strings are compared and comparing strings is slow. If hash(str1)&DICT_MASK == hash(str2)&DICT_MASK but hash(str1)!=hash(str2), the strings are not compared (this is a common optimization in Python), and so the attack would not be successful (it would be slow, but not as slow as comparing two strings). Victor
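The optimization Victor mentions looks roughly like this (editor's sketch of the logic, not CPython's actual lookdict code):

    def slot_matches(entry_hash, entry_key, h, key):
        # CPython stores the full hash with each entry and compares it
        # first; the slow string comparison only runs when the full
        # 32/64-bit hashes are equal. The attack therefore needs strings
        # with identical full hash values, not merely identical
        # hash & DICT_MASK bucket indices.
        if entry_hash != h:
            return False
        return entry_key is key or entry_key == key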

Not sure what you mean by "distribution of secondary probes". Let H be the initial hash value, and let MASK be the current size of the dictionary minus one. Then I(n), the sequence of dictionary indices being probed, is computed as

    I(0) = H & MASK
    PERTURB(0) = H
    I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
    PERTURB(n+1) = PERTURB(n) >> 5

So if two objects O1 and O2 have the same hash value H, the sequence of probed indices is the same for any MASK value. It will be a different sequence, yes, but they will still collide on each and every slot. This is the very nature of open addressing. If it *wouldn't* try all indices in the probe sequence, it may not be possible to perform the lookup for a key correctly. Regards, Martin
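The recurrence is easy to model (editor's sketch). It also illustrates the point of this subthread: keys with identical full hashes follow identical probe sequences by construction, for every mask, so resizing cannot separate them - while keys whose hashes agree only in the low bits do separate once PERTURB shifts the differing bits into range:

    import itertools

    def probe_sequence(h, mask):
        # the index sequence Martin quotes from Objects/dictobject.c
        i = h & mask
        perturb = h
        while True:
            yield i
            i = (5 * i + 1 + perturb) & mask
            perturb >>= 5

    def probes(h, mask, n=8):
        return list(itertools.islice(probe_sequence(h, mask), n))

    h1 = 0x5A5A5A5A
    h2 = h1 ^ (1 << 20)          # same low bits, different full hash
    assert probes(h1, 0xFF)[0] == probes(h2, 0xFF)[0]  # collide on slot 0
    assert probes(h1, 0xFF) != probes(h2, 0xFF)        # ... then diverge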

On 01/17/2012 09:29 PM, "Martin v. Löwis" wrote:
Open addressing can still deploy a collision resolution mechanism without this property. For example, double hashing uses a different hash function (applied to the key) to calculate PERTURB(0). To defeat it, the attacker would have to produce keys that hash the same using both hash functions. Double hashing is not a good general solution for Python dicts because it complicates the interface of hash tables that support arbitrary keys. Still, it could be considered for dicts with known key types (built-ins could hardcode the alternative hash function) or for SafeDicts, if they are still considered. Hrvoje
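A minimal, editor-added sketch of the double hashing Hrvoje describes (h2 is assumed to come from a second, independent hash function of the key):

    def double_hash_probes(h1, h2, mask):
        # Double hashing: the probe step depends on a second hash of the
        # key. Forcing the step to be odd guarantees the sequence visits
        # every slot when the table size is a power of two.
        i = h1 & mask
        step = (h2 & mask) | 1
        while True:
            yield i
            i = (i + step) & mask

    # An attacker now needs keys that collide under *both* hash functions
    # to recreate the long collision chains.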

I finished my patch transforming hash(str) to a randomized hash function, see random-8.patch attached to the issue: http://bugs.python.org/issue13703 The remaining question is which random number generator should be used on Windows to initialize the hash secret (CryptGenRandom adds an overhead of 10%, at least when the DLL is loaded dynamically); read the issue for the details. I plan to commit my fix to Python 3.3 if it is accepted. Then write a simplified version for Python 3.2 and backport it to 3.1. Then backport the simplified fix to 2.7, and finally to 2.6. The vulnerability has been public for a month; it is maybe time to fix it before it is widely exploited. Victor

I plan to commit my fix to Python 3.3 if it is accepted. Then write a simplified version to Python 3.2 and backport it to 3.1.
I'm opposed to any change to the hash values of strings in maintenance releases, so I guess I'm opposed to your patch in principle. See my next message for an alternative proposal.
The vulnerability has been public for a month; it is maybe time to fix it before it is widely exploited.
I don't think there is any urgency. The vulnerability has been known for more than five years now. From creating a release to the point where the change actually arrives at end users, many months will pass. Regards, Martin

If randomized hashes cannot be turned on by default, an alternative is to switch them off by default, and add an option (command line option, environment variable, etc.) to enable them.
In 2003, Python was not seen as vulnerable. Maybe because the hash function is different from Perl's hash function, or because nobody tried to generate collisions. Today it is clear that Python is vulnerable (the 64-bit version is also affected), and it's really fast to generate collisions using the right algorithm. Why is it taking so long to fix the vulnerability in Python, whereas it was fixed quickly in Ruby? (they chose to use a randomized hash) Victor

That won't really fix the problem. If people install a new release because it fixes a vulnerability, it had better actually do so.
There is the common vulnerability to the threat of confusing threats with vulnerabilities [1]. Python was vulnerable all the time, and nobody claimed otherwise. It's just that nobody saw it as a threat. I still don't see it as a practical threat, as there are many ways that people use in practice to protect against this threat already. But I understand that others feel threatened now.
Why is it taking so long to fix the vulnerability in Python, whereas it was fixed quickly in Ruby? (they chose to use a randomized hash)
Because the risk of breakage for Python is much higher than it is for Ruby. Regards, Martin [1] http://jps.anl.gov/Volume4_iss2/Paper3-RGJohnston.pdf

On Tue, Jan 17, 2012 at 12:52 PM, "Martin v. Löwis" <martin@v.loewis.de>wrote:
Please at least consider his patch for 3.3 onwards then. Changing the hash seed per interpreter instance / process is the right thing to do going forward. What to do on maintenance releases is a separate discussion. -gps

On 18.01.2012 07:06, Gregory P. Smith wrote:
For 3.3 onwards, I'm skeptical whether all this configuration support is really necessary. I think a much smaller patch which leaves no choice would be more appropriate. Regards, Martin

2012/1/18 "Martin v. Löwis" <martin@v.loewis.de>:
The configuration helps unit testing: see the changes to Lib/test/*.py in my last patch. I hesitate to say that the configuration is required for tests. Anyway, users upgrading from Python 3.2 to 3.3 may need to keep the same hash function and not care about security (e.g. programs running locally with trusted data). Victor
participants (26)
- "Martin v. Löwis"
- And Clover
- Antoine Pitrou
- Barry Warsaw
- Benjamin Peterson
- Frank Sievertsen
- Frank Sievertsen
- Glenn Linderman
- Gregory P. Smith
- Guido van Rossum
- Heiko Wundram
- Hrvoje Niksic
- Hynek Schlawack
- Jack Diederich
- Jeremy Sanders
- Lennart Regebro
- Mark Dickinson
- martin@v.loewis.de
- Nick Coghlan
- Paul McMillan
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Delaney
- Victor Stinner