Hash collision security issue (now public)

Hello all,

A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages, Python included:

http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_...

Although it's a security issue I'm posting it here because it is now public and seems important. The issue they report can cause (for example) handling an http post to consume horrible amounts of cpu. For Python the figures they quoted:

- reasonable-sized attack strings only for 32 bits
- Plone has max. POST size of 1 MB
- 7 minutes of CPU usage for a 1 MB request
- ~20 kbit/s → keep one Core Duo core busy

This was apparently reported to the security list, but hasn't been responded to beyond an acknowledgement on November 24th (the original report didn't make it onto the security list because it was held in a moderation queue). The same vulnerability was reported against various languages and web frameworks, and is already fixed in some of them. Their recommended fix is to randomize the hash function.

All the best,

Michael

--
http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing
http://www.sqlite.org/different.html

On Wednesday, December 28, 2011 at 8:28 PM, Michael Foord wrote:
Back up link for the PDF: http://dl.dropbox.com/u/1374/2007_28C3_Effective_DoS_on_web_application_plat... Ocert disclosure: http://www.ocert.org/advisories/ocert-2011-003.html jesse

On Wednesday, December 28, 2011 at 8:37 PM, Jesse Noller wrote:
And more analysis/information: http://cryptanalysis.eu/blog/2011/12/28/effective-dos-attacks-against-web-ap...

A few thoughts on this:

a) This is not a new issue; I'm curious what the new interest is in it.

b) Whatever the solution to this is, it is *not* CPython specific; any decision should be reflected in the Python language spec, IMO. If CPython has the semantic that dicts aren't vulnerable to hash collision then users *will* rely on this, and another implementation having a different (valid) behavior opens up users to security issues.

c) I'm not convinced a randomized hash is appropriate for the default dict, for a number of reasons: it's a performance hit on every dict operation; using a per-process seed means you can't compute the hash of an object at Python's compile time; a per-dict seed inhibits a bunch of other optimizations. These may not be relevant to CPython, but they are to PyPy and probably the invoke-dynamic work on Jython (pursuant to point b).

Therefore I think these should be considered application issues. Since request limiting is difficult and error prone, I'd recommend the Python stdlib including a non-hash based map (such as a binary tree).

Alex

On Wed, Dec 28, 2011 at 19:51, Alex Gaynor <alex.gaynor@gmail.com> wrote:
A few thoughts on this:
a) This is not a new issue, I'm curious what the new interest is in it.
Well they (the presenters of the report) had to be accepted to that conference for *something*, otherwise we wouldn't know they exist.

On 29.12.2011 02:37, Jesse Noller wrote:
---
Python uses a hash function which is very similar to DJBX33X, which can be broken using a meet-in-the-middle attack. It operates on register size and is thus different for 64 and 32 bit machines. While generating multi-collisions efficiently is also possible for the 64 bit version of the function, the resulting colliding strings are too large to be relevant for anything more than an academic attack.

Plone as the most prominent Python web framework accepts 1 MB of POST data, which it parses in about 7 minutes of CPU time in the worst case. This gives an attacker with about 20 kbit/s the possibility to keep one Core Duo core constantly busy. If the attacker is in the position to have a Gigabit line available, he can keep about 50,000 Core Duo cores busy.
---

If I remember correctly, CPython uses the C long type for its hash values, so 64-bit Windows (where long is 32 bits) uses a 32-bit hash.
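To make the quoted weakness concrete: Python's function is in the DJBX33X family (multiply, then xor), which is why the paper needs a meet-in-the-middle attack there. The simpler additive cousin (DJBX33A-style, multiply-then-add, used e.g. by PHP) shows how equivalent substrings compose into exponentially many collisions. This is a sketch of the additive variant, not CPython's actual function:

```python
import itertools

MASK = 2**32 - 1  # 32-bit register, per the quoted figures

def djb33a(s):
    """Multiply-then-add hash in the DJBX33A family (not CPython's,
    which xors the character in instead of adding it)."""
    h = 5381
    for ch in s:
        h = (33 * h + ord(ch)) & MASK
    return h

# Processing a 3-char block is affine in the running state:
#   S -> 33**3 * S + (33**2*c1 + 33*c2 + c3)   (mod 2**32)
# Two blocks with equal constant terms are interchangeable anywhere in
# the string. "abc" and "acB" are such a pair: raising c2 by 1 adds 33,
# and lowering c3 by 33 cancels it again.
a, b = "abc", "acB"
assert djb33a(a) == djb33a(b)

# Every interleaving of the two blocks collides: 2**8 keys, one hash.
keys = ["".join(blocks) for blocks in itertools.product([a, b], repeat=8)]
assert len(set(keys)) == 256
assert len({djb33a(k) for k in keys}) == 1
```

Inserting such a key set into a hash table degrades each insert to a linear scan of the colliding chain, which is the quadratic blowup behind the 7-minutes-per-MB figure.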

On Wed, Dec 28, 2011 at 5:37 PM, Jesse Noller <jnoller@gmail.com> wrote:
Discussion of hash functions in general: http://burtleburtle.net/bob/hash/doobs.html

Two of the best hash functions that currently exist: http://code.google.com/p/cityhash/ and http://code.google.com/p/smhasher/wiki/MurmurHash.

I'm not sure exactly what problem the paper is primarily complaining about:

1. Multiply+add and multiply+xor hashes are weak: this would be solved by changing to either of the better-and-faster hashes I linked to above. On the other hand: http://mail.python.org/pipermail/python-3000/2007-September/010327.html

2. It's possible to find collisions in any hash function in a 32-bit space: only solved by picking a varying seed at startup or compile time.

If you decide to change to a stronger hash function overall, it might also be useful to change the advice "to somehow mix together (e.g. using exclusive or) the hash values for the components" in http://docs.python.org/py3k/reference/datamodel.html#object.__hash__. hash(tuple(components)) will likely be better if tuple's hash is improved.

Hash functions are already unstable across Python versions. Making them unstable across interpreter processes (multiprocessing doesn't share dicts, right?) doesn't sound like a big additional problem. Users who want a distributed hash table will need to pull their own hash function out of hashlib or re-implement a non-cryptographic hash instead of using the built-in one, but they probably need to do that already to allow themselves to upgrade Python.

Jeffrey
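The suggested hash(tuple(components)) pattern is easy to illustrate with a hypothetical class (the class and names are mine, not from the thread):

```python
class Point:
    """Hypothetical example of the hash(tuple(components)) advice."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegating to tuple's hash keeps order sensitivity; xor-ing
        # component hashes would collapse Point(1, 2) and Point(2, 1).
        return hash((self.x, self.y))

assert hash(Point(1, 2)) == hash(Point(1, 2))
assert hash(Point(1, 2)) != hash(Point(2, 1))
```

If tuple's hash were strengthened, every class written this way would inherit the improvement for free.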

On Sat, Dec 31, 2011 at 4:04 PM, Jeffrey Yasskin <jyasskin@gmail.com> wrote:
Here's an idea. Suppose we add a sys.hash_seed or some such, that's settable to an int, and defaults to whatever we're using now. Then programs that want a fix can just set it to a random number, and on Python versions that support it, it takes effect. Everywhere else it's a silent no-op.

Downside: sys has to have slots for this to work; does sys actually have slots? My memory's hazy on that. I guess actually it'd have to be sys.set_hash_seed(). But same basic idea. Anyway, this would make fixing the problem *possible*, while still pushing off the hard decisions to the app/framework developers. ;-)

Downside: every hash operation includes one extra memory access, but strings only compute their hash once anyway.

Given that changing dict won't help, and changing the default hash is a non-starter, an option to set the seed is probably the way to go. (Maybe with an environment variable and/or command line option so users can work around old code.)

ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). The seed should be unique per process, except it should survive fork() (but not exec()). I'm not worried about unrelated processes needing to have the same hash(), but I'm not against offering an env variable or command line flag to force the seed.

I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash (as long as they can't force the seed -- this actually argues slightly *against* offering a way to force the seed, except that we have strong backwards compatibility requirements).

We need to fix this as far back as Python 2.6, and it would be nice if a source patch was available that works on Python 2.5 -- personally I do have a need for a 2.5 fix and if nobody creates one I will probably end up backporting the fix from 2.6 to 2.5.

Is there a tracker issue yet? The discussion should probably move there.

PS. I would propose a specific fix but I can't seem to build a working CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this error late in the build:

    ./python.exe -SE -m sysconfig --generate-posix-vars
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    Traceback (most recent call last):
      File "/Users/guido/cpython/Lib/io.py", line 60, in <module>
    make: *** [Lib/_sysconfigdata.py] Abort trap

--
--Guido van Rossum (python.org/~guido)

On Sat, Dec 31, 2011 at 4:56 PM, Guido van Rossum <guido@python.org> wrote:
FWIW I managed to build Python 2.6, and a trivial mutation of the string/unicode hash function (add 1 initially) made only three tests fail: test_symtable and test_json both have a dependency on dictionary order; test_ctypes I can't quite figure out what's going on.

Oh, and an unrelated failure in test_sqlite:

    File "/Users/guido/pythons/p26/Lib/sqlite3/test/types.py", line 355, in CheckSqlTimestamp
      self.failUnlessEqual(ts.year, now.year)
    AssertionError: 2012 != 2011

I betcha that's because it's still 2011 here in Texas but already 2012 in UTC-land.

Happy New Year everyone! :-)

--
--Guido van Rossum (python.org/~guido)

This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process.

The goal isn't perfection, but we need to do better than a simple salt.

I propose we modify the string hash function like this: https://gist.github.com/0a91e52efa74f61858b5

This code is based on PyPy's implementation, but the concept is universal. Rather than choosing a single short random seed per process, we generate a much larger random seed (r). As we hash, we deterministically choose a portion of that seed and incorporate it into the hash process. This modification is a minimally intrusive change to the existing hash function, and so should not introduce unexpected side effects which might come from switching to a different class of hash functions.

I've worked through this code with Alex Gaynor, Antoine Pitrou, and Victor Stinner, and have asked several mathematicians and security experts to review the concept. The reviewers who have gotten back to me thus far have agreed that if the initial random seed is not flawed, this should not overly change the properties of the hash function, but should make it quite difficult for an attacker to deduce the necessary information to predictably cause hash collisions. This function is not designed to protect against timing attacks, but should be nontrivial to reverse even with access to timing data.

Empirical testing shows that this unoptimized Python implementation produces a ~10% slowdown in the hashing of ~20 character strings. This is probably an acceptable trade-off, and actually provides better performance in the case of short strings than a high-entropy fixed-length seed prefix.

-Paul

On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan <paul@mcmillan.ws> wrote:
Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced.
The goal isn't perfection, but we need to do better than a simple salt.
Perhaps.
I'm not sure I understand this. What's the worry about "a different class of hash functions"? (It may be clear that I do not have a deep mathematical understanding of hash functions.)
I forget -- what do we do on systems without urandom()? (E.g. Windows?)
Let's worry about timing attacks another time okay?
Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. But I like the idea of a bigger random seed from which we deterministically pick some part.

How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it?

--
--Guido van Rossum (python.org/~guido)

I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference.
This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs.
I forget -- what do we do on systems without urandom()? (E.g. Windows?)

Windows has CryptGenRandom, which is approximately equivalent.

Let's worry about timing attacks another time okay?

Agreed. As long as there isn't a gaping hole, I'm fine with that.
Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed.
But I like the idea of a bigger random seed from which we deterministically pick some part.
Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of "just brute force the key".
We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them.

Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings.

We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits.

-Paul
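The bit-truncation idea works whenever the seed array length is a power of two (as with the 2048-int array in the revised gist); a quick sanity check with hypothetical values:

```python
LEN_R = 2048        # power of two, so the modulo can become a mask
MASK = LEN_R - 1    # 0b11111111111

# For non-negative x, x % 2**k == x & (2**k - 1), and the AND is
# cheaper than an integer division on most hardware.
for x in (0, 1, 2047, 2048, 123456789, 2**63 - 1):
    assert x % LEN_R == x & MASK
```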

On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <paul@mcmillan.ws> wrote:
Ah, but the effect of that long string is summarized in a single (32- or 64-bit) integer.
Agreed.
Yup.
But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.

It is important that this function touches every character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest.
I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings.
Still, much more work for the attacker.
I like that.
Sure. Thanks for thinking about all the details here!! -- --Guido van Rossum (python.org/~guido)

Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. -- --Guido van Rossum (python.org/~guido)

PS. Is the collision-generator used in the attack code open source? -- --Guido van Rossum (python.org/~guido)

On 1/1/2012 10:13 AM, Guido van Rossum wrote:
PS. Is the collision-generator used in the attack code open source?
As I posted before, Alexander Klink and Julian Wälde gave their project email as hashDoS@alech.de. Since they indicated disappointment in not hearing from Python, I presume they would welcome engagement. -- Terry Jan Reedy

On 01.01.2012 19:45, Terry Reedy wrote:
Somebody should contact Alexander and Julian to let them know that we are working on the matter. It should be somebody "official" for the initial contact, too. I've included Guido (BDFL), Barry (their initial security contact) and MvL (most prominent German core dev) in CC, as they are the logical choices for me. I'm willing to have a phone call with them once contact has been established. IMHO it's slightly easier to talk in one's native tongue -- Alexander and Julian are German, too.

Christian

On 01/02/2012 04:47 PM, Christian Heimes wrote:
I wouldn't expect too much -- they seem rather keen on cheap laughs: http://twitter.com/#!/bk3n/status/152068096448921600/photo/1/large Georg

On Jan 02, 2012, at 06:38 PM, Georg Brandl wrote:
I wouldn't expect too much -- they seem rather keen on cheap laughs:
http://twitter.com/#!/bk3n/status/152068096448921600/photo/1/large
Heh, so yeah, it won't be me contacting them. -Barry

On 01.01.2012 16:13, Guido van Rossum wrote:
I had the same concern as you and was worried that projects like ZODB might require a stable hash function. Fred already stated that ZODB doesn't use the hash in its btree structures. Possible solutions:

* make it possible to provide the seed as an env var
* disable randomizing as the default setting, or at least add an option to disable randomization

IMHO the issue needs a PEP that explains the issue, shows all possible solutions and describes how we have solved the issue. I'm willing to start a PEP. Who would like to be the co-author?

Christian

I agree that there are use cases for allowing users to choose the random seed, in much the same way it's helpful to be able to set it for the random number generator. This should probably be something that can be passed in at runtime. This feature would also be useful for users who want to synchronize the hashes of multiple independent processes, for whatever reason. For the general case though, randomization should be on by default. -Paul

But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.
That's a fair point. If we go down that avenue, I think simply choosing a random fixed starting value for x is the correct choice, rather than computing an intermediate value.
Yep, agreed. My original proposal did not adequately address this.
I believe this is a reasonable solution. An attacker could still manipulate the internal state of long strings, but the additional information at both ends should make that difficult to exploit. I'll talk it over with the reviewers.
Again, I'll talk to the reviewers (and run the randomness test battery) to be double-check that this doesn't affect the distribution in some unexpected way, but I think it should be fine.
PS. Is the collision-generator used in the attack code open source?
Not in working form, and they've turned down requests for it from other projects that want to check their work. If it's imperative that we have one, I can write one, but I'd rather not spend the effort if we don't need it. -Paul

On 01.01.2012 06:57, Paul McMillan wrote:
Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2 MB (2**21 - 1) of data from urandom. I'm worried that this is going to exhaust the OS's random pool and suck it dry. We shouldn't forget that Python is used for long running processes as well as short scripts. Your suggestion also increases the process size by 2 MB, which is going to be an issue for mobile and embedded platforms.

How about this:

    r = [ord(i) for i in os.urandom(256)]
    rs = os.urandom(4)  # or 8 ?
    seed = ord(rs[-1]) + (ord(rs[-2]) << 8) + (ord(rs[-3]) << 16) + (ord(rs[-4]) << 24)

    def _hash_string(s):
        """The algorithm behind compute_hash() for a string or a unicode."""
        from pypy.rlib.rarithmetic import intmask
        length = len(s)
        if length == 0:
            return -1
        x = intmask(seed + (ord(s[0]) << 7))
        i = 0
        while i < length:
            o = ord(s[i])
            x = intmask((1000003 * x) ^ o ^ r[o % 0xff])
            i += 1
        x ^= length
        return intmask(x)

This combines a random seed for the hash with your suggestion. We also need to special-case short strings. The above routine hands the seed over to an attacker who is able to retrieve lots of single-character hashes. The randomization shouldn't be used if we can prove that it's not possible to create hash collisions for strings shorter than X. For example 64-bit FNV-1 has no collisions for 8 chars or less, 32-bit FNV has no collisions for 4 chars or less.

Christian

I fixed a couple things in my proposed algorithm: https://gist.github.com/0a91e52efa74f61858b5

- I had a typo, and used 21 instead of 12 for the size multiplier. We definitely don't need 2 MB of random data.
- The initialization of r was broken. Now it is an array of ints, so there's no conversion when it's used. I've adjusted it so there's 8 kB of random data, broken into 2048 ints.
- I added a length-based seed to the initial value of x. This prevents single characters from being used to enumerate raw values from r. This is similar to the change proposed by Christian Heimes.
- Most importantly, I moved the xor with r[x % len_r] down a line. Before, it wasn't being applied to the last character.
The updated code always includes at least 2 lookups from r, which I believe solves the single-character enumeration problem. If we special-case part of our hash function for short strings, we may get suboptimal collisions between the two types of hashes. I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function. For the record, cryptographically strong hash functions are in the neighborhood of 400% slower than our existing hash function.
I've been talking to them. They're happy to look at our proposed changes. They indicate that a non-zero start value is sufficient to prevent the attack, but D. J. Bernstein disagrees with them. He also has indicated a willingness to look at our solution. -Paul

On Sun, 1 Jan 2012 21:55:52 -0800 Paul McMillan <paul@mcmillan.ws> wrote:
Shouldn't it be r[i % len(r)] instead? (refer to yesterday's #python-dev discussion)
I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function.
Again, we could re-use FNV-1's primes, since they claim they have better dispersion properties than the average prime. Regards Antoine.

On 02.01.2012 06:55, Paul McMillan wrote:
I've pushed a new patch: http://hg.python.org/features/randomhash/rev/0a65d2462e0c

The changeset adds the murmur3 hash algorithm with some minor changes, for example more random seeds. At first I was worried that murmur might be slower than our old hash algorithm. But in fact it seems to be faster! Pybench, 10 rounds on my Core2 Duo 2.60:

    py3k:       3.230 sec
    randomhash: 3.182 sec

Christian

On Mon, Jan 2, 2012 at 7:18 PM, Christian Heimes <lists@cheimes.de> wrote:
It seems that for the 32-bit version you are using the pid for the two constants. Also, it's unclear why you even need to use a random constant for the final pass; you already use a random constant as the initial h1, and that should be enough, no need to use one for k1. Same for the 128-bit version: k1, k2, k3 and k4 should be initialized to zero; these are key data, they don't need to be mixed with anything. Also, I'm not sure how portable the always_inline attribute is; is it supported on all compilers and all platforms?

On Mon, Jan 2, 2012 at 10:53 PM, Alexey Borzenkov <snaury@gmail.com> wrote:
Sorry, sent too soon. What I meant is that you're initializing a pretty big array of values when you only need a 32-bit value. The pid, in my opinion, might be too predictable; it would be a lot better to simply hash pid and gettimeofday bytes to produce this single 32-bit value and use it for h1, h2, h3 and h4 in both the 32-bit and 128-bit versions.

On 01/01/2012 04:29, Paul McMillan wrote:
If we want to protect a website against this attack for example, we must suppose that the attacker can inject arbitrary data and can get (indirectly) the result of hash(str) (e.g. with the representation of a dict in a traceback, with a JSON output, etc.).
The goal isn't perfection, but we need to do better than a simple salt.
I disagree. I don't want to break backward compatibility and have a hash() function that differs between processes, if the change is not an effective protection against the "hash vulnerability".

It's really hard to write a good (secure) hash function: see for example the recent NIST competition (started in 2008, will end this year). Even good security researchers are unable to write a strong and fast hash function. It's easy to add a weakness to the function if you don't have a good background in cryptography. The NIST competition gives 4 years to analyze new hash functions. We should not rush to add a quick "hack" if it doesn't solve the problem correctly (protect against a collision attack and a preimage attack).

http://en.wikipedia.org/wiki/NIST_hash_function_competition
http://en.wikipedia.org/wiki/Collision_attack

Runtime performance does matter. I'm not completely sure that changing Python is the best place to add a countermeasure against a vulnerability. I don't want to slow down numpy for a web vulnerability. Because there are different use cases, a better compromise is maybe to add a runtime option to use a secure hash function, and keep the unsafe but fast hash function by default.
I propose we modify the string hash function like this:
Always allocating 2**21 bytes just to work around one specific kind of attack is not acceptable. I suppose the maximum acceptable is 4096 bytes (or better, 256 bytes). Cryptographic hash functions don't need random data, so why would Python need 2 MB (!) for its hash function?

Victor

On Sat, Dec 31, 2011 at 9:11 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hm, you're right, that's no longer a concern. (Though ATM the hashes still *are* compatible.)
Merry hashes, weakrefs and thread-local memoryviews to everyone!
:-) -- --Guido van Rossum (python.org/~guido)

On 01.01.2012 00:56, Guido van Rossum wrote:
I've created a clone at http://hg.python.org/features/randomhash/ as a testbed. The code creates the seed very early in Py_InitializeEx(). The seed-creation method isn't called on fork(), but it is on exec().
The speakers claim, and have shown, that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul has listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. A hashing algorithm without a meet-in-the-middle vulnerability would reduce the pressure on a good and secure seed, too.
+1 Should the randomization be disabled on 2.5 to 3.2 by default to reduce backward compatibility issues? Christian

On Sunday 01 January 2012 at 17:34 +0100, Christian Heimes wrote:
I don't understand. FNV-1 multiplies the current running result with a prime and then xors it with the following byte. This is also what we do. (I'm assuming 1000003 is prime.) I see two differences:

- FNV starts with a non-zero constant offset basis
- FNV uses a different prime than ours

(As a side note, FNV operates on bytes, but for unicode we must operate on code points in [0, 1114111]; although arguably the common case is hashing ASCII substrings (protocol tokens etc.).)

Regards

Antoine.
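For reference, here is 64-bit FNV-1 as described on the linked page, written next to the differences listed above (a sketch; the prime and offset basis are the published FNV-64 parameters):

```python
FNV64_PRIME = 1099511628211
FNV64_OFFSET = 14695981039346656037   # non-zero offset basis
MASK64 = 2**64 - 1

def fnv1_64(data):
    """64-bit FNV-1 over a byte string."""
    h = FNV64_OFFSET
    for byte in data:
        # same multiply-then-xor shape as CPython's string hash loop;
        # the differences are the non-zero start and the prime used
        h = ((h * FNV64_PRIME) ^ byte) & MASK64
    return h

assert fnv1_64(b"") == FNV64_OFFSET
assert fnv1_64(b"ab") != fnv1_64(b"ba")
```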

On 01.01.2012 17:54, Antoine Pitrou wrote:
There must be a major difference somewhere inside the algorithm. The talk at the CCC conference in Berlin mentions that Ruby 1.9 is not vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C code of FNV is more complex than our code, too. Christian

Sufficient against their current attack. But will it last? For a long-running server, there must be plenty of ways information can leak that will help guessing that start value.

The alternative, to provide a dict-like data structure for use with untrusted input, deserves consideration. Perhaps something simpler than a balanced tree would do? How about a dict-like class that is built on a lazily sorted list? Insertions basically just do list.append and set a dirty-flag, and lookups use bisect, sorting first if the dirty-flag is set. It wouldn't be a complete dict replacement by any means; mixing insertions and lookups would have terrible performance. But for something like POST parameters it should be good enough.

I half expected to find something like that on ActiveState recipes already, but couldn't find any.

regards,

Anders
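A minimal sketch of that lazily sorted structure (class name and details mine, not a tested POST-parameter container):

```python
import bisect

class LazySortedMap:
    """Dict-like map on a lazily sorted list; lookups use bisect, so
    attacker-chosen keys cannot trigger hash-collision blowup."""

    def __init__(self):
        self._items = []      # (key, value) pairs, possibly unsorted
        self._dirty = False

    def __setitem__(self, key, value):
        self._items.append((key, value))
        self._dirty = True

    def _ensure_sorted(self):
        if self._dirty:
            self._items.sort(key=lambda kv: kv[0])   # stable sort
            deduped = []
            for kv in self._items:                   # keep newest binding per key
                if deduped and deduped[-1][0] == kv[0]:
                    deduped[-1] = kv
                else:
                    deduped.append(kv)
            self._items = deduped
            self._dirty = False

    def __getitem__(self, key):
        self._ensure_sorted()
        # (key,) sorts just before any (key, value) pair
        i = bisect.bisect_left(self._items, (key,))
        if i < len(self._items) and self._items[i][0] == key:
            return self._items[i][1]
        raise KeyError(key)
```

Insertion stays cheap until the next lookup; a burst of inserts followed by lookups (the POST-parameter pattern) costs one O(n log n) sort in total.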

On Dec 31, 2011, at 04:56 PM, Guido van Rossum wrote:
Is there a tracker issue yet? The discussion should probably move there.
I think the answer to this was "no"... until now. http://bugs.python.org/issue13703 Proposed patches should be linked to this issue now. Please nosy yourself if you want to follow the progress. Cheers, -Barry

On Wed, Dec 28, 2011 at 6:28 PM, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
Ironically, this morning I ran across a discussion from about 8 years ago on basically the same thing: http://mail.python.org/pipermail/python-dev/2003-May/035874.html From what I read in the thread, it didn't seem like anyone here was too worried about it. Does this new research change anything? -eric

FWIW, Uncle Timmy considers the non-randomized hashes to be a virtue. It is believed that they give us better-than-random results for commonly encountered datasets. A change to randomized hashes would have a negative performance impact on those cases.

Also, randomizing the hash wreaks havoc on doctests, on book examples not matching actual dict reprs, and on efforts by users to optimize the insertion order into dicts with frequent lookups.

Raymond

On 29.12.2011 03:09, Raymond Hettinger wrote:
My five cents on the topic: I totally concur with Raymond. He, Tim and all the others did a fantastic job with the dict implementation and optimization. We shouldn't overreact and mess with the current hashing and dict code just because a well-known and old attack vector pops up again. The dict code is far too crucial for Python's overall performance. However, the issue should be documented in our docs.

I've been dealing with web stuff and security for almost a decade and I've seen far worse attack vectors. This one can easily be solved with a couple of lines of Python code. For example, application developers can limit the maximum number of POST parameters to a sensible amount and limit the length of each key, too. The issue is less severe on platforms with 64-bit hashes, so it won't affect most people. I think only 32-bit Unix and Windows in general (32-bit long) are in trouble.

CPython could aid developers with a special subclass of dict. The crucial lookup function is already overridable per dict instance and on subclasses of dict through PyDictObject's struct member PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash). For example, a specialized subclass could limit the search for a free slot to n recursions, or choose to ignore the hash argument and calculate its own hash of the key.

Christian
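The couple-of-lines application-level mitigation described above might look like this in a hypothetical form-parsing helper (names and limits are illustrative, not from any framework):

```python
MAX_FIELDS = 1000      # sensible upper bound for form parameters
MAX_KEY_LEN = 256      # and for each key's length

def parse_post_fields(body):
    """Parse an 'a=1&b=2' style body, refusing degenerate requests
    before any dict insertions happen."""
    pairs = body.split("&")
    if len(pairs) > MAX_FIELDS:
        raise ValueError("too many form fields")
    fields = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        if len(key) > MAX_KEY_LEN:
            raise ValueError("form field name too long")
        fields[key] = value
    return fields
```

The checks run in linear time on the raw body, so the worst-case dict behavior is bounded before the attacker-controlled keys ever reach a hash table.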

On 29.12.2011 12:10, Antoine Pitrou wrote:
Web frameworks like Django or CherryPy can be considered applications from the CPython core's point of view. ;) You are right, the term "framework" is a better word.
Yeah! I was thinking about the same when I wrote "calculate its own hash" but I was too sloppy to carry on my argument. Please take 3am as my excuse.

Raymond Hettinger wrote:
Tim Peters' analysis applies mainly to ints, which would be unchanged. A change to the hash function for strings would make no difference to the performance of the dict, as the ordering of the hash values is already quite different from the ordering of the strings for any string of more than 3 characters.
The docs clearly state that the ordering of iteration over dicts is arbitrary. Perhaps changing it once in a while might be a good thing :) Cheers, Mark.

On Thu, 29 Dec 2011 11:25:03 +0000 Mark Shannon <mark@hotpy.org> wrote:
We already change it once in a while. http://twistedmatrix.com/trac/ticket/5352 ;) Regards Antoine.

Michael Foord wrote:
The attack relies on being able to predict the hash value for a given string. Randomising the string hash function is quite straightforward. There is no need to change the dictionary code. A possible (*untested*) patch is attached. I'll leave it for those more familiar with unicodeobject.c to do properly. Cheers, Mark
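Mark's patch randomizes the C-level string hash in unicodeobject.c; the idea can be sketched at the Python level. This is not his patch, only an illustration: the multiplier 1000003 is CPython's historic string-hash constant, and folding in the seed this particular way is a simplification:

```python
import os

# Per-process seed chosen once at interpreter start (a sketch of the idea in
# Mark's C patch, not the patch itself).
_SEED = int.from_bytes(os.urandom(8), "little")

def randomized_hash(s):
    """String hash in the style of CPython's classic algorithm, mixed with a
    per-process random seed so collisions cannot be precomputed offline."""
    h = _SEED
    for ch in s:
        h = ((h * 1000003) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h ^ len(s)
```

Within one process the hash is stable, so dicts keep working; across processes (and across attacker guesses) the values differ.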

Mark Shannon wrote:
The paper mentions that several web frameworks work around this by limiting the number of parameters per GET/POST/HEAD request. This sounds like a better alternative than randomizing the hash function of strings.

Uncontrollable randomization has issues when you work with multi-process setups, since the processes would each use different hash values for identical strings. Putting the base_hash value under application control could solve this problem, making sure that all processes use the same random base value.

BTW: Since your randomization trick uses the current time, it would also be rather easy to tune an attack to find the currently used base_hash. To make this safe, you'd have to use a more random source for initializing the base_hash.

Note that the same hash collision attack can be used for other key types as well, e.g. integers (where it's very easy to find hash collisions), so this kind of randomization would have to be applied to other basic types too. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 29 2011)
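Marc-Andre's point about integer keys is easy to demonstrate: CPython reduces int hashes modulo a fixed value (exposed since Python 3.2 as sys.hash_info.modulus, 2**61 - 1 on 64-bit builds), so colliding integers can be generated mechanically:

```python
import sys

# Integers that differ by a multiple of the hash modulus all hash alike,
# so an attacker can produce arbitrarily many colliding dict keys.
M = sys.hash_info.modulus
colliding = [1 + k * M for k in range(5)]
hashes = {hash(n) for n in colliding}
```

All five integers are distinct objects, yet every one lands in the same hash bucket.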

Am 29.12.2011 12:13, schrieb Mark Shannon:
I'm worried that hash randomization of str is going to break 3rd party software that relies on a stable hash across multiple Python instances. Persistence layers like ZODB and cross-interpreter communication channels used by multiprocessing may (!) rely on the fact that the hash of a string is fixed. Perhaps the dict code is a better place for randomization. The code in lookdict() and lookdict_unicode() could add a value to the hash. My approach is less intrusive and also closes the attack vector for all possible objects including str, bytes, int and so on. I also like Armin's idea of an optional hash randomization. Christian

On Thu, Dec 29, 2011 at 8:19 AM, Christian Heimes <lists@cheimes.de> wrote:
ZODB does not rely on a fixed hash function for strings; for any application to rely on a stable hash would cause problems when updating Python versions. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> "A person who won't read has no advantage over one who can't read." --Samuel Langhorne Clemens

Le 29/12/2011 14:19, Christian Heimes a écrit :
Perhaps the dict code is a better place for randomization.
The problem is the creation of a dict whose keys all have the same hash value. The current implementation of dict uses open addressing: colliding entries share one probe sequence. Adding a new item then requires comparing the new key with all existing colliding keys (comparing the values, not the hashes, which is much slower). We would have to completely change how dict is implemented to fix this issue, and I don't think that we can change the dict implementation without breaking backward compatibility or breaking applications. Changing the implementation would change dict properties, and applications rely on the properties of the current implementation. Tell me if I am wrong. Victor
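The quadratic behaviour Victor describes can be reproduced without any knowledge of the string hash, using a key type whose instances all hash alike (the class name and constants are illustrative):

```python
import time

class Collider:
    """Every instance hashes to the same value, forcing worst-case probing."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42  # all keys land on the same probe sequence
    def __eq__(self, other):
        return isinstance(other, Collider) and self.n == other.n

def fill_time(count):
    """Seconds needed to build a dict of `count` mutually colliding keys."""
    start = time.perf_counter()
    d = {Collider(i): i for i in range(count)}
    assert len(d) == count
    return time.perf_counter() - start

# Each insert is compared against every colliding key already present, so
# doubling the key count roughly quadruples the fill time.
```

With real (distinct) hashes the same fill is linear; the slowdown comes purely from the equality comparisons along the probe chain.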

Am 31.12.2011 03:31, schrieb Victor Stinner:
You are right and I was wrong. We can't do the randomization inside the dict code. The randomization factor must be used as an initialization vector (IV) for the hashing algorithm. At first I thought the attack exploited a unique property of Python's dict implementation (the perturbed probing instead of buckets for equal hashes), but I was wrong. It took me several hours of reading and digging into the math until I figured out my mistake. Sorry! :) This means we can't implement a salted dict unless the salted dict re-implements the hash algorithm for unicode, bytes and memoryview, and I doubt that's a wise idea. I've checked in my first draft of a possible solution: http://hg.python.org/features/randomhash/ . The pseudo RNG has to be replaced with something better, and it's missing an option to feed in a seed, too. Christian

On 1/3/2012 5:02 PM, Bill Janssen wrote:
The doc for id() now says "This is an integer which is guaranteed to be unique and constant for this object during its lifetime." Since the default 3.2.2 hash on my win7 64-bit CPython is the id address // 16, hash() can no longer carry such a guarantee. I suggest that the hash() doc say something similar: http://bugs.python.org/issue13707 -- Terry Jan Reedy

On Wed, Jan 4, 2012 at 12:02 AM, Bill Janssen <janssen@parc.com> wrote:
A lot of software will have their tests break, because dict ordering would depend on the particular run. I know, because some of them break on PyPy, which has a different dict ordering. This is probably a good thing in general, but is it really worth it? People will install python 2.6.newest and stuff *will* break. Is it *really* a security issue? We knew all along that dicts are O(n^2) in the worst case; how is this suddenly a security problem? Cheers, fijal

On Wed, Jan 04, 2012 at 11:55:13AM +0100, Antoine Pitrou wrote:
I don't think that's news either. http://mail.python.org/pipermail/python-dev/2003-May/035907.html and http://twistedmatrix.com/pipermail/twisted-python/2003-June/004339.html for instance show that in 2003 it was clearly known to at least be likely to be an exploitable DoS in common code (a dict of HTTP headers or HTTP form keys). There was debate about whether it's the language's responsibility to mitigate the problem or if apps should use safer designs for handling untrusted input (e.g. limit the number of keys input is allowed to create, or use something other than dicts), and debate about just how practical an effective exploit would be. But I think it was understood to be a real concern 8 years ago, so not exactly sudden. Just because it's old news doesn't make it not a security problem, of course. -Andrew.

On Thu, 5 Jan 2012 15:26:27 +1100 Andrew Bennetts <andrew@bemusement.org> wrote:
That's not news indeed, but that doesn't make it less of a problem, especially now that the issue has been widely publicized through a conference and announcements on several widely-read Web sites. That said, only doing the security fix in 3.3 would have the nice side effect of pushing people towards Python 3, so perhaps I'm for it after all. Half-jokingly, Antoine.

On Thu, Jan 5, 2012 at 3:39 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Just to make things clear - the stdlib itself has 1/64 of its tests relying on dict order. Changing dict order in *older* Pythons will break everyone's tests and some people's code. Making this new 2.6.x release would mean that people using the new Python 2.6 would have to upgrade an unspecified number of their Python packages, which does not sound very cool. Also consider that the new 2.6.x would go as a security fix to old Ubuntu, but all the other packages won't, because they'll not contain security fixes. Just so you know. Cheers, fijal

On 1/5/2012 9:34 AM, Maciej Fijalkowski wrote:
Why should CPython be constrained by broken policies of Ubuntu? If the other packages must be fixed so they work correctly with a security fix in Python, then they should be considered as containing a security fix. If they aren't, that is a broken policy.

On the other hand, it is very true that the seductive convenience of dict (readily available, good performance in normal cases) has created the vulnerability, because its characteristics are a function of the data inserted; when it is used for data from unknown, possibly malicious sources, that is a bug in the program that uses dict, not in dict itself. So it seems to me that:

1) the security problem is not in CPython, but rather in web servers that use dict inappropriately.
2) changing CPython in a way that breaks code is not a security fix to CPython, but rather a gratuitous breakage of compatibility promises, wrapped in a security-fix lie.

The problem for CPython here can be summarized as follows:

a) it is being blamed for problems in web servers that are not problems in CPython.
b) perhaps the dict documentation is a bit too seductive, in not declaring that data from malicious sources could cause its performance to degrade significantly (any programmer who has taken a decent set of programming classes should understand that, but there are programmers who have not taken such classes).
c) CPython provides no other mapping data structure that rivals the performance and capabilities of dict, nor can such a data structure be written in pure Python, as the performance of dict comes not only from hashing, but also from being written in C.

The solutions could be:

A) push back on the blame: it is not a CPython problem.
B) perhaps add a warning to the documentation for naïve, untrained programmers.
C) consider adding an additional data structure to the language, and mention it in the B warning for versions 3.3+.
On the other hand, the web server vulnerability could be blamed on CPython in another way: identify vulnerable packages in the stdlib that are likely to be used during the parsing of user-supplied data. Ones that come to mind (Python 3.2) are urllib.parse (various parse* functions) and cgi (parse_multipart, FieldStorage); the package names are different in Python 2.x.

So fixing the vulnerable packages could be a sufficient response, rather than changing the hash function. How to fix? Each of those above allocates and returns a dict. Simply have each of them allocate and return a wrapped dict instead, which has the following behaviors: i) during __init__, create a local, random string; ii) for all key values, prepend that string before passing the key to the internal dict.

Changing these vulnerable packages rather than the hash function is a much more constrained change, and wouldn't create bugs in programs that erroneously depend on the current hash function, directly or indirectly.

This would not fix web servers that use their own parsing and storage mechanism for <FORM> fields, if they have also inappropriately used a dict as their storage mechanism for user-supplied data. However, a similar solution could be applied by the authors of those web servers, and would be a security fix to such packages, so it should be applied to Ubuntu, if available there, or to other systems with security-only fix acceptance.

This solution does not require changes to the hash, does not require a cryptographically secure hash, and does not require code to be added to the initialization of Python before normal objects and mappings can be created. If a port doesn't contain a good random number generator, a weak one can be substituted, but such decisions can be made in Python code after the interpreter is initialized, when the stdlib is available.
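Glenn's wrapped dict can be sketched in a few lines; the class name is invented here, and only the operations needed to illustrate steps i) and ii) are shown:

```python
import os

class PrefixedDict:
    """Sketch of the proposal: a mapping that prepends a per-instance random
    string to every key before touching the real dict, so an attacker cannot
    precompute colliding keys for any particular instance."""
    def __init__(self):
        self._prefix = os.urandom(8).hex()  # step i): local random string
        self._d = {}
    def _k(self, key):
        return self._prefix + key           # step ii): prepend before use
    def __setitem__(self, key, value):
        self._d[self._k(key)] = value
    def __getitem__(self, key):
        return self._d[self._k(key)]
    def __contains__(self, key):
        return self._k(key) in self._d
    def __len__(self):
        return len(self._d)
```

Since each request would get a fresh instance with a fresh prefix, collisions crafted against one request are useless against the next.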

On 01/05/2012 02:14 PM, Glenn Linderman wrote:
1) the security problem is not in CPython, but rather in web servers that use dict inappropriately.
Most webapp vulnerabilities are due to their use of Python's cgi module, which uses a dict to hold the form / query string data supplied by untrusted external users. Tres. -- Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Am 05.01.2012 21:10, schrieb Ethan Furman:
You'd have to fix any Python core module that may handle data from untrusted sources. The issue isn't limited to web apps and POST requests. It's possible to trigger the DoS from JSON, a malicious PDF, JPEG's EXIF metadata or any other data. Oh, and somebody has to fix all 3rd party modules, too. Christian

On 1/5/2012 3:10 PM, Ethan Furman wrote:
Tres Seaver wrote:
I think both should be done. For web applications, it would be best to reject DoS attempts with 'random' keys in O(1) time rather than O(n) time, even with an improved hash. But in some other apps, like the Python interpreter itself, 'random' names may be quite normal. -- Terry Jan Reedy

On Fri, Jan 6, 2012 at 8:15 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Not a good idea - a lot of the 3rd party tests that depend on dict ordering are going to be using those modules anyway, so scattering our solution across half the standard library is needlessly creating additional work without really reducing the incompatibility problem. If we're going to change anything, it may as well be the string hashing algorithm itself. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1/5/2012 4:10 PM, Nick Coghlan wrote:
Thanks for the implementation, Serhiy. That is the sort of thing I had in mind, indeed.
Not a good idea - a lot of the 3rd party tests that depend on dict ordering are going to be using those modules anyway,
Stats? Didn't someone post a list of tests that fail when changing the hash? Oh, those were stdlib tests, not 3rd party tests. I'm not sure how to gather the stats, then, are you?
Half the standard library? No one has cared to augment my list of modules, but I have seen reference to JSON in addition to cgi and urllib.parse. I think there are more than 6 modules in the standard library...
If we're going to change anything, it may as well be the string hashing algorithm itself.
Changing the string hashing algorithm is known (or at least no one has argued otherwise) to be a source of backward incompatibility that will break programs. My proposal (and Serhiy's implementation, assuming it works, or can be easily tweaked to work, I haven't reviewed it in detail or attempted to test it) will only break programs that have vulnerabilities. I failed to mention one other benefit of my proposal: every web request would have a different random prefix, so attempting to gather info is futile: the next request has a different random prefix, so different strings would collide.
Cheers, Nick.
Indeed it is nice when we can be cheery even when arguing, for the most part :) I've enjoyed reading the discussions in this forum because most folks have respect for other people's opinions, even when they differ.

On Thu, 5 Jan 2012 19:34:13 +0200 Maciej Fijalkowski <fijall@gmail.com> wrote:
Breaking tests is not a problem: they are typically not run by production code, so people can take the time to fix them. Breaking other code is a problem if it is legitimate. Relying on dict ordering is totally wrong and I don't think we should care about such cases. The only issue is relying on hash() being stable across runs. But hashing already varies from build to build (32-bit vs. 64-bit) and I think that anyone seriously relying on it should already have been bitten.
How about 2.7? Do you think it should also remain untouched? I am ok for leaving 2.6 alone (that's Barry's call anyway) but 2.7 is another matter - should people migrate to 3.x to get the security fix? As for 3.2, it should certainly get the fix IMO. There are not many Python 3 legacy applications relying on hash() stability, I think.
Ubuntu can decide *not* to ship the fix if they prefer it like that. Their policies and decisions, though, should not taint ours. Regards Antoine.

On Thu, 2012-01-05 at 19:34 +0200, Maciej Fijalkowski wrote:
We have similar issues in RHEL, with Python versions going much further back (e.g. 2.3). When backporting the fix to ancient Python versions, I'm inclined to turn the change *off* by default, requiring it to be enabled via an environment variable: I want to avoid breaking existing code, even if such code is technically relying on non-guaranteed behavior. But we could potentially tweak mod_python/mod_wsgi so that it defaults to *on*. That way /usr/bin/python would default to the old behavior, but web apps would have some protection. Any such logic also suggests the need for an attribute in the sys module so that you can verify the behavior.

On 5 January 2012 19:33, David Malcolm <dmalcolm@redhat.com> wrote:
Uh, surely no-one is suggesting backporting to "ancient" versions? I couldn't find the statement quickly on the python.org website (so this is via google), but isn't it true that 2.6 is in security-only mode and 2.5 and earlier will never get the fix? Having a source-only release for 2.6 means the fix is "off by default" in the sense that you can choose not to build it. Or add a #ifdef to the source if it really matters. Personally, I find it hard to see this as a Python security hole, but I can sympathise with the idea that it would be nice to make dict "safer by default". (Although the benefit for me personally would be zero, so I'm reluctant for the change to have a detectable cost...) My feeling is that it should go into 2.7, 3.2, and 3.3+, but with no bells and whistles to switch it off or the like. If it's not suitable to go in on that basis, restrict it to 3.3+ (where it's certainly OK) and advise users of earlier versions to either upgrade or code defensively to avoid hitting the pathological case. Surely that sort of defensive code should be second nature to the people who might be affected by the issue? Paul.

On Jan 05, 2012, at 08:35 PM, Paul Moore wrote:
Correct, although there's no reason why a patch for versions older than 2.6 couldn't be included on a python.org security page for reference in CVE or other security notifications. Distros that care about versions older than Python 2.6 will basically be back-porting the patch anyway.
My feeling is that it should go into 2.7, 3.2, and 3.3+, but with no bells and whistles to switch it off or the like.
I like David Malcolm's suggestion, but I have no problem applying it to 3.3, enabled by default with no way to turn it off. The off-by-default on-switch policy for stable releases would be justified by maximum backward compatibility conservativeness. -Barry

On Thu, Jan 05, 2012 at 08:35:57PM +0000, Paul Moore wrote:
I think when dmalcolm says "backporting" he means that he'll have to backport the fix from modern, supported-by-python.org Python to the ancient Pythons that he's supporting as part of the Linux distributions where he's the Python package maintainer. I'm thinking he's mentioning it here mainly to see if anyone can point out a reason not to diverge from upstream in that manner.
I don't think that this would satisfy dmalcolm's needs. What he's talking about sounds more like a runtime switch (possibly only when initializing, though, not on-the-fly). -Toshio

On Jan 05, 2012, at 02:33 PM, David Malcolm wrote:
This sounds like a reasonable compromise for all stable Python releases. It can be turned on by default for Python 3.3. If you also make the default setting easy to change (i.e. parameterized in one place), then distros can make their own decision about the default, although I'd argue for the above default approach for Debian/Ubuntu.
Any such logic here also suggests the need for an attribute in the sys module so that you can verify the behavior.
That would be read-only though, right? -Barry

Am 05.01.2012 21:45, schrieb Barry Warsaw:
Hey Barry, stop stealing my ideas! :) I've argued for these default settings for days:

ver  delivery  randomized hashing
==================================
2.3  patch     disabled by default
2.4  patch     disabled
2.5  patch     disabled
2.6  release   disabled
2.7  release   disabled
3.0  ignore?   disabled
3.1  release   disabled
3.2  release   disabled
3.3  n/a yet   enabled by default

2.3 to 2.5 are still used in production (RHEL, Ubuntu LTS). Guido has stated that he needs a patch for 2.4, too. I think we may safely ignore Python 3.0; nobody should use Python 3.0 on a production system.

I've suggested the env var PYRANDOMHASH. It's easy to set env vars in Apache; for example, Debian/Ubuntu has /etc/apache2/envvars. Settings for PYRANDOMHASH:

PYRANDOMHASH=1              enable the randomized hashing function
PYRANDOMHASH=/path/to/seed  enable the randomized hashing function and read the seed from 'seed'
PYRANDOMHASH=0              disable the randomized hashing function

Since there isn't an easy way to set env vars in a shebang line (something like #!/usr/bin/env PYRANDOMHASH=1 python2.7 doesn't work), we could come up with a solution for the shebang case. IMHO the default setting should be a compile-time option. It's reasonably easy to extend the configure script to support --enable-randomhash / --disable-randomhash, and the MS VC build scripts can grow a flag, too.

I still think that the topic needs a PEP. A couple of days ago I started on a PEP, but Guido told me that he doesn't see the point in a PEP because he prefers a small and quick solution, so I stopped working on it. However, the arguments, worries and ideas in this enormous topic have repeated over and over. We know from experience that a PEP is a great way to explain the how, what and why of the change, as well as the paths we didn't take. Christian
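The semantics Christian proposes for PYRANDOMHASH could be interpreted like this; note this variable was only a proposal in this thread, and the function below is a sketch of the proposed behaviour, not shipped code:

```python
import os

def read_randomhash_setting(environ=None):
    """Interpret the proposed PYRANDOMHASH variable:
    '0' disables randomization, '1' enables it with a fresh random seed,
    and any other value is treated as a path to a seed file."""
    if environ is None:
        environ = os.environ
    value = environ.get("PYRANDOMHASH", "0")
    if value == "0":
        return ("disabled", None)
    if value == "1":
        return ("enabled", os.urandom(8))
    with open(value, "rb") as f:
        return ("enabled", f.read(8))
```

The seed-file form is what would let multiple processes (or hosts behind a load balancer) agree on one seed.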

On Jan 05, 2012, at 10:40 PM, Christian Heimes wrote:
Hey Barry, stop stealing my ideas! :) I've argued for these default settings for days.
:)
I've suggested the env var PYRANDOMHASH. It's easy to set env vars in Apache. For example Debian/Ubuntu has /etc/apache2/envvars.
For consistency, it really should be PYTHONSOMETHING. I personally don't care how long it is (e.g. PYTHONIOENCODING).
Why not PYTHONHASHSEED then?
We have precedence for mirroring startup options and envars, so it doesn't bother me to add such a switch to Python 3.3. It *does* bother me to add a switch to any stable release.
One way to look at it is to have a quick-and-dirty solution for stable releases. It could be suboptimal from a UI point of view because of backward compatibility issues. The PEP could then outline the boffo perfect solution for Python 3.3, with a section on how it will be backported to stable releases. Cheers, -Barry

2012/1/6 Barry Warsaw <barry@python.org>:
See my patch attached to issue #13703. I prepared the code to make it easy to set the hash seed (it has an LCG; its seed can be provided by the user directly). I agree that the value 0 should give the same behaviour as the current hash (disable the randomized hash). I will add the variable in the next version of my patch.

David Malcolm wrote:
Surely the way to verify the behaviour is to run this from the shell: python -c 'print(hash("abcde"))' twice, and see that the calls return different values. (Or have I misunderstood the way the fix is going to work?) In any case, I wouldn't want to rely on the presence of a flag in the sys module to verify the behaviour; I'd want to see for myself that hash collisions are no longer predictable. -- Steven

On Fri, Jan 6, 2012 at 10:07 AM, Steven D'Aprano <steve@pearwood.info> wrote:
More directly, you can just check that the hash of the empty string is non-zero. So -1 for a flag in the sys module - "hash('') != 0" should serve as a sufficient check whether or not process-level string hash randomisation is in effect. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Fri, Jan 6, 2012 at 10:59 AM, Benjamin Peterson <benjamin@python.org> wrote:
What exactly is the disadvantage of a sys attribute? That would seem preferable to an obscure incarnation like that.
Adding sys attributes in maintenance (or security) releases makes me nervous. However, Victor and Christian are right about the need for a special case to avoid leaking information, so my particular suggested check won't work. The most robust check would be to run sys.executable in a subprocess and check if it gives the same hash for a non-empty string as the current process. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
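Nick's "most robust check" is straightforward to write down; the function name here is invented, and the check assumes the child process is spawned without an inherited fixed seed:

```python
import subprocess
import sys

def hash_randomization_active(probe="hash-probe"):
    """Ask a fresh interpreter for its hash of a non-empty string and
    compare it with the current process.  A mismatch means per-process
    string hash randomization is in effect."""
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash(%r))" % probe])
    return int(out) != hash(probe)
```

Unlike checking hash('') (which is special-cased) or a sys flag (which may not exist on older builds), this directly observes the behaviour itself.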

Benjamin Peterson wrote:
There's nothing obscure about directly testing the hash. That's about as far from obscure as it is possible to get: you are directly testing the presence of a feature by testing the feature. Relying on a flag to tell you whether hashes are randomised adds additional complexity: now you need to care about whether hashes are randomised AND know that there is a flag you can look up and what it is called. And since the flag won't exist in all versions of Python, or even in all builds of a particular Python version, it isn't a matter of just testing the flag, but of doing the try...except or hasattr() dance to check whether it exists first. At some point, presuming that there is no speed penalty, the behaviour will surely become not just enabled by default but mandatory. Python has never promised that hashes must be predictable or consistent, so apart from backwards compatibility concerns for old versions, future versions of Python should make it mandatory. Presuming that there is no speed penalty, I'd argue in favour of making it mandatory for 3.3. Why do we need a flag for something that is going to be always on? -- Steven

Glenn Linderman wrote:
There *may* be a speed penalty, but I draw your attention to Paul McMillan's email on the 1st of January: "Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings." And Christian Heimes' email on the 3rd of January: "The changeset adds the murmur3 hash algorithm with some minor changes, for example more random seeds. At first I was worried that murmur might be slower than our old hash algorithm. But in fact it seems to be faster!" So I think it's a fairly safe bet that there will be a solution that is as fast as, or at worst trivially slower than, the current hash function. But of course, benchmarks will be needed. -- Steven

Using my patch (random-2.patch), the overhead is 0%. I cannot see a difference with and without my patch. Numbers:

unpatched:
== 3 characters ==
1 loops, best of 3: 459 usec per loop
== 10 characters ==
1 loops, best of 3: 575 usec per loop
== 500 characters ==
1 loops, best of 3: 1.36 msec per loop

patched:
== 3 characters ==
1 loops, best of 3: 458 usec per loop
== 10 characters ==
1 loops, best of 3: 575 usec per loop
== 500 characters ==
1 loops, best of 3: 1.36 msec per loop

(The patched version looks faster only because the timer is not reliable enough for such a fast test.)

Script (each command is run three times; take the smallest timing for each test):

echo "== 3 characters =="
./python -m timeit -n 1 -s 'text=(("%03i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
echo "== 10 characters =="
./python -m timeit -n 1 -s 'text=(("%010i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
echo "== 500 characters =="
./python -m timeit -n 1 -s 'text=(("%0500i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'

"-n 1" is needed because the hash value is only computed once (it is cached afterwards). It may be possible to get more reliable results by disabling the hash cache completely (comment out the "PyUnicode_HASH(self) = x;" line). Victor

Am 04.01.2012 08:59, schrieb Maciej Fijalkowski:
For example, Microsoft has released an extraordinary, unscheduled security patch for this issue between Christmas and New Year. I don't normally use MS as a reference, but this should give you a hint about the severity. Have you watched the talk yet? http://www.youtube.com/watch?v=R2Cq3CLI6H8 Christian

Hi, Just some extra thoughts about the whole topic in the light of web applications (since this was hinted at in the talk) running on Python: Yes, you can limit the maximum number of allowed parameters for POST data, but really there are so many places where data is parsed into hashing containers that it's quite a worthless task. Here is a very brief list of things usually parsed into a dict or set, and where it happens:

- URL parameters and url-encoded form data. Generally this happens somewhere in a framework, but typically also in utility libraries that deal with URLs. For instance the stdlib's cgi.parse_qs, or urllib.parse.parse_qs on Python 3, do just that, and that code is used left and right. Even if a framework were to start limiting its own URL parsing, there is still a lot of code that does not, and the stdlib is part of that. With form data it's worse, because you have multipart headers that need parsing, and that is usually abstracted away so far from the user that they never see it. Many frameworks just use the cgi module's parsing functions, which also feed directly into a dictionary.

- HTTP headers. There is zero a WSGI framework can do about that, since the headers are parsed into a dictionary by the WSGI server.

- Incoming JSON data. Again, outside of what the framework can do for the most part. simplejson can be modified to stop parsing via its hooks, but nobody does that, and since users invoke simplejson's parsing routines themselves, most webapps would still be vulnerable even if all frameworks fixed the problem.

- Hidden dict parameters. Things like the parameter part of the content-type or content-disposition headers are generally also just parsed into a dictionary. Likewise, many frameworks parse things into sets (for instance incoming etags), and the cookie header is usually parsed into a dictionary as well.
The issue is nothing new, and at least my current POV on this topic was that your server should be guarded and should shoot handlers of requests that go rogue. Dictionaries are not the only thing with worst-case performance that can be triggered by user input. That said, considering that there are so many different places where close-to-arbitrarily-long input is parsed into a dictionary or other hashing structure, it's hard for a web application developer or framework to protect itself. In case the watchdog is not a viable solution, as I had assumed it was, I think it's more reasonable to indeed consider adding a flag to Python that optionally enables hash randomization at startup. However, as was said earlier, the attack is so much more complex to carry out in a 64-bit environment that it's probably (as it stands right now!) safe to ignore. The main problem is not that it's a new attack, but that some dickheads could now launch prebaked attacks against websites to disrupt them, which might cause some negative publicity. In general, though, there are so many more ways to DDoS a website than this that I would rate the whole issue very low. Regards, Armin
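Armin's first bullet is easy to see in the stdlib itself: urllib.parse.parse_qs (cgi.parse_qs on Python 2) builds a plain dict whose keys are chosen entirely by the client:

```python
from urllib.parse import parse_qs

# A query string goes straight into a dict keyed by attacker-controlled
# strings -- exactly the pattern described above.
query = "a=1&b=2&a=3"
params = parse_qs(query)
```

Nothing in this code path bounds the number of keys or their hash distribution; whoever sends the request decides both.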

Hi, Something I should add now that I've thought about it a bit more: assuming this should be fixed at the language level, the solution would probably be to salt hashes. The most common hash to salt here is the PyUnicode hash, for obvious reasons.

Option a: compiled-in salt
+ Easy to implement
- Breaks unit tests, most likely (those were broken in the first place, but that's still a very annoying change to make)
- Might cause problems with interoperability of Pythons compiled with different hash salts
- You're not really solving the problem, because each Linux distribution (besides Gentoo, I guess) would have just one salt compiled in, and that salt would be popular enough to have the same issue

Option b: environment variable for the salt
+ Easy-ish to implement
+ Easy to synchronize over different machines
- Initialization of base types happens early and unpredictably, which makes it hard for embedded Python interpreters (think mod_wsgi and other things) to specify the salt

Option c: random salt at runtime
+ Easy to implement
- Impossible to synchronize
- Breaks unit tests in the same way a compiled-in salt would

Where to add the salt? Unicode strings and bytestrings (bytes objects), I guess, since those are the most common offenders. Sometimes tuples are keys of dictionaries, but in that case a contributing factor to the hash is the string in the tuple anyway.

Also related: since this is a security-related issue, would this be something that goes into Python 2? Does that affect how a fix would look? Regards, Armin

Hi, how about:

Option d: Host-based salt
  + Easy-ish to implement -- how about basing it on the hostname, for example?
  + Transparent for all processes on the same host
  - Probable unit test breakage

In fact, we could use host-based as the default, with the option to specify your own, which would solve the sync problems. That said, I agree with Armin that fixing this in the frameworks isn't an option. Regards, Hynek

On Thursday, December 29, 2011 at 13:57, Armin Ronacher wrote:

+1 to option d (host-based salt), but we would need to consistently order the hostnames/addresses to guarantee that all processes on the same machine get the same salt by default.

+1 to option b (environment variable) as an override, and/or maybe an override on the command line.

+1 to implementing the salt in the dictionary hash as an additive value.

+0 to exposing the salt as a constant (3.3+ only) - or alternatively expose a hash function that takes an existing hash and returns the salted hash. That would make it very easy for anything that wanted a salted hash to get one.

For choosing the default salt, I think something like:

a. If IPv6 is enabled, take the link-local address of the interface with the default route. It is pretty much guaranteed not to change, can't be determined externally (salting doesn't need a secret, but it doesn't hurt), and is a large number, so probably a good salt. (If it is likely to change, a salt override should be used instead.) Don't use any other IPv6 address. In particular, never use a "temporary" IPv6 address like Windows assigns - multiprocessing could end up with instances using different salts.

b. Otherwise, take the FQDN of the machine.

Tim Delaney
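A sketch of how a host-based default salt could be derived from the FQDN fallback. The function name, the MD5 mix-down and the 64-bit width are illustrative assumptions, not part of the proposal; the point is only that every process on one host deterministically agrees on the value.

```python
import hashlib
import socket
import struct

def host_salt():
    """Derive a stable per-host salt from the machine's FQDN (a sketch)."""
    fqdn = socket.getfqdn().lower().encode("utf-8")
    # Any stable mix-down works; MD5 is used here purely to spread the
    # name's bits, not for cryptographic strength.
    digest = hashlib.md5(fqdn).digest()
    return struct.unpack("<Q", digest[:8])[0]
```

Two processes on the same host get the same salt; two different hosts (with different FQDNs) almost certainly do not, which is the "transparent for all processes on the same host" property.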

It's worth pointing out that if the salt is somehow exposed to an attacker, or is guessable, much of the benefit goes away. It's likely that a timing attack could be used to discover the salt if it is fixed per machine or process over a long period of time. If a salt is generally fixed per machine, but varies from machine-to-machine, I think we'll see an influx of frustrated devs who have something that works perfectly on their machine but not for others. It doesn't matter that they're doing it wrong, we'll still have to deal with them as a community. This seems like an argument in favor of randomizing it at runtime by default, so it fails early for them. Allowing an environment and command line override makes sense, as it allows users to rotate the salt as frequently as their needs dictate. -Paul

On Thu, 29 Dec 2011 13:57:07 +0100 Armin Ronacher <armin.ronacher@active-4.com> wrote:
This option would have my preference. I don't think hash() was ever meant to be "synchronizable". Already using a 32-bit Python will give you different results from a 64-bit Python. As for breaking unittests, these tests were broken in the first place. hash() does change from time to time.
Or it could be a process-wide constant for all dicts. If the constant is additive as proposed by Mark the impact should be negligible. (but the randomness must be good enough) Regards Antoine.

On 29.12.2011 13:57, Armin Ronacher wrote:
Option d: Don't change __hash__, but use a randomized hash only for the PyDictEntry lookup
  + Easy to implement
  - Breaks only software that relies on a fixed order of dict keys
  - Breaks only a few unit tests, or none

IMHO we don't have to alter the outcome of hash("some string"), hash(1) and all other related types. We just need to reduce the chance that an attacker can produce collisions in the dict (and set?) code that looks up the slot (PyDictEntry). How about adding the random value in Objects/dictobject.c:lookdict() and lookdict_string() (Python 2.x) / lookdict_unicode() (Python 3.x)? With this approach the hashes of all our objects stay the same and just the dict code needs to be altered. The approach also has the benefit that all possible objects gain a randomized hash.
IMHO it does affect the fix. A changed and randomized hash function may break software that relies on a stable hash() function. Christian
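A toy model of this "salt only in the lookup" idea: hash(key) is left completely untouched, and a process-wide random salt enters only when the table computes slot indices. This is a simplified open-addressing table for illustration, not the actual lookdict() patch; names and sizes are invented.

```python
import random

# Process-wide salt, fixed for the lifetime of the interpreter (model).
_SALT = random.getrandbits(64)

class SaltedTable:
    """Open-addressing table that salts slot computation, not hash()."""

    def __init__(self, size=8):          # size must be a power of two
        self.mask = size - 1
        self.slots = [None] * size       # each slot: (key, value) or None

    def _probe(self, key):
        # The salt is mixed in here, internally; hash(key) stays public
        # and unchanged for everyone else.
        h = (hash(key) + _SALT) & 0xFFFFFFFFFFFFFFFF
        i = h & self.mask
        perturb = h
        while True:
            yield i
            i = (5 * i + 1 + perturb) & self.mask
            perturb >>= 5

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                raise KeyError(key)
            if self.slots[i][0] == key:
                return self.slots[i][1]
```

An attacker who precomputed keys with colliding hash() values would still collide on the first slot here, but the salted probe order is no longer predictable from the public hash alone.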

On Thu, 29 Dec 2011 14:32:21 +0100 Christian Heimes <lists@cheimes.de> wrote:
I basically agree with your proposal. The only downside is that custom hashed containers (such as _pickle.c's memotable) don't automatically benefit. That said, I think it would be difficult to craft an attack against the aforementioned memotable (you would have to basically choose the addresses of pickled objects). Regards Antoine.

On Thu, Dec 29, 2011 at 8:32 AM, Christian Heimes <lists@cheimes.de> wrote:
I don't understand how that helps a collision attack. If you can still generate two strings with the same (pre-randomized) hash, what difference does it make that the dict adds a random number? The post-randomized number will still be the same, no? Or does this attack just rely on the hash *remainders* being the same? If so, I can see how hashing the hash would help. But since the attacker doesn't know the modulus, and it can change as the dictionary grows, I would expect the attack to require matching hashes, not just matching hash remainders... unless I'm just completely off base here.

On 29.12.2011 21:07, PJ Eby wrote:
The attack doesn't need perfect collisions. The attacker calculates strings such that their hashes result in as many collisions as possible in the dict code. An attacker succeeds when the initial slot for a hash is filled, and as many subsequent slots of the perturbed, masked hash as possible, too. An attacker can also easily predict the size, and therefore the mask, for the hash remainder: a POST request parser usually starts with an empty dict, and the growth rate of Python's dicts is well documented. The changing mask makes the attack just a tiny bit more challenging. The hash randomization idea adds a salt to throw the attacker off course. Instead of

    position = hash & mask

it's now

    hash = salt + hash
    position = hash & mask

where salt is a random, process-global value that is fixed for the lifetime of the program. The salt also affects the perturbation during the search for new slots. As you already stated, this salt won't be effective against full hash collisions. The attack needs A LOT of problematic strings to become an issue, possibly hundreds of thousands or even millions of keys in a very large POST request. In reality an attacker won't reach the full theoretical O(n^2) performance degradation of a hash table, but even O(n) instead of O(1) for a million keys in each request has some impact on your servers' CPUs. Some vendors have limited POST requests to 1 MB or the number of keys to 10,000 to work around the issue. One paper also states that attacks on Python with 64-bit hashes are just academic for now. Christian

On 12/29/2011 4:31 PM, Christian Heimes wrote:
As I understood the talk (actually, the bit of Perl interpreter C code shown), the randomization is to change hash(s) to hash(salt+s) so that the salt is completely mixed into the hash from the beginning, rather than just tacked on at the end. -- Terry Jan Reedy

On 29.12.2011 23:28, Terry Reedy wrote:
Yes, the Perl and Ruby code uses a random seed as IV for hash generation. It's the best way to create randomized hashes but it might not be a feasible fix for Python 2.x. I'm worried that it might break applications that rely on stable hash values.

The talk was a presentation yesterday by Alexander Klink and Julian Wälde at the Chaos Communication Congress in Germany (contact: hashDoS@alech.de). I read the non-technical summary at http://arstechnica.com/business/news/2011/12/huge-portions-of-web-vulnerable... and watched the video of the talk at https://www.youtube.com/watch?feature=player_embedded&v=_EEhviEO1Vo# My summary: hash table creation with N keys changes from amortized O(N) to O(N**2) time if the hash values of all the keys are the same. This should only happen for large N if done intentionally. This is easy to accomplish with a linear multiply-and-add hash function, such as the one used in PHP4 (but nowhere else that the authors found). A nonlinear multiply-and-xor hash function, used in one form or another by everything else, is much harder to break. It is *theoretically* vulnerable to brute-force search, and this has been known for years. With a cleverer meet-in-the-middle strategy that builds a dict of suffixes and then searches for matching prefixes, 32-bit hashes are *practically* vulnerable. The attack depends on, for instance, 2**16 (64K) being 1/64K of 2**32. (I did not hear when this strategy was developed, but it is certainly more practical on a desktop now than even 8 years ago.) [64-bit hashes are much, much less vulnerable to attack, at least for now. So it seems to me that anyone who hashes potential attack data can avoid the problem by using 64-bit Python with 64-bit hash values. If I understood Antoine, that should be all 64-bit builds.] More summary: Perl added a #define option years ago to start the hash calculation with a non-zero value instead of 0, to "avoid algorithmic complexity attacks". The patch is at 47:20 in the video. The authors believe all should do similarly. [The change amounts to adding a char, unknown to attackers, to the beginning of every string before hashing. So it adds a small bit of time.
The code patch shown did not show the source of the non-zero seed or the timing and scope of any randomization. As the discussion here has shown, this is an important issue to applications. So 'do the same' is inadequate and over-simplified advice. I believe Armin's patch is similar to the Perl patch.] Since the authors sent out CERT alert about Nov 1, PHP has added to PHP5 a new function to limit the number of vars hashed. Microsoft will do something similar now with hash randomization to follow (maybe?). JRuby is going to do something. Java does not think it needs to change Java itself, but will leave all to the frameworks. [The discussion here suggests that this is an inadequate response for 32-bit systems like Java since one person/group may not control all the pieces of a server system. However, a person or group can run all pieces on a system Python with an option turned on.] -- Terry Jan Reedy
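The meet-in-the-middle attack leans on a structural property of these multiply-and-xor hashes: the internal state is fully determined by the prefix, so once two equal-length strings collide, appending any common suffix preserves the collision, and a handful of colliding blocks can be concatenated into arbitrarily many colliding keys. A scaled-down model (a 14-bit state, invented here so collisions are findable instantly by brute force) makes this cheap to see:

```python
from collections import defaultdict
from itertools import product
import string

def toy_hash(s, mask=0x3FFF):
    """Scaled-down model of the multiply-xor string hash (14-bit state)."""
    if not s:
        return 0
    x = (ord(s[0]) << 7) & mask
    for ch in s:
        x = ((x * 1000003) ^ ord(ch)) & mask
    return (x ^ len(s)) & mask

def colliding_group():
    """Brute-force same-length strings with equal toy_hash.

    26**3 = 17576 inputs into 2**14 = 16384 slots: the pigeonhole
    principle guarantees at least one group of colliding strings.
    """
    buckets = defaultdict(list)
    for tup in product(string.ascii_lowercase, repeat=3):
        s = "".join(tup)
        buckets[toy_hash(s)].append(s)
    return max(buckets.values(), key=len)
```

With the real 32-bit state this brute force is far more expensive, which is where the suffix-dictionary meet-in-the-middle trick comes in; but the extension property demonstrated by the last assertion is identical.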

A flag will only be needed if the overhead of the fix is too high.
I suppose that there are still servers running 32-bit Python.
There are countermeasures for low level DDOS (ICMP ping flood, TCP syn flood, etc.). An application (or a firewall) cannot implement a countermeasure for this high level issue. It can only be fixed in Python directly (by changing the implementation of the dict type or of the hash function). Victor

On 29/12/2011 02:28, Michael Foord wrote:
A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages Python included:
http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_...
This PDF doesn't explain exactly what the problem is or how it can be solved, so let's try to summarize this "vulnerability". The creation of a Python dictionary has a complexity of O(n) in most cases, but O(n^2) in the *worst* case, and the attack tries to force the worst case. It requires computing a set of N keys having the same hash value (hash(key1) == hash(key2) == ... == hash(keyN)), and this set only has to be computed once. It looks like it is now cheap enough in practice to compute this dataset for Python (and other languages). A countermeasure would be to check that we don't have more than X keys with the same hash value, but in practice we don't know in advance how data is processed, and there are too many input vectors in various formats. If we want to fix something, it should be done in the implementation of the dict type or in the hash algorithm. We could implement dict differently to avoid this issue, using a binary tree for example. But because dict is a fundamental type in Python, I don't think we can change its implementation (without breaking backward compatibility and thus applications in production). A possibility would be to add a *new* type, but all libraries and applications would need to be changed to fix the vulnerability. The last choice is to change the hash algorithm. The *idea* is the same as adding salt to hashed passwords (in practice it will be a little bit different): if a pseudo-random salt is added, the attacker cannot prepare a single dataset; he/she will have to regenerate a new dataset for each possible salt value. If the salt is big enough (in bits), the attacker will need too much CPU to generate each dataset (computing N keys with the same hash value). Basically, it slows down the attack by a factor of 2^(size of the salt).
-- Another possibility would be to replace our fast hash function by a better hash function like MD5 or SHA1 (so the creation of the dataset would be too slow in practice = too expensive), but cryptographic hash functions are much slower (and so would slow down Python too much). Limiting the size of the POST data doesn't solve the problem because there are many other input vectors and data formats. It may block the most simple attacks because the attacker cannot inject enough keys to slow down your CPU. Victor
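The O(n^2) worst case Victor describes is easy to demonstrate without knowing anything about the string hash, by using a key type with a constant hash as a stand-in for attacker-computed colliding strings:

```python
import time

class CollidingKey:
    """Key whose hash is constant: every insertion collides (worst case)."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42                     # all keys land in the same slot chain
    def __eq__(self, other):
        return self.n == other.n      # forces a compare per probed slot

def build_time(n, make_key):
    """Seconds to build a dict of n distinct keys."""
    start = time.perf_counter()
    {make_key(i): None for i in range(n)}
    return time.perf_counter() - start
```

On a typical machine, build_time(2000, CollidingKey) is orders of magnitude slower than build_time(2000, int), and doubling n roughly quadruples the colliding case, which is the quadratic degradation the attack exploits.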

On 31.12.2011 03:22, Victor Stinner wrote:
Correct. The meet-in-the-middle attack and the particular properties of algorithms similar to DJBX33A and DJBX33X make the attack easy on platforms with a 32-bit hash.
A BTree is too slow for common operations: it's O(log n) instead of O(1) on average, so we can't replace our dict with a btree type, and a new btree type would be a lot of work, too. The unique structure of CPython's dict implementation also makes it harder to count the values with equal hashes. The academic hash map (the one I learnt about at university) uses a bucket to store all elements with an equal hash (more precisely: hash mod mask). Python's dict, however, perturbs the hash until it finds a free slot in its array, so the second, third ... collision can be caused by a legit and completely different (!) hash.
That's the idea of randomized hash functions as implemented by Ruby 1.9, Perl and others. The random seed is used as the IV, and multiple rounds of multiply, XOR and MOD (integer overflows) diffuse it through the hash. In your other posting you were worried about the performance implications: a randomized hash function just adds a single ADD operation, that's all. Downside: with randomization, all hashes are unpredictable and change after every restart of the interpreter. This has some subtle side effects, like a different ordering of {a:1, b:1, c:1}.keys() after a restart of the interpreter.
I agree with your analysis. Cryptographic hash functions are far too slow for our use case. During my research I found another hash function that claims to be fast and that may not be vulnerable to this kind of attack: http://isthe.com/chongo/tech/comp/fnv/ Christian
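For reference, FNV is short enough to sketch directly. The constants below are the published 32-bit FNV-1a parameters from the page Christian links; this is a plain transcription for illustration, not a proposed CPython patch (CPython's string hash also mixes in the length and a shifted first character).

```python
# 32-bit FNV-1a parameters from the published FNV reference.
FNV32_PRIME = 0x01000193
FNV32_OFFSET = 0x811C9DC5

def fnv1a_32(data: bytes) -> int:
    """FNV-1a over a byte string, masked to 32 bits."""
    h = FNV32_OFFSET
    for byte in data:
        h ^= byte                              # FNV-1a xors first ...
        h = (h * FNV32_PRIME) & 0xFFFFFFFF     # ... then multiplies
    return h
```

The structure is the same multiply-and-xor shape as the current hash, so by itself it does not remove the need for a seed; the claimed advantage is in its distribution properties.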

Christian Heimes, 31.12.2011 04:59:
Wouldn't Bob Jenkins' "lookup3" hash function fit in here? After all, it's portable, known to provide a very good distribution for different string values and is generally fast on both 32 and 64 bit architectures. http://burtleburtle.net/bob/c/lookup3.c The analysis is here: http://burtleburtle.net/bob/hash/doobs.html It seems that there's also support for generating 64bit hash values (actually 2x32bits) efficiently. Admittedly, this may require some adaptation for the PEP393 unicode memory layout in order to produce identical hashes for all three representations if they represent the same content. So it's not a drop-in replacement. Stefan

On 07.01.2012 12:02, Stefan Behnel wrote:
This thread as well as the ticket is getting so long that people barely have a chance to catch up ... Guido has stated that he doesn't want a completely new hash algorithm for Python 2.x to 3.2. A new hash algorithm for 3.3 needs a PEP, too. I've done some experiments with FNV and Murmur3. With Murmur3 128bit I've seen some minor speed improvements on 64bit platforms. At first I was surprised but it makes sense. Murmur3 operates on uint32_t blocks while Python's hash algorithm iterates over 1 byte (bytes, ASCII), 2 bytes (USC2) or 4 bytes (USC4) types. Since most strings are either ASCII or UCS2, the inner loop of the current algorithm is more tight.
Is this condition required and implemented at the moment? Christian

On 1/7/2012 12:57 PM, Christian Heimes wrote:
On 07.01.2012 12:02, Stefan Behnel wrote:
If o1 == o2, then hash(o1) == hash(o2) is an unstated requirement implied by "They [hash values] are used to quickly compare dictionary keys during a dictionary lookup." since hash(o1) != hash(o2) is taken to mean o1 != o2 (whereas hash(o1) == hash(o2) is taken to mean o1 == o2 is possible but must be checked). Hashing should be a coarsening of == as an equivalence relationship. -- Terry Jan Reedy

Victor Stinner writes:
I don't know the implementation issues well enough to claim it is a solution, but this hasn't been mentioned before AFAICS: While the dictionary probe has to start with a hash for backward compatibility reasons, is there a reason the overflow strategy for insertion has to be buckets containing lists? How about double-hashing, etc?

On 31.12.2011 13:03, Stephen J. Turnbull wrote:
Python's dict implementation doesn't use buckets but open addressing (aka a closed hash table). The algorithm for conflict resolution doesn't use double hashing. Instead it takes the original (and in most cases cached) hash and perturbs it with a series of add, multiply and bit-shift operations.
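The conflict-resolution recurrence Christian describes is documented in the comments of Objects/dictobject.c. A minimal model (slot indices only, no table) shows why it is not a simple linear probe: the otherwise-unused high bits of the full hash are progressively shifted in, so two keys with equal masked hashes but different full hashes soon take different probe paths.

```python
PERTURB_SHIFT = 5  # the shift CPython uses per probe step

def probe_sequence(h, mask, count):
    """First `count` slot indices for hash h in a table of size mask+1.

    Models the recurrence from the dictobject.c comments:
        j = (5*j + 1 + perturb) & mask;  perturb >>= PERTURB_SHIFT
    """
    perturb = h & 0xFFFFFFFFFFFFFFFF
    j = h & mask
    seq = [j]
    for _ in range(count - 1):
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
        seq.append(j)
    return seq
```

Once perturb has shifted down to zero, the recurrence degenerates to j = 5*j + 1 mod 2**k, which cycles through every slot, guaranteeing the probe eventually finds a free entry.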

Christian Heimes writes:
In an attack, this is still O(collisions) per probe (as any scheme where the address of the nth collision is a function of only the hash), where double hashing should be "roughly" O(1) (with double the coefficient). But that evidently imposes too large a performance burden on non-evil users, so it's not worth thinking about whether "roughly O(1)" is close enough to O(1) to deter or exhaust attackers. I withdraw the suggestion.

On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull <stephen@xemacs.org>wrote:
This won't help, because the keys still have the same hash value. ANYTHING you do to them after they're generated will result in them still colliding. The *only* thing that works is to change the hash function in such a way that the strings end up with different hashes in the first place. Otherwise, you'll still end up with (deliberate) collisions. (Well, technically, you could use trees or some other O(log n) data structure as a fallback once you have too many collisions, for some value of "too many". Seems a bit wasteful for the purpose, though.)

I don't think that would be wasteful. You wouldn't just use the tree for the case of too many collisions, but for any collision. You might special-case the case of a single key, i.e. start using the tree only if there is a collision. The issue is not the effort, but the need to support ordering if you want to use trees. So you might restrict this to dicts that have only str keys (which in practice should never have any collision, unless it's a deliberate setup). I'd use the tagged-pointer trick to determine whether a key is an object pointer (tag 0) or an AVL tree (tag 1). So in the common case of interned strings, the comparison for pointer equality (which is the normal case if the keys are interned) will succeed quickly; if pointer comparison fails, check the tag bit. Regards, Martin


On Wed, Dec 28, 2011 at 19:51, Alex Gaynor <alex.gaynor@gmail.com> wrote:
A few thoughts on this:
a) This is not a new issue, I'm curious what the new interest is in it.
Well they (the presenters of the report) had to be accepted to that conference for *something*, otherwise we wouldn't know they exist.

On 29.12.2011 02:37, Jesse Noller wrote:
--- Python uses a hash function which is very similar to DJBX33X, which can be broken using a meet-in-the-middle attack. It operates on register size and is thus different for 64 and 32 bit machines. While generating multi-collisions efficiently is also possible for the 64 bit version of the function, the resulting colliding strings are too large to be relevant for anything more than an academic attack. Plone as the most prominent Python web framework accepts 1 MB of POST data, which it parses in about 7 minutes of CPU time in the worst case. This gives an attacker with about 20 kbit/s the possibility to keep one Core Duo core constantly busy. If the attacker is in the position to have a Gigabit line available, he can keep about 50.000 Core Duo cores busy. --- If I remember correctly, CPython uses the C long type for its hash values, so 64-bit Windows (where long is 32 bits) uses a 32-bit hash.

On Wed, Dec 28, 2011 at 5:37 PM, Jesse Noller <jnoller@gmail.com> wrote:
Discussion of hash functions in general: http://burtleburtle.net/bob/hash/doobs.html Two of the best hash functions that currently exist: http://code.google.com/p/cityhash/ and http://code.google.com/p/smhasher/wiki/MurmurHash. I'm not sure exactly what problem the paper is primarily complaining about: 1. Multiply+add and multiply+xor hashes are weak: this would be solved by changing to either of the better-and-faster hashes I linked to above. On the other hand: http://mail.python.org/pipermail/python-3000/2007-September/010327.html 2. It's possible to find collisions in any hash function in a 32-bit space: only solved by picking a varying seed at startup or compile time. If you decide to change to a stronger hash function overall, it might also be useful to change the advice "to somehow mix together (e.g. using exclusive or) the hash values for the components" in http://docs.python.org/py3k/reference/datamodel.html#object.__hash__. hash(tuple(components)) will likely be better if tuple's hash is improved. Hash functions are already unstable across Python versions. Making them unstable across interpreter processes (multiprocessing doesn't share dicts, right?) doesn't sound like a big additional problem. Users who want a distributed hash table will need to pull their own hash function out of hashlib or re-implement a non-cryptographic hash instead of using the built-in one, but they probably need to do that already to allow themselves to upgrade Python. Jeffrey

On Sat, Dec 31, 2011 at 4:04 PM, Jeffrey Yasskin <jyasskin@gmail.com> wrote:
Here's an idea. Suppose we add a sys.hash_seed or some such, that's settable to an int, and defaults to whatever we're using now. Then programs that want a fix can just set it to a random number, and on Python versions that support it, it takes effect. Everywhere else it's a silent no-op. Downside: sys has to have slots for this to work; does sys actually have slots? My memory's hazy on that. I guess actually it'd have to be sys.set_hash_seed(). But same basic idea. Anyway, this would make fixing the problem *possible*, while still pushing off the hard decisions to the app/framework developers. ;-) (Downside: every hash operation includes one extra memory access, but strings only compute their hash once anyway.) Given that changing dict won't help, and changing the default hash is a non-starter, an option to set the seed is probably the way to go. (Maybe with an environment variable and/or command line option so users can work around old code.)

ISTM the only reasonable thing is to have a random seed picked very early in the process, to be used to change the hash() function of str/bytes/unicode (in a way that they are still compatible with each other). The seed should be unique per process except it should survive fork() (but not exec()). I'm not worried about unrelated processes needing to have the same hash(), but I'm not against offering an env variable or command line flag to force the seed. I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash (as long as they can't force the seed -- this actually argues slightly *against* offering a way to force the seed, except that we have strong backwards compatibility requirements). We need to fix this as far back as Python 2.6, and it would be nice if a source patch was available that works on Python 2.5 -- personally I do have a need for a 2.5 fix and if nobody creates one I will probably end up backporting the fix from 2.6 to 2.5. Is there a tracker issue yet? The discussion should probably move there. PS. I would propose a specific fix but I can't seem to build a working CPython from the trunk on my laptop (OS X 10.6, Xcode 4.1). I get this error late in the build: ./python.exe -SE -m sysconfig --generate-posix-vars Fatal Python error: Py_Initialize: can't initialize sys standard streams Traceback (most recent call last): File "/Users/guido/cpython/Lib/io.py", line 60, in <module> make: *** [Lib/_sysconfigdata.py] Abort trap -- --Guido van Rossum (python.org/~guido)

On Sat, Dec 31, 2011 at 4:56 PM, Guido van Rossum <guido@python.org> wrote:
FWIW I managed to build Python 2.6, and a trivial mutation of the string/unicode hash function (add 1 initially) made only three tests fail; test_symtable and test_json both have a dependency on dictionary order, test_ctypes I can't quite figure out what's going on. Oh, and an unrelated failure in test_sqlite: File "/Users/guido/pythons/p26/Lib/sqlite3/test/types.py", line 355, in CheckSqlTimestamp self.failUnlessEqual(ts.year, now.year) AssertionError: 2012 != 2011 I betcha that's because it's still 2011 here in Texas but already 2012 in UTC-land. Happy New Year everyone! :-) -- --Guido van Rossum (python.org/~guido)

This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process. The goal isn't perfection, but we need to do better than a simple salt. I propose we modify the string hash function like this: https://gist.github.com/0a91e52efa74f61858b5 This code is based on PyPy's implementation, but the concept is universal. Rather than choosing a single short random seed per process, we generate a much larger random seed (r). As we hash, we deterministically choose a portion of that seed and incorporate it into the hash process. This modification is a minimally intrusive change to the existing hash function, and so should not introduce unexpected side effects which might come from switching to a different class of hash functions. I've worked through this code with Alex Gaynor, Antoine Pitrou, and Victor Stinner, and have asked several mathematicians and security experts to review the concept. The reviewers who have gotten back to me thus far have agreed that if the initial random seed is not flawed, this should not overly change the properties of the hash function, but should make it quite difficult for an attacker to deduce the necessary information to predictably cause hash collisions. This function is not designed to protect against timing attacks, but should be nontrivial to reverse even with access to timing data. Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings. This is probably an acceptable trade off, and actually provides better performance in the case of short strings than a high-entropy fixed-length seed prefix. -Paul
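A rough sketch of the structure Paul describes: a large random seed array r, one entry of which is deterministically selected and folded in at every step of the hash. The seed size, the index rule and the mixing below are assumptions for illustration; the referenced gist may differ in its details.

```python
import random

# Large random seed, generated once per process (size is an assumption).
_R = [random.getrandbits(32) for _ in range(256)]

def proposed_hash(s, r=None):
    """Simplified 32-bit model of the large-seed proposal (a sketch)."""
    if r is None:
        r = _R
    if not s:
        return 0
    x = (ord(s[0]) << 7) & 0xFFFFFFFF
    for ch in s:
        # The current state x selects which part of the seed is mixed in,
        # so every character influences which seed words get used.
        x = ((x * 1000003) ^ ord(ch) ^ r[x % len(r)]) & 0xFFFFFFFF
    return x ^ len(s)
```

Compared with a single short seed, an attacker must now recover a large amount of hidden state rather than brute-forcing one small integer, which is the property the proposal is after.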

On Sat, Dec 31, 2011 at 8:29 PM, Paul McMillan <paul@mcmillan.ws> wrote:
Still, it would represent an effort for the attacker of a much greater magnitude than the current attack. It's all a trade-off -- at some point it'll just be easier for the attacker to use some other vulnerability. Also the class of vulnerable servers would be greatly reduced.
The goal isn't perfection, but we need to do better than a simple salt.
Perhaps.
I'm not sure I understand this. What's the worry about "a different class of hash functions"? (It may be clear that I do not have a deep mathematical understanding of hash functions.)
I forget -- what do we do on systems without urandom()? (E.g. Windows?)
Let's worry about timing attacks another time okay?
Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed. But I like the idea of a bigger random seed from which we deterministically pick some part. How about just initializing x to some subsequence of the seed determined by e.g. the length of the hashed string plus a few characters from it? -- --Guido van Rossum (python.org/~guido)

I agree that doing anything is better than doing nothing. If we use the earlier suggestion and prepend everything with a fixed-length seed, we need quite a bit of entropy (and so a fairly long string) to make a lasting difference.
This was mostly in reference to earlier suggestions of switching to cityhash, or using btrees, or other more invasive changes. Python 2.X is pretty stable and making large changes like that to the codebase can have unpredictable effects. We know that the current hash function works well (except for this specific problem), so it seems like the best fix will be as minimal a modification as possible, to avoid introducing bugs.
I forget -- what do we do on systems without urandom()? (E.g. Windows?) Windows has CryptGenRandom which is approximately equivalent.
Let's worry about timing attacks another time okay? Agreed. As long as there isn't a gaping hole, I'm fine with that.
Hm. I'm not sure I like the idea of extra arithmetic for every character being hashed.
But I like the idea of a bigger random seed from which we deterministically pick some part.
Yeah. This makes it much harder to attack, since it very solidly places the attacker outside the realm of "just brute force the key".
We did consider this, and if performance is absolutely the prime directive, this (or a variant) may be the best option. Unfortunately, the collision generator doesn't necessarily vary the length of the string. Additionally, if we don't vary based on all the letters in the string, an attacker can fix the characters that we do use and generate colliding strings around them. Another option to consider would be to apply this change to some but not all of the rounds. If we added the seed lookup xor operation for only the first and last 5 values of x, we would still retain much of the benefit without adding much computational overhead for very long strings. We could also consider a less computationally expensive operation than the modulo for calculating the lookup index, like simply truncating to the correct number of bits. -Paul

On Sat, Dec 31, 2011 at 10:57 PM, Paul McMillan <paul@mcmillan.ws> wrote:
Ah, but the effect of that long string is summarized in a single (32- or 64-bit) integer.
Agreed.
Yup.
But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input. It is important that this function touches every
character - if it only interacts with a subset of them, an attacker can fix that subset and vary the rest.
I sort of see your point, but I still think that if we could add as little per-character overhead as possible it would be best -- sometimes people *do* hash very long strings.
Still, much more work for the attacker.
I like that.
Sure. Thanks for thinking about all the details here!! -- --Guido van Rossum (python.org/~guido)

Different concern. What if someone were to have code implementing an external, persistent hash table, using Python's hash() function? They might have a way to rehash everything when a new version of Python comes along, but they would not be happy if hash() is different in each process. I somehow vaguely remember possibly having seen such code, or something else where a bit of random data was needed and hash() was used since it's so easily available. -- --Guido van Rossum (python.org/~guido)

PS. Is the collision-generator used in the attack code open source? -- --Guido van Rossum (python.org/~guido)

On 1/1/2012 10:13 AM, Guido van Rossum wrote:
PS. Is the collision-generator used in the attack code open source?
As I posted before, Alexander Klink and Julian Wälde gave their project email as hashDoS@alech.de. Since they indicated disappointment in not hearing from Python, I presume they would welcome engagement. -- Terry Jan Reedy

Am 01.01.2012 19:45, schrieb Terry Reedy:
Somebody should contact Alexander and Julian to let them know that we are working on the matter. It should be somebody "official" for the initial contact, too. I've included Guido (BDFL), Barry (their initial security contact) and MvL (most prominent German core dev) in CC, as they are the logical choices for me. I'm willing to have a phone call with them once contact has been established. IMHO it's slightly easier to talk in one's native tongue -- Alexander and Julian are German, too. Christian

On 01/02/2012 04:47 PM, Christian Heimes wrote:
I wouldn't expect too much -- they seem rather keen on cheap laughs: http://twitter.com/#!/bk3n/status/152068096448921600/photo/1/large Georg

On Jan 02, 2012, at 06:38 PM, Georg Brandl wrote:
I wouldn't expect too much -- they seem rather keen on cheap laughs:
http://twitter.com/#!/bk3n/status/152068096448921600/photo/1/large
Heh, so yeah, it won't be me contacting them. -Barry

Am 01.01.2012 16:13, schrieb Guido van Rossum:
I had the same concern as you and was worried that projects like ZODB might require a stable hash function. Fred has already stated that ZODB doesn't use the hash in its btree structures. Possible solutions: * make it possible to provide the seed as an env var * disable randomization as the default setting, or at least add an option to disable it IMHO the issue needs a PEP that explains the issue, shows all possible solutions and describes how we have solved it. I'm willing to start a PEP. Who would like to be the co-author? Christian

I agree that there are use cases for allowing users to choose the random seed, in much the same way it's helpful to be able to set it for the random number generator. This should probably be something that can be passed in at runtime. This feature would also be useful for users who want to synchronize the hashes of multiple independent processes, for whatever reason. For the general case though, randomization should be on by default. -Paul

But how about precomputing the intermediate value (x)? The hash is (mostly) doing x = f(x, c) for each c in the input.
That's a fair point. If we go down that avenue, I think simply choosing a random fixed starting value for x is the correct choice, rather than computing an intermediate value.
Yep, agreed. My original proposal did not adequately address this.
I believe this is a reasonable solution. An attacker could still manipulate the internal state of long strings, but the additional information at both ends should make that difficult to exploit. I'll talk it over with the reviewers.
Again, I'll talk to the reviewers (and run the randomness test battery) to double-check that this doesn't affect the distribution in some unexpected way, but I think it should be fine.
PS. Is the collision-generator used in the attack code open source?
Not in working form, and they've turned down requests for it from other projects that want to check their work. If it's imperative that we have one, I can write one, but I'd rather not spend the effort if we don't need it. -Paul

Am 01.01.2012 06:57, schrieb Paul McMillan:
Your code at https://gist.github.com/0a91e52efa74f61858b5 reads about 2 MB (2**21 - 1 bytes) of data from urandom. I'm worried that this is going to exhaust the OS's entropy pool and suck it dry. We shouldn't forget that Python is used for long-running processes as well as short scripts. Your suggestion also increases the process size by 2 MB, which is going to be an issue for mobile and embedded platforms. How about this:

    r = [ord(i) for i in os.urandom(256)]
    rs = os.urandom(4)  # or 8 ?
    seed = ord(rs[-1]) + (ord(rs[-2]) << 8) + (ord(rs[-3]) << 16) + (ord(rs[-4]) << 24)

    def _hash_string(s):
        """The algorithm behind compute_hash() for a string or a unicode."""
        from pypy.rlib.rarithmetic import intmask
        length = len(s)
        if length == 0:
            return -1
        x = intmask(seed + (ord(s[0]) << 7))
        i = 0
        while i < length:
            o = ord(s[i])
            x = intmask((1000003 * x) ^ o ^ r[o % 0xff])
            i += 1
        x ^= length
        return intmask(x)

This combines a random seed for the hash with your suggestion. We also need to special-case short strings: the above routine hands the seed over to an attacker who is able to retrieve lots of single-character hashes. The randomization shouldn't be used if we can prove that it's not possible to create hash collisions for strings shorter than X. For example, 64-bit FNV-1 has no collisions for 8 chars or fewer, and 32-bit FNV has no collisions for 4 chars or fewer. Christian

I fixed a couple things in my proposed algorithm: https://gist.github.com/0a91e52efa74f61858b5 I had a typo, and used 21 instead of 12 for the size multiplier. We definitely don't need 2MB random data. The initialization of r was broken. Now it is an array of ints, so there's no conversion when it's used. I've adjusted it so there's 8k of random data, broken into 2048 ints. I added a length-based seed to the initial value of x. This prevents single characters from being used to enumerate raw values from r. This is similar to the change proposed by Christian Heimes. Most importantly, I moved the xor with r[x % len_r] down a line. Before, it wasn't being applied to the last character.
The updated code always includes at least 2 lookups from r, which I believe solves the single-character enumeration problem. If we special-case part of our hash function for short strings, we may get suboptimal collisions between the two types of hashes. I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function. For the record, cryptographically strong hash functions are in the neighborhood of 400% slower than our existing hash function.
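A rough reconstruction of the updated algorithm Paul describes (the real code lives in his gist; the word size, mixing constant, and function name here are assumptions made for illustration):

```python
import os
import struct

# 8 KB of random data, unpacked once into 2048 ints so no per-lookup
# conversion is needed. Assumed details; not Paul's actual gist code.
_len_r = 2048
_r = list(struct.unpack('<2048I', os.urandom(4 * _len_r)))

def randomized_hash(s):
    if not s:
        return 0
    # length-based seed: single characters alone can't enumerate raw _r values
    x = (len(s) << 7) ^ ord(s[0])
    for c in s:
        x = (1000003 * x) ^ ord(c)
        x ^= _r[x % _len_r]   # applied after the multiply, so the last
                              # character is covered too
        x &= (1 << 64) - 1
    return x ^ len(s)
```

Every non-empty string thus picks up at least two table lookups: one via the length-based seed mixed into the initial x, and one per character.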
I've been talking to them. They're happy to look at our proposed changes. They indicate that a non-zero start value is sufficient to prevent the attack, but D. J. Bernstein disagrees with them. He also has indicated a willingness to look at our solution. -Paul

On Sun, 1 Jan 2012 21:55:52 -0800 Paul McMillan <paul@mcmillan.ws> wrote:
Shouldn't it be r[i % len(r)] instead? (refer to yesterday's #python-dev discussion)
I think Ruby uses FNV-1 with a salt, making it less vulnerable to this. FNV is otherwise similar to our existing hash function.
Again, we could re-use FNV-1's primes, since they claim they have better dispersion properties than the average prime. Regards Antoine.

Am 02.01.2012 06:55, schrieb Paul McMillan:
I've pushed a new patch http://hg.python.org/features/randomhash/rev/0a65d2462e0c The changeset adds the murmur3 hash algorithm with some minor changes, for example more random seeds. At first I was worried that murmur might be slower than our old hash algorithm, but in fact it seems to be faster! Pybench, 10 rounds on my Core2 Duo 2.60: py3k: 3.230 sec randomhash: 3.182 sec Christian

On Mon, Jan 2, 2012 at 7:18 PM, Christian Heimes <lists@cheimes.de> wrote:
It seems that for the 32-bit version you are using the pid for the two constants. Also, it's unclear why you even need to use a random constant for the final pass; you already use a random constant as the initial h1, and that should be enough, no need to use one for k1. Same for 128-bit: k1, k2, k3, k4 should be initialized to zero, these are key data, they don't need to be mixed with anything. Also, I'm not sure how portable the always_inline attribute is: is it supported on all compilers and all platforms?

On Mon, Jan 2, 2012 at 10:53 PM, Alexey Borzenkov <snaury@gmail.com> wrote:
Sorry, sent too soon. What I mean is that you're initializing a pretty big array of values when you only need a 32-bit value. Pid, in my opinion, might be too predictable; it would be a lot better to simply hash pid and gettimeofday bytes to produce this single 32-bit value and use it for h1, h2, h3 and h4 in both 32-bit and 128-bit versions.

Le 01/01/2012 04:29, Paul McMillan a écrit :
If we want to protect a website against this attack for example, we must suppose that the attacker can inject arbitrary data and can get (indirectly) the result of hash(str) (e.g. with the representation of a dict in a traceback, with a JSON output, etc.).
The goal isn't perfection, but we need to do better than a simple salt.
I disagree. I don't want to break backward compatibility and have a hash() function that differs for each process if the change is not an effective protection against the hash vulnerability. It's really hard to write a good (secure) hash function: see for example the recent NIST competition (started in 2008, ending this year). Even good security researchers are unable to write a strong and fast hash function. It's easy to add a weakness to the function if you don't have a good background in cryptography. The NIST competition gives 4 years to analyze new hash functions. We should not rush to add a quick "hack" if it doesn't solve the problem correctly (protect against collision attacks and preimage attacks). http://en.wikipedia.org/wiki/NIST_hash_function_competition http://en.wikipedia.org/wiki/Collision_attack Runtime performance does matter; I'm not completely sure that changing Python is the best place to add a countermeasure against this vulnerability. I don't want to slow down numpy for a web vulnerability. Because there are different use cases, a better compromise may be to add a runtime option to use a secure hash function, and keep the unsafe but fast hash function by default.
I propose we modify the string hash function like this:
Always allocating 2**21 bytes just to work around one specific kind of attack is not acceptable. I suppose that the maximum acceptable is 4096 bytes (or better, 256 bytes). Cryptographic hash functions don't need random data, so why would Python need 2 MB (!) for its hash function? Victor

On Sat, Dec 31, 2011 at 9:11 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hm, you're right, that's no longer a concern. (Though ATM the hashes still *are* compatible.)
Merry hashes, weakrefs and thread-local memoryviews to everyone!
:-) -- --Guido van Rossum (python.org/~guido)

Am 01.01.2012 00:56, schrieb Guido van Rossum:
I've created a clone at http://hg.python.org/features/randomhash/ as a testbed. The code creates the seed very early in PyInitializeEx(). The method isn't called on fork() but on exec().
The speakers claim, and have shown, that it's too easy to pre-calculate collisions with hashing algorithms similar to DJBX33X / DJBX33A. It might be a good idea to change the hashing algorithm, too. Paul has listed some new algorithms. Ruby 1.9 is using FNV http://isthe.com/chongo/tech/comp/fnv/ which promises to be fast with a good dispersion pattern. A hashing algorithm without a meet-in-the-middle vulnerability would also reduce the pressure on a good and secure seed.
+1 Should the randomization be disabled on 2.5 to 3.2 by default to reduce backward compatibility issues? Christian

Le dimanche 01 janvier 2012 à 17:34 +0100, Christian Heimes a écrit :
I don't understand. FNV-1 multiplies the current running result with a prime and then xors it with the following byte. This is also what we do. (I'm assuming 1000003 is prime) I see two differences: - FNV starts with a non-zero constant offset basis - FNV uses a different prime than ours (as a side note, FNV operates on bytes, but for unicode we must operate on code points in [0, 1114111]: although arguably the common case is hashing ASCII substrings (protocol tokens etc.)) Regards Antoine.
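For comparison, FNV-1 (not FNV-1a) looks like this when sketched in Python; the two differences Antoine points out are visible: a non-zero offset basis and a different prime. The constants are the published 64-bit FNV-1 parameters.

```python
# Published 64-bit FNV-1 parameters
FNV1_64_PRIME = 0x100000001B3
FNV1_64_OFFSET = 0xCBF29CE484222325

def fnv1_64(data):
    h = FNV1_64_OFFSET            # non-zero start, unlike Python's hash loop
    for byte in bytearray(data):  # FNV operates on bytes
        h = (h * FNV1_64_PRIME) & 0xFFFFFFFFFFFFFFFF  # multiply first...
        h ^= byte                                     # ...then xor
    return h
```

Structurally this is indeed the same multiply-then-xor loop as CPython's string hash; only the start value and the prime differ.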

Am 01.01.2012 17:54, schrieb Antoine Pitrou:
There must be a major difference somewhere inside the algorithm. The talk at the CCC conference in Berlin mentions that Ruby 1.9 is not vulnerable to meet-in-the-middle attacks and Ruby 1.9 uses FNV. The C code of FNV is more complex than our code, too. Christian

Sufficient against their current attack. But will it last? For a long-running server, there must be plenty of ways information can leak that will help in guessing that start value. The alternative, providing a dict-like data structure for use with untrusted input, deserves consideration. Perhaps something simpler than a balanced tree would do? How about a dict-like class built on a lazily sorted list? Insertions basically just do list.append and set a dirty flag, and lookups use bisect, sorting first if the dirty flag is set. It wouldn't be a complete dict replacement by any means -- mixing insertions and lookups would have terrible performance -- but for something like POST parameters it should be good enough. I half expected to find something like that on ActiveState recipes already, but couldn't find any. regards, Anders
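A minimal sketch of the lazily sorted mapping Anders describes (the class name and details are invented here; this is not a complete MutableMapping implementation):

```python
import bisect

class SortedListMap:
    """Dict-like map for untrusted keys: append-and-mark-dirty on insert,
    sort-then-bisect on lookup. Worst case per lookup is O(n log n) for the
    sort, but independent of the hash function."""

    def __init__(self):
        self._items = []      # list of (key, value) pairs
        self._dirty = False

    def __setitem__(self, key, value):
        self._items.append((key, value))
        self._dirty = True

    def _clean(self):
        if self._dirty:
            # stable sort by key: equal keys keep their insertion order
            self._items.sort(key=lambda kv: kv[0])
            self._dirty = False

    def __getitem__(self, key):
        self._clean()
        # (key,) sorts just before any (key, value) pair
        i = bisect.bisect_left(self._items, (key,))
        found, value = False, None
        while i < len(self._items) and self._items[i][0] == key:
            found, value = True, self._items[i][1]  # last duplicate wins
            i += 1
        if not found:
            raise KeyError(key)
        return value
```

As Anders notes, alternating insertions and lookups would trigger a sort per lookup, so this only pays off for the build-once, read-many pattern of parsed request parameters.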

On Dec 31, 2011, at 04:56 PM, Guido van Rossum wrote:
Is there a tracker issue yet? The discussion should probably move there.
I think the answer to this was "no"... until now. http://bugs.python.org/issue13703 Proposed patches should be linked to this issue now. Please nosy yourself if you want to follow the progress. Cheers, -Barry

On Wed, Dec 28, 2011 at 6:28 PM, Michael Foord <fuzzyman@voidspace.org.uk> wrote:
Ironically, this morning I ran across a discussion from about 8 years ago on basically the same thing: http://mail.python.org/pipermail/python-dev/2003-May/035874.html From what I read in the thread, it didn't seem like anyone here was too worried about it. Does this new research change anything? -eric

FWIW, Uncle Timmy considers the non-randomized hashes to be a virtue. It is believed that they give us better-than-random results for commonly encountered datasets. A change to randomized hashes would have a negative performance impact on those cases. Also, randomizing the hash wreaks havoc on doctests, book examples not matching actual dict reprs, and on efforts by users to optimize the insertion order into dicts with frequent lookups. Raymond On Dec 28, 2011, at 5:28 PM, Michael Foord wrote:

Am 29.12.2011 03:09, schrieb Raymond Hettinger:
My five cents on the topic: I totally concur with Raymond. He, Tim and all the others did a fantastic job with the dict implementation and optimization. We shouldn't overreact and mess with the current hashing and dict code just because a well-known and old attack vector pops up again. The dict code is far too crucial for Python's overall performance. However, the issue should be documented in our docs. I've been dealing with web stuff and security for almost a decade, and I've seen far worse attack vectors. This one can easily be solved with a couple of lines of Python code. For example, application developers can limit the maximum number of POST parameters to a sensible amount and limit the length of each key, too. The issue is less severe on platforms with 64-bit hashes, so it won't affect most people. I think only 32-bit Unix and Windows in general (32-bit long) are in trouble. CPython could aid developers with a special subclass of dict. The crucial lookup function is already overridable per dict instance and on subclasses of dict through PyDictObject's struct member PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash). For example, a specialized subclass could limit the search for a free slot to n probes, or choose to ignore the hash argument and calculate its own hash of the key. Christian
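The application-level mitigation Christian describes takes only a few lines. This sketch assumes the query string has already been split into pairs; the function name and limits are illustrative values, not a stdlib API.

```python
# Illustrative limits; tune for the application
MAX_PARAMS = 1000
MAX_KEY_LEN = 256

def parse_query_safely(pairs):
    """pairs: iterable of (key, value) from an already-split query string.
    Rejects requests that could feed a collision attack before any dict
    grows large enough to matter."""
    result = {}
    for n, (key, value) in enumerate(pairs):
        if n >= MAX_PARAMS:
            raise ValueError("too many parameters")
        if len(key) > MAX_KEY_LEN:
            raise ValueError("parameter name too long")
        result[key] = value
    return result
```

With at most 1000 short keys, even a fully colliding request costs on the order of 10^6 comparisons rather than the minutes of CPU quoted in the advisory.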

Am 29.12.2011 12:10, schrieb Antoine Pitrou:
Web framework like Django or CherryPy can be considered an application from the CPython core's point of view. ;) You are right. The term "framework" is a better word.
Yeah! I was thinking about the same when I wrote "calculate its own hash" but I was too sloppy to carry on my argument. Please take 3am as my excuse.

Raymond Hettinger wrote:
Tim Peters' analysis applies mainly to ints, which would be unchanged. A change to the hash function for strings would make no difference to the performance of the dict, as the ordering of the hash values is already quite different from the ordering of the strings for any string of more than 3 characters.
The docs clearly state that the ordering of iteration over dicts is arbitrary. Perhaps changing it once in a while might be a good thing :) Cheers, Mark.

On Thu, 29 Dec 2011 11:25:03 +0000 Mark Shannon <mark@hotpy.org> wrote:
We already change it once in a while. http://twistedmatrix.com/trac/ticket/5352 ;) Regards Antoine.

Michael Foord wrote:
The attack relies on being able to predict the hash value for a given string. Randomising the string hash function is quite straightforward. There is no need to change the dictionary code. A possible (*untested*) patch is attached. I'll leave it for those more familiar with unicodeobject.c to do properly. Cheers, Mark

Mark Shannon wrote:
The paper mentions that several web frameworks work around this by limiting the number of parameters per GET/POST/HEAD request. This sounds like a better alternative than randomizing the hash function of strings. Uncontrollable randomization has issues when you work with multi-process setups, since the processes would each use different hash values for identical strings. Putting the base_hash value under application control could be done to solve this problem, making sure that all processes use the same random base value. BTW: Since your randomization trick uses the current time, it would also be rather easy to tune an attack to find the currently used base_hash. To make this safe, you'd have to use a more random source for initializing the base_hash. Note that the same hash collision attack can be used for other key types as well, e.g. integers (where it's very easy to find hash collisions), so this kind of randomization would have to be applied to other basic types too. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 29 2011)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
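Marc-Andre's point about integer keys is easy to demonstrate on a modern CPython 3, where int hashing is reduction modulo a fixed modulus exposed as sys.hash_info.modulus (available since 3.2):

```python
import sys

# Integer hashing in CPython 3 is modular, so colliding int keys are
# trivial to construct without any brute force:
M = sys.hash_info.modulus
colliding = [1 + k * M for k in range(5)]
assert len({hash(n) for n in colliding}) == 1  # all five hash identically
```

Any randomization scheme limited to str therefore leaves dicts keyed by attacker-controlled integers (e.g. JSON numbers) just as exposed.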

Am 29.12.2011 12:13, schrieb Mark Shannon:
I'm worried that hash randomization of str is going to break 3rd party software that rely on a stable hash across multiple Python instances. Persistence layers like ZODB and cross interpreter communication channels used by multiprocessing may (!) rely on the fact that the hash of a string is fixed. Perhaps the dict code is a better place for randomization. The code in lookdict() and lookdict_unicode() could add a value to the hash. My approach is less intrusive and also closes the attack vector for all possible objects including str, byte, int and so on. I like also Armin's idea of an optional hash randomization. Christian

On Thu, Dec 29, 2011 at 8:19 AM, Christian Heimes <lists@cheimes.de> wrote:
ZODB does not rely on a fixed hash function for strings; for any application to rely on a stable hash would cause problems when updating Python versions. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> "A person who won't read has no advantage over one who can't read." --Samuel Langhorne Clemens

Le 29/12/2011 14:19, Christian Heimes a écrit :
Perhaps the dict code is a better place for randomization.
The problem is the creation of a dict whose keys all have the same hash value. With the current dict implementation, adding a new item whose hash collides requires comparing the new key against all existing colliding keys (comparing the key values, not the hashes, which is much slower). We would have to completely change how dict is implemented to be able to fix this issue. I don't think that we can change the dict implementation without breaking backward compatibility or breaking applications. Changing the implementation would change dict properties, and applications rely on the properties of the current implementation. Tell me if I am wrong. Victor

Am 31.12.2011 03:31, schrieb Victor Stinner:
You are right and I was wrong. We can't do the randomization inside the dict code. The randomization factor must be used as an initialization vector (IV) for the hashing algorithm. At first I thought the attack exploited the unique property of Python's dict implementation (perturbed hash instead of buckets for equal hashes) but I was wrong. It took me several hours of reading and digging into the math until I figured out my mistake. Sorry! :) This means we can't implement a salted dict unless the salted dict re-implements the hash algorithm for unicode, bytes and memoryview, and I doubt that's a wise idea. I've checked in my first draft of a possible solution: http://hg.python.org/features/randomhash/ . The pseudo RNG has to be replaced with something better, and it's missing an option to feed in a seed, too. Christian

On 1/3/2012 5:02 PM, Bill Janssen wrote:
The doc for id() now says "This is an integer which is guaranteed to be unique and constant for this object during its lifetime." Since the default 3.2.2 hash for my Win7 64-bit CPython is id-address // 16, it can carry no guarantee beyond that lifetime either. I suggest that the hash() doc say something similar: http://bugs.python.org/issue13707 -- Terry Jan Reedy

On Wed, Jan 4, 2012 at 12:02 AM, Bill Janssen <janssen@parc.com> wrote:
A lot of software will break their tests, because dict ordering would depend on the particular run. I know, because some of them break on PyPy, which has a different dict ordering. This is probably a good thing in general, but is it really worth it? People will install python 2.6.newest and stuff *will* break. Is it *really* a security issue? We knew all along that dicts are O(n^2) in the worst-case scenario; how is this suddenly a security problem? Cheers, fijal

On Wed, Jan 04, 2012 at 11:55:13AM +0100, Antoine Pitrou wrote:
I don't think that's news either. http://mail.python.org/pipermail/python-dev/2003-May/035907.html and http://twistedmatrix.com/pipermail/twisted-python/2003-June/004339.html for instance show that in 2003 it was clearly known to at least be likely to be an exploitable DoS in common code (a dict of HTTP headers or HTTP form keys). There was debate about whether it's the language's responsibility to mitigate the problem or if apps should use safer designs for handling untrusted input (e.g. limit the number of keys input is allowed to create, or use something other than dicts), and debate about just how practical an effective exploit would be. But I think it was understood to be a real concern 8 years ago, so not exactly sudden. Just because it's old news doesn't make it not a security problem, of course. -Andrew.

On Thu, 5 Jan 2012 15:26:27 +1100 Andrew Bennetts <andrew@bemusement.org> wrote:
That's not news indeed, but that doesn't make it less of a problem, especially now that the issue has been widely publicized through a conference and announcements on several widely-read Web sites. That said, only doing the security fix in 3.3 would have the nice side effect of pushing people towards Python 3, so perhaps I'm for it after all. Half-jokingly, Antoine.

On Thu, Jan 5, 2012 at 3:39 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Just to make things clear -- stdlib itself has 1/64 of its tests relying on dict order. Changing dict order in *older* Pythons will break everyone's tests and some people's code. Making this new 2.6.x release would mean that people using the new Python 2.6 would have to upgrade an unspecified number of their Python packages, which does not sound very cool. Also consider that the new 2.6.x would go as a security fix to old Ubuntu, but all other packages won't, because they'll not contain security fixes. Just so you know. Cheers, fijal

On 1/5/2012 9:34 AM, Maciej Fijalkowski wrote:
Why should CPython be constrained by broken policies of Ubuntu? If the other packages must be fixed so they work correctly with a security fix in Python, then they should be considered as containing a security fix. If they aren't, then that is a broken policy.

On the other hand, it is very true that the seductive convenience of dict (readily available, good performance) in normal cases has created the vulnerability, because its characteristics are a function of the data inserted, and when used for data from unknown, possibly malicious sources, that is a bug in the program that uses dict, not in dict itself. So it seems to me that:

1) the security problem is not in CPython, but rather in web servers that use dict inappropriately.
2) changing CPython in a way that breaks code is not a security fix to CPython, but rather a gratuitous breakage of compatibility promises, wrapped in a security-fix lie.

The problem for CPython here can be summarized as follows:

a) it is being blamed for problems in web servers that are not problems in CPython
b) perhaps the dict documentation is a bit too seductive, in not declaring that data from malicious sources could cause its performance to degrade significantly (any programmer who has taken a decent set of programming classes should understand that, but there are programmers who have not taken such classes)
c) CPython provides no other mapping data structures that rival the performance and capabilities of dict as an alternative, nor can such data structures be written in pure Python, as the performance of dict comes not only from hashing, but also from being written in C.

The solutions could be:

A) push back on the blame: it is not a CPython problem
B) perhaps add a warning to the documentation for naïve, untrained programmers
C) consider adding an additional data structure to the language, and mention it in the B warning for versions 3.3+.
On the other hand, the web server vulnerability could be blamed on CPython in another way: identify vulnerable packages in the stdlib that are likely to be used during the parsing of user-supplied data. Ones that come to mind (Python 3.2) are urllib.parse (various parse* functions; package names differ in Python 2.x) and cgi (parse_multipart, FieldStorage). So fixing the vulnerable packages could be a sufficient response, rather than changing the hash function.

How to fix? Each of those above allocates and returns a dict. Simply have each of them allocate and return a wrapped dict, which has the following behaviors:

i) during __init__, create a local, random string.
ii) for all key values, prepend the string before passing the key to the internal dict.

Changing these vulnerable packages rather than the hash function is a much more constrained change, and wouldn't create bugs in programs that erroneously depend on the current hash function directly or indirectly. This would not fix web servers that use their own parsing and storage mechanism for <FORM> fields, if they have also inappropriately used a dict as their storage mechanism for user-supplied data. However, a similar solution could be applied by the authors of those web servers, and would be a security fix to such packages, so it should be applied to Ubuntu, if available there, or other systems with security-only fix acceptance.

This solution does not require changes to the hash, does not require a cryptographically secure hash, and does not require code to be added to the initialization of Python before normal objects and mappings can be created. If a port doesn't have a good random number generator, a weak one can be substituted, but such decisions can be made in Python code after the interpreter is initialized, when the stdlib is available.
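Glenn's wrapped-dict idea can be sketched like this. The class name, prefix length, and minimal interface are made up for illustration; this is not code from any stdlib patch.

```python
import binascii
import os

class PrefixedDict(dict):
    """Sketch of a per-instance random key prefix: an attacker cannot
    precompute colliding keys without knowing this instance's prefix.
    Illustrative only; str keys and a minimal interface."""

    def __init__(self):
        dict.__init__(self)
        self._prefix = binascii.hexlify(os.urandom(16)).decode('ascii')

    def __setitem__(self, key, value):
        dict.__setitem__(self, self._prefix + key, value)

    def __getitem__(self, key):
        return dict.__getitem__(self, self._prefix + key)

    def __contains__(self, key):
        return dict.__contains__(self, self._prefix + key)
```

One caveat: iteration and repr still expose the prefixed keys, which is one reason this is only good enough for internal uses like form parsing, not as a general dict replacement.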

On 01/05/2012 02:14 PM, Glenn Linderman wrote:
1) the security problem is not in CPython, but rather in web servers that use dict inappropriately.
Most webapp vulnerabilities are due to their use of Python's cgi module, which uses a dict to hold the form / query string data supplied by untrusted external users. Tres. -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Am 05.01.2012 21:10, schrieb Ethan Furman:
You'd have to fix any Python core module that may handle data from untrusted sources. The issue isn't limited to web apps and POST requests. It's possible to trigger the DoS from JSON, a malicious PDF, JPEG's EXIF metadata or any other data. Oh, and somebody has to fix all 3rd party modules, too. Christian

On 1/5/2012 3:10 PM, Ethan Furman wrote:
Tres Seaver wrote:
I think both should be done. For web applications, it would be best to reject DoS attempts with 'random' keys in O(1) time rather than in O(n) time, even with an improved hash. But in some other apps, like the Python interpreter itself, 'random' names may be quite normal. -- Terry Jan Reedy

On Fri, Jan 6, 2012 at 8:15 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Not a good idea - a lot of the 3rd party tests that depend on dict ordering are going to be using those modules anyway, so scattering our solution across half the standard library is needlessly creating additional work without really reducing the incompatibility problem. If we're going to change anything, it may as well be the string hashing algorithm itself. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1/5/2012 4:10 PM, Nick Coghlan wrote:
Thanks for the implementation, Serhiy. That is the sort of thing I had in mind, indeed.
Not a good idea - a lot of the 3rd party tests that depend on dict ordering are going to be using those modules anyway,
Stats? Didn't someone post a list of tests that fail when changing the hash? Oh, those were stdlib tests, not 3rd party tests. I'm not sure how to gather the stats, then; are you?
Half the standard library? no one has cared to augment my list of modules, but I have seen reference to JSON in addition to cgi and urllib.parse. I think there are more than 6 modules in the standard library...
If we're going to change anything, it may as well be the string hashing algorithm itself.
Changing the string hashing algorithm is known (or at least no one has argued otherwise) to be a source of backward incompatibility that will break programs. My proposal (and Serhiy's implementation, assuming it works or can easily be tweaked to work; I haven't reviewed it in detail or attempted to test it) will only break programs that have vulnerabilities. I failed to mention one other benefit of my proposal: every web request would have a different random prefix, so attempting to gather info is futile: the next request has a different random prefix, so different strings would collide.
Cheers, Nick.
Indeed it is nice when we can be cheery even when arguing, for the most part :) I've enjoyed reading the discussions in this forum because most folks have respect for other people's opinions, even when they differ.

On Thu, 5 Jan 2012 19:34:13 +0200 Maciej Fijalkowski <fijall@gmail.com> wrote:
Breaking tests is not a problem: they are typically not run by production code and so people can take the time to fix them. Breaking other code is a problem if it is legitimate. Relying on dict ordering is totally wrong and I don't think we should care about such cases. The only issue is when relying on hash() being stable across runs. But hashing already varies from build to build (32-bit vs. 64-bit) and I think that anyone seriously relying on it should already have been bitten.
How about 2.7? Do you think it should also remain untouched? I am ok for leaving 2.6 alone (that's Barry's call anyway) but 2.7 is another matter - should people migrate to 3.x to get the security fix? As for 3.2, it should certainly get the fix IMO. There are not many Python 3 legacy applications relying on hash() stability, I think.
Ubuntu can decide *not* to ship the fix if they prefer it like that. Their policies and decisions, though, should not taint ours. Regards Antoine.

On Thu, 2012-01-05 at 19:34 +0200, Maciej Fijalkowski wrote:
We have similar issues in RHEL, with the Python versions going much further back (e.g. 2.3) When backporting the fix to ancient python versions, I'm inclined to turn the change *off* by default, requiring the change to be enabled via an environment variable: I want to avoid breaking existing code, even if such code is technically relying on non-guaranteed behavior. But we could potentially tweak mod_python/mod_wsgi so that it defaults to *on*. That way /usr/bin/python would default to the old behavior, but web apps would have some protection. Any such logic here also suggests the need for an attribute in the sys module so that you can verify the behavior.

On 5 January 2012 19:33, David Malcolm <dmalcolm@redhat.com> wrote:
Uh, surely no-one is suggesting backporting to "ancient" versions? I couldn't find the statement quickly on the python.org website (so this is via google), but isn't it true that 2.6 is in security-only mode and 2.5 and earlier will never get the fix? Having a source-only release for 2.6 means the fix is "off by default" in the sense that you can choose not to build it. Or add a #ifdef to the source if it really matters. Personally, I find it hard to see this as a Python security hole, but I can sympathise with the idea that it would be nice to make dict "safer by default". (Although the benefit for me personally would be zero, so I'm reluctant for the change to have a detectable cost...) My feeling is that it should go into 2.7, 3.2, and 3.3+, but with no bells and whistles to switch it off or the like. If it's not suitable to go in on that basis, restrict it to 3.3+ (where it's certainly OK) and advise users of earlier versions to either upgrade or code defensively to avoid hitting the pathological case. Surely that sort of defensive code should be second nature to the people who might be affected by the issue? Paul.

On Jan 05, 2012, at 08:35 PM, Paul Moore wrote:
Correct, although there's no reason why a patch for versions older than 2.6 couldn't be included on a python.org security page for reference in CVE or other security notifications. Distros that care about versions older than Python 2.6 will basically be back-porting the patch anyway.
My feeling is that it should go into 2.7, 3.2, and 3.3+, but with no bells and whistles to switch it off or the like.
I like David Malcolm's suggestion, but I have no problem applying it to 3.3, enabled by default with no way to turn it off. The off-by-default on-switch policy for stable releases would be justified by maximum backward compatibility conservativeness. -Barry

On Thu, Jan 05, 2012 at 08:35:57PM +0000, Paul Moore wrote:
I think when dmalcolm says "backporting" he means that he'll have to backport the fix from modern, supported-by-python.org Python to the ancient Pythons that he's supporting as part of the Linux distributions where he's the Python package maintainer. I'm thinking he's mentioning it here mainly to see whether anyone can point out a reason not to diverge from upstream in that manner.
I don't think that this would satisfy dmalcolm's needs. What he's talking about sounds more like a runtime switch (possibly only when initializing, though, not on-the-fly). -Toshio

On Jan 05, 2012, at 02:33 PM, David Malcolm wrote:
This sounds like a reasonable compromise for all stable Python releases. It can be turned on by default for Python 3.3. If you also make the default setting easy to change (i.e. parameterized in one place), then distros can make their own decision about the default, although I'd argue for the above default approach for Debian/Ubuntu.
Any such logic here also suggests the need for an attribute in the sys module so that you can verify the behavior.
That would be read-only though, right? -Barry

On 05.01.2012 21:45, Barry Warsaw wrote:
Hey Barry, stop stealing my ideas! :) I've argued for these default settings for days:

    ver  delivery   randomized hashing
    ===================================
    2.3  patch      disabled by default
    2.4  patch      disabled
    2.5  patch      disabled
    2.6  release    disabled
    2.7  release    disabled
    3.0  ignore?    disabled
    3.1  release    disabled
    3.2  release    disabled
    3.3  n/a yet    enabled by default

2.3 to 2.5 are still used in production (RHEL, Ubuntu LTS). Guido has stated that he needs a patch for 2.4, too. I think we may safely ignore Python 3.0. Nobody should use Python 3.0 on a production system. I've suggested the env var PYRANDOMHASH. It's easy to set env vars in Apache. For example Debian/Ubuntu has /etc/apache2/envvars. Settings for PYRANDOMHASH:

    PYRANDOMHASH=1              enable randomized hashing function
    PYRANDOMHASH=/path/to/seed  enable randomized hashing function and read seed from 'seed'
    PYRANDOMHASH=0              disable randomized hashing function

Since there isn't an easy way to set env vars in a shebang line (something like #!/usr/bin/env PYRANDOMHASH=1 python2.7 doesn't work), we could come up with a solution for the shebang. IMHO the default setting should be a compile time option. It's reasonably easy to extend the configure script to support --enable-randomhash / --disable-randomhash. The MS VC build scripts can grow a flag, too. I still think that the topic needs a PEP. A couple of days ago I started on a PEP, but Guido told me that he doesn't see the point because he prefers a small and quick solution, so I stopped working on it. However, the arguments, worries and ideas in this enormous topic have been repeated over and over. We know from experience that a PEP is a great way to explain the how, what and why of a change as well as the paths we didn't take. Christian

On Jan 05, 2012, at 10:40 PM, Christian Heimes wrote:
Hey Barry, stop stealing my ideas! :) I've argued for these default settings for days.
:)
I've suggested the env var PYRANDOMHASH. It's easy to set env vars in Apache. For example Debian/Ubuntu has /etc/apache2/envvars.
For consistency, it really should be PYTHONSOMETHING. I personally don't care how long it is (e.g. PYTHONIOENCODING).
Why not PYTHONHASHSEED then?
We have precedence for mirroring startup options and envars, so it doesn't bother me to add such a switch to Python 3.3. It *does* bother me to add a switch to any stable release.
One way to look at it is to have a quick-and-dirty solution for stable releases. It could be suboptimal from a ui point of view because of backward compatibility issues. The PEP could then outline the boffo perfect solution for Python 3.3, which a section on how it will be backported to stable releases. Cheers, -Barry

2012/1/6 Barry Warsaw <barry@python.org>:
See my patch attached to issue #13703. I prepared the code so that the hash seed can be set easily (it has an LCG; its seed can be provided by the user directly). I agree that the value 0 should give the same behaviour as the current hash (disable the randomized hash). I will add the variable in the next version of my patch.

David Malcolm wrote:
Surely the way to verify the behaviour is to run this from the shell: python -c 'print(hash("abcde"))' twice, and see that the calls return different values. (Or have I misunderstood the way the fix is going to work?) In any case, I wouldn't want to rely on the presence of a flag in the sys module to verify the behaviour, I'd want to see for myself that hash collisions are no longer predictable. -- Steven

On Fri, Jan 6, 2012 at 10:07 AM, Steven D'Aprano <steve@pearwood.info> wrote:
More directly, you can just check that the hash of the empty string is non-zero. So -1 for a flag in the sys module - "hash('') != 0" should serve as a sufficient check whether or not process-level string hash randomisation is in effect. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Fri, Jan 6, 2012 at 10:59 AM, Benjamin Peterson <benjamin@python.org> wrote:
What exactly is the disadvantage of a sys attribute? That would seem preferable to an obscure incarnation like that.
Adding sys attributes in maintenance (or security) releases makes me nervous. However, Victor and Christian are right about the need for a special case to avoid leaking information, so my particular suggested check won't work. The most robust check would be to run sys.executable in a subprocess and check if it gives the same hash for a non-empty string as the current process. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
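Nick's subprocess check could be sketched roughly as follows. This is a hypothetical helper, not stdlib code — the function name and probe string are made up for illustration:

```python
import subprocess
import sys

def hash_randomization_active(probe="hash-probe"):
    """Return True if a fresh interpreter hashes `probe` differently
    than the current process, i.e. per-process hash randomization
    appears to be in effect.

    Note: a false negative is possible if the child process happens
    to pick a seed that agrees on this one string, so a careful
    caller would retry with a few different probe strings.
    """
    child = subprocess.run(
        [sys.executable, "-c", "print(hash(%r))" % probe],
        capture_output=True, text=True, check=True,
    )
    return int(child.stdout.strip()) != hash(probe)

print(hash_randomization_active())
```

The child inherits the parent's environment, so any seed-forcing environment variable would make parent and child agree — which is exactly the behaviour being verified.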

Benjamin Peterson wrote:
There's nothing obscure about directly testing the hash. That's about as far from obscure as it is possible to get: you are directly testing the presence of a feature by testing the feature. Relying on a flag to tell you whether hashes are randomised adds additional complexity: now you need to care about whether hashes are randomised AND know that there is a flag you can look up and what it is called. And since the flag won't exist in all versions of Python, or even in all builds of a particular Python version, it isn't a matter of just testing the flag, but of doing the try...except or hasattr() dance to check whether it exists first. At some point, presuming that there is no speed penalty, the behaviour will surely become not just enabled by default but mandatory. Python has never promised that hashes must be predictable or consistent, so apart from backwards compatibility concerns for old versions, future versions of Python should make it mandatory. Presuming that there is no speed penalty, I'd argue in favour of making it mandatory for 3.3. Why do we need a flag for something that is going to be always on? -- Steven

Glenn Linderman wrote:
There *may* be a speed penalty, but I draw your attention to Paul McMillan's email on 1st of January: Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings. and Christian Heimes' email on 3rd of January: The changeset adds the murmur3 hash algorithm with some minor changes, for example more random seeds. At first I was worried that murmur might be slower than our old hash algorithm. But in fact it seems to be faster! So I think that it's a fairly safe bet that there will be a solution that is as fast, or at worst, trivially slower, than the current hash function. But of course, benchmarks will be needed. -- Steven

Using my patch (random-2.patch), the overhead is 0%. I cannot see a difference with and without my patch. Numbers:

    unpatched:
    == 3 characters ==
    1 loops, best of 3: 459 usec per loop
    == 10 characters ==
    1 loops, best of 3: 575 usec per loop
    == 500 characters ==
    1 loops, best of 3: 1.36 msec per loop

    patched:
    == 3 characters ==
    1 loops, best of 3: 458 usec per loop
    == 10 characters ==
    1 loops, best of 3: 575 usec per loop
    == 500 characters ==
    1 loops, best of 3: 1.36 msec per loop

(the patched version looks faster just because the timer is not reliable enough for such a fast test) Script:

    echo "== 3 characters =="
    ./python -m timeit -n 1 -s 'text=(("%03i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%03i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%03i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    echo "== 10 characters =="
    ./python -m timeit -n 1 -s 'text=(("%010i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%010i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%010i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    echo "== 500 characters =="
    ./python -m timeit -n 1 -s 'text=(("%0500i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%0500i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'
    ./python -m timeit -n 1 -s 'text=(("%0500i" % x) for x in range(1,1000))' 'sum(hash(x) for x in text)'

(Take the smallest timing for each test.) "-n 1" is needed because the hash value is only computed once (it is cached). It may be possible to get more reliable results by disabling the hash cache completely (comment out the "PyUnicode_HASH(self) = x;" line). Victor

On 04.01.2012 08:59, Maciej Fijalkowski wrote:
For example Microsoft has released an extraordinary and unscheduled security patch for the issue between Christmas and New Year. I don't normally use MS as reference but this should give you a hint about the severity. Have you watched the talk yet? http://www.youtube.com/watch?v=R2Cq3CLI6H8 Christian

Hi, Just some extra thoughts about the whole topic in the light of web applications (since this was hinted at in the talk) running on Python: Yes, you can limit the maximum number of allowed parameters for post data, but really there are so many places where data is parsed into hashing containers that it's quite a worthless task. Here a very brief list of things usually parsed into a dict or set and where it happens:

- URL parameters and url encoded form data. Generally this happens somewhere in a framework, but typically also in utility libraries that deal with URLs. For instance the stdlib's cgi.parse_qs or urllib.parse.parse_qs on Python 3 do just that, and that code is used left and right. Even if a framework started limiting its own URL parsing, there is still a lot of code that does not, and the stdlib does the parsing as well. With form data it's worse because you have multipart headers that need parsing, and that is usually abstracted away so far from the user that they do not do that. Many frameworks just use the cgi module's parsing functions, which also feed directly into a dictionary.

- HTTP headers. There is zero a WSGI framework can do about that, since the headers are parsed into a dictionary by the WSGI server.

- Incoming JSON data. Again outside of what the framework can do for the most part. simplejson can be modified to stop parsing with the hook stuff, but nobody does that, and since users invoke simplejson's parsing routines themselves, most webapps would still be vulnerable even if all frameworks fixed the problem.

- Hidden dict parameters. Things like the parameter part of the content-type or content-disposition headers are generally also just parsed into a dictionary. Likewise many frameworks parse things into sets (for instance incoming etags). The cookie header is usually parsed into a dictionary as well.
The issue is nothing new, and at least my current POV on this topic was that your server should be guarded and shoot handlers of requests going rogue. Dictionaries are not the only thing with a worst-case performance that could be triggered by user input. That said, considering that there are so many different places where close-to-arbitrarily-long input is parsed into a dictionary or other hashing structure, it's hard for a web application developer or framework to protect against it. In case the watchdog is not a viable solution as I had assumed it was, I think it's more reasonable to indeed consider adding a flag to Python that allows randomization of hashes optionally before startup. However as was said earlier, the attack is a lot more complex to carry out on a 64bit environment, so it's probably (as it stands right now!) safe to ignore. The main problem there is not that it's a new attack but that some dickheads could now make prebaked attacks against websites to disrupt them, which might cause some negative publicity. In general though there are so many more ways to DDOS a website than this that I would rate the whole issue very low. Regards, Armin
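To illustrate the first bullet point above: even with no framework in the picture, the stdlib parses attacker-controlled keys straight into a dict. A minimal example:

```python
from urllib.parse import parse_qs

# Every key in a query string becomes a dict key, so an attacker who
# controls the request body or URL controls exactly which strings get
# hashed and inserted.
query = "a=1&b=2&a=3"
params = parse_qs(query)
print(params)  # {'a': ['1', '3'], 'b': ['2']}
```

A crafted request would simply use thousands of keys chosen to share one hash value instead of "a" and "b".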

Hi, Something I should add to this now that I thought about it a bit more: Assuming this should be fixed on a language level, the solution would probably be to salt hashes. The most common hash to salt here is the PyUnicode hash, for obvious reasons.

- Option a: Compiled-in salt
  + Easy to implement
  - Breaks unittests most likely (those were broken in the first place, but that's still a very annoying change to make)
  - Might cause problems with interoperability of Pythons compiled with different hash salts
  - You're not really solving the problem, because each Linux distribution (besides Gentoo, I guess) would have just one salt compiled in, and that would be popular enough to have the same issue.

- Option b: Environment variable for the salt
  + Easy-ish to implement
  + Easy to synchronize over different machines
  - Initialization for base types happens early and unpredictably, which makes it hard for embedded Python interpreters (think mod_wsgi and other things) to specify the salt

- Option c: Random salt at runtime
  + Easy to implement
  - Impossible to synchronize
  - Breaks unittests in the same way as a compiled-in salt would

Where to add the salt? Unicode strings and bytestrings (bytes objects), I guess, since those are the most common offenders. Sometimes tuples are keys of dictionaries, but in that case a contributing factor to the hash is the string in the tuple anyway. Also related: since this is a security related issue, would this be something that goes into Python 2? Does that affect how a fix would look? Regards, Armin
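As a toy model of what "salting" means in these options, here is a DJBX33A-style string hash with the salt used as the initial value. This is a sketch only — CPython's real string hash is a different multiply/XOR scheme:

```python
def salted_hash(s, salt):
    """DJBX33A-style string hash seeded with `salt` instead of a
    fixed initial value (illustration, not CPython's algorithm)."""
    h = salt
    for ch in s:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF  # keep it 32-bit
    return h

# With a fixed (compiled-in) salt every run agrees -- and so does an
# attacker's precomputed collision set.  With a per-process random
# salt the same string hashes differently in each interpreter:
print(salted_hash("hello", 0))
print(salted_hash("hello", 0x9E3779B9))  # different salt, different hash
```

Because each round multiplies by an odd constant (an invertible map mod 2**32), distinct salts always yield distinct hashes for the same string, which is why precomputed collision sets stop working once the salt is unknown.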

Hi, how about

- Option d: Host-based salt
  + Easy-ish to implement – how about basing it on the hostname, for example?
  + Transparent for all processes on the same host
  - Probably unit test breakage

In fact, we could use host-based as the default, with the option to specify your own, which would solve the sync problems. That said, I agree with Armin that fixing this in the frameworks isn't an option. Regards, Hynek On Thursday, 29 December 2011 at 13:57, Armin Ronacher wrote:

+1 to option d (host-based salt), but we would need to consistently order the hostnames/addresses to guarantee that all processes on the same machine get the same salt by default. +1 to option b (environment variable) as an override. And/or maybe an override on the command line. +1 to implementing the salt in the dictionary hash as an additive value. +0 to exposing the salt as a constant (3.3+ only) - or alternatively expose a hash function that just takes an existing hash and returns the salted hash. That would make it very easy for anything that wanted a salted hash to get one. For choosing the default salt, I think something like: a. If IPv6 is enabled, take the link-local address of the interface with the default route. Pretty much guaranteed not to change, can't be determined externally (salting doesn't need a secret, but it doesn't hurt), large number so probably a good salt. (If it is likely to change, a salt override should be being used instead). Don't use any other IPv6 address. In particular, never use a "temporary" IPv6 address like Windows assigns - multiprocessing could end up with instances with different salts. b. Take the FQDN of the machine. Tim Delaney
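A host-based default salt (option d, falling back to the FQDN as Tim suggests) might be derived roughly like this. A sketch only: the helper name is made up, and SHA-256 is used purely to reduce the hostname to a fixed-width integer, not for secrecy:

```python
import hashlib
import socket

def host_salt():
    """Derive a per-host salt from the FQDN (host-based salt sketch).

    Every process on the same machine gets the same value, so hashes
    stay consistent across e.g. multiprocessing workers, while still
    differing from host to host.
    """
    fqdn = socket.getfqdn().encode("utf-8")
    return int.from_bytes(hashlib.sha256(fqdn).digest()[:8], "big")
```

As Paul notes later in the thread, a salt that is guessable from outside (like a hostname) offers weaker protection than a truly random one; this only shows the mechanics of the "same salt for every process on one host" property.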

It's worth pointing out that if the salt is somehow exposed to an attacker, or is guessable, much of the benefit goes away. It's likely that a timing attack could be used to discover the salt if it is fixed per machine or process over a long period of time. If a salt is generally fixed per machine, but varies from machine-to-machine, I think we'll see an influx of frustrated devs who have something that works perfectly on their machine but not for others. It doesn't matter that they're doing it wrong, we'll still have to deal with them as a community. This seems like an argument in favor of randomizing it at runtime by default, so it fails early for them. Allowing an environment and command line override makes sense, as it allows users to rotate the salt as frequently as their needs dictate. -Paul

On Thu, 29 Dec 2011 13:57:07 +0100 Armin Ronacher <armin.ronacher@active-4.com> wrote:
This option would have my preference. I don't think hash() was ever meant to be "synchronizable". Already, using a 32-bit Python will give you different results from a 64-bit Python. As for breaking unittests, these tests were broken in the first place: hash() does change from time to time.
Or it could be a process-wide constant for all dicts. If the constant is additive as proposed by Mark the impact should be negligible. (but the randomness must be good enough) Regards Antoine.

On 29.12.2011 13:57, Armin Ronacher wrote:
- Option d: Don't change __hash__ but only use a randomized hash for the PyDictEntry lookup
  + Easy to implement
  - Breaks only software that relies on a fixed order of dict keys
  - Breaks only a few to no unit tests

IMHO we don't have to alter the outcome of hash("some string"), hash(1) and all other related types. We just need to reduce the chance that an attacker can produce collisions in the dict (and set?) code that looks up the slot (PyDictEntry). How about adding the random value in Objects/dictobject.c:lookdict() and lookdict_str() (Python 2.x) / lookdict_unicode() (Python 3.x)? With this approach the hash of all our objects stays the same and just the dict code needs to be altered. The approach has also the benefit that all possible objects gain a randomized hash.
IMHO it does affect the fix. A changed and randomized hash function may break software that relies on a stable hash() function. Christian

On Thu, 29 Dec 2011 14:32:21 +0100 Christian Heimes <lists@cheimes.de> wrote:
I basically agree with your proposal. The only downside is that custom hashed containers (such as _pickle.c's memotable) don't automatically benefit. That said, I think it would be difficult to craft an attack against the aforementioned memotable (you would have to basically choose the addresses of pickled objects). Regards Antoine.

On Thu, Dec 29, 2011 at 8:32 AM, Christian Heimes <lists@cheimes.de> wrote:
I don't understand how that helps a collision attack. If you can still generate two strings with the same (pre-randomized) hash, what difference does it make that the dict adds a random number? The post-randomized number will still be the same, no? Or does this attack just rely on the hash *remainders* being the same? If so, I can see how hashing the hash would help. But since the attacker doesn't know the modulus, and it can change as the dictionary grows, I would expect the attack to require matching hashes, not just matching hash remainders... unless I'm just completely off base here.

On 29.12.2011 21:07, PJ Eby wrote:
The attack doesn't need perfect collisions. The attacker calculates strings in a way so that their hashes result in as many collisions as possible in the dict code. An attacker succeeds when the initial slot for a hash is filled, and as many subsequent slots of the perturbed masked hash, too. Also an attacker can easily predict the size and therefore the mask for the hash remainder. A POST request parser usually starts with an empty dict, and the growth rate of Python's dicts is well documented. The changing mask makes the attack just a tiny bit more challenging. The hash randomization idea adds a salt to throw the attacker off course. Instead of

    position = hash & mask

it's now

    hash = salt + hash
    position = hash & mask

where salt is a random, process-global value that is fixed for the lifetime of the program. The salt also affects the perturbance during the search for new slots. As you already stated, this salt won't be effective against full hash collisions. The attack needs A LOT of problematic strings to become an issue, possibly hundreds of thousands or even millions of keys in a very large POST request. In reality an attacker won't reach the full theoretical O(n^2) performance degradation for a hash table. But even more than O(n) instead of O(1) for a million keys in each request has some impact on your servers' CPUs. Some vendors have limited POST requests to 1 MB or the number of keys to 10,000 to work around the issue. One paper also states that attacks on Python with 64bit are just academic for now. Christian
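The "salt the slot computation, not hash()" scheme described above can be modelled with a toy open-addressing table that mimics CPython's perturb-based probing. Illustrative only — the real lookdict() is C code in Objects/dictobject.c, and this toy never resizes:

```python
class SaltedTable:
    """Toy open-addressing hash table that mixes a process-global
    salt into the slot index (and the perturbation), leaving the
    objects' hash() values themselves untouched."""

    SALT = 0x12345678  # stand-in for a random per-process value

    def __init__(self, size=8):
        self.slots = [None] * size  # each slot: None or (key, value)
        self.mask = size - 1

    def _probe(self, key):
        # Salt is added before masking, as in: position = (salt + hash) & mask
        h = (hash(key) + self.SALT) & 0xFFFFFFFFFFFFFFFF
        i = h & self.mask
        perturb = h
        # CPython-style probe sequence: i = 5*i + perturb + 1
        while self.slots[i] is not None and self.slots[i][0] != key:
            perturb >>= 5
            i = (5 * i + perturb + 1) & self.mask
        return i

    def __setitem__(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]
```

Because the salt enters both the initial index and the perturbation, an attacker who precomputed keys sharing the same masked slot for salt 0 gets a scattered probe pattern under any other salt — though, as noted above, keys with fully identical hash() values still collide.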

On 12/29/2011 4:31 PM, Christian Heimes wrote:
As I understood the talk (actually, the bit of Perl interpreter C code shown), the randomization is to change hash(s) to hash(salt+s) so that the salt is completely mixed into the hash from the beginning, rather than just tacked on at the end. -- Terry Jan Reedy

On 29.12.2011 23:28, Terry Reedy wrote:
Yes, the Perl and Ruby code uses a random seed as IV for hash generation. It's the best way to create randomized hashes but it might not be a feasible fix for Python 2.x. I'm worried that it might break applications that rely on stable hash values.

The talk was a presentation yesterday by Alexander Klink and Julian Wälde at the Chaos Communication Congress in Germany hashDoS@alech.de I read the non-technical summary at http://arstechnica.com/business/news/2011/12/huge-portions-of-web-vulnerable... and watched the video of the talk at https://www.youtube.com/watch?feature=player_embedded&v=_EEhviEO1Vo# My summary: hash table creation with N keys changes from amortized O(N) to O(N**2) time if the hash values of all the keys are the same. This should only happen for large N if done intentionally. This is easy to accomplish with a linear multiply-and-add hash function, such as used in PHP4 (but nowhere else that the authors found). A nonlinear multiply-and-xor hash function, used in one form or another by everything else, is much harder to break. It is *theoretically* vulnerable to brute-force search and this has been known for years. With a cleverer meet-in-the-middle strategy that builds a dict of suffixes and then searches for matching prefixes, 32-bit hashes are *practically* vulnerable. The attack depends on, for instance, 2**16 (64K) being 1/64K of 2**32. (I did not hear when this strategy was developed, but it is certainly more practical on a desktop now than even 8 years ago.) [64-bit hashes are much, much less vulnerable to attack, at least for now. So it seems to me that anyone who hashes potential attack data can avoid the problem by using 64-bit Python with 64-bit hash values. If I understood Antoine, that should be all 64-bit builds.] More summary: Perl added a #define option years ago to start the hash calculation with a non-zero value instead of 0, to "avoid algorithmic complexity attacks". The patch is at 47:20 in the video. The authors believe all should do similarly. [The change amounts to adding a char, unknown to attackers, to the beginning of every string before hashing. So it adds a small bit of time.
The code patch shown did not show the source of the non-zero seed or the timing and scope of any randomization. As the discussion here has shown, this is an important issue to applications. So 'do the same' is inadequate and over-simplified advice. I believe Armin's patch is similar to the Perl patch.] Since the authors sent out CERT alert about Nov 1, PHP has added to PHP5 a new function to limit the number of vars hashed. Microsoft will do something similar now with hash randomization to follow (maybe?). JRuby is going to do something. Java does not think it needs to change Java itself, but will leave all to the frameworks. [The discussion here suggests that this is an inadequate response for 32-bit systems like Java since one person/group may not control all the pieces of a server system. However, a person or group can run all pieces on a system Python with an option turned on.] -- Terry Jan Reedy

A flag will only be needed if the overhead of the fix is too high.
I suppose that there are still servers running 32-bit Python.
There are countermeasures for low level DDOS (ICMP ping flood, TCP syn flood, etc.). An application (or a firewall) cannot implement a countermeasure for this high level issue. It can only be fixed in Python directly (by changing the implementation of the dict type or of the hash function). Victor

On 29/12/2011 02:28, Michael Foord wrote:
A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages Python included:
http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_...
This PDF doesn't explain exactly what the problem is or how it can be solved. Let's try to summarize this "vulnerability". The creation of a Python dictionary has a complexity of O(n) in most cases, but O(n^2) in the *worst* case. The attack tries to force the worst case. It requires computing a set of N keys having the same hash value (hash(key1) == hash(key2) == ... == hash(keyN)). The attacker only has to compute these keys once. It looks like it is now cheap enough in practice to compute such a dataset for Python (and other languages). A countermeasure would be to check that we don't have more than X keys with the same hash value, but in practice we don't know in advance how data are processed, and there are too many input vectors in various formats. If we want to fix something, it should be done in the implementation of the dict type or in the hash algorithm. We could implement dict differently to avoid this issue, using a binary tree for example. But because dict is a fundamental type in Python, I don't think that we can change its implementation (without breaking backward compatibility and so applications in production). A possibility would be to add a *new* type, but all libraries and applications would need to be changed to fix the vulnerability. The last choice is to change the hash algorithm. The *idea* is the same as adding salt to hashed passwords (in practice it will be a little bit different): if a pseudo-random salt is added, the attacker cannot prepare a single dataset; he/she will have to regenerate a new dataset for each possible salt value. If the salt is big enough (size in bits), the attacker will need too much CPU to generate the dataset (compute N keys with the same hash value). Basically, it slows down the attack by a factor of 2^(size of the salt).
-- Another possibility would be to replace our fast hash function by a better hash function like MD5 or SHA1 (so the creation of the dataset would be too slow in practice = too expensive), but cryptographic hash functions are much slower (and so would slow down Python too much). Limiting the size of the POST data doesn't solve the problem because there are many other input vectors and data formats. It may block the most simple attacks because the attacker cannot inject enough keys to slow down your CPU. Victor
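The O(n) vs O(n^2) dictionary-construction behaviour described above is easy to demonstrate without crafted attack strings, by forcing collisions with a constant __hash__. Illustration only — a real attack achieves the same effect with plain strings whose hashes collide:

```python
import time

class Collider:
    """Key whose hash is constant, so every dict insert collides and
    has to probe through all previously inserted keys."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return self.n == other.n

def build(n):
    """Time the construction of a dict with n all-colliding keys."""
    start = time.perf_counter()
    d = {Collider(i): i for i in range(n)}
    return time.perf_counter() - start

# Doubling n roughly quadruples the build time -- O(n**2) instead of
# the amortized O(n) you get with well-distributed hashes.
print(build(1000), build(2000))
```

With ordinary integer or string keys the same loop runs in amortized linear time; the constant hash is what collapses the table into one long probe chain.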

On 31.12.2011 03:22, Victor Stinner wrote:
Correct. The meet-in-the-middle attack and the unique properties of algorithms similar to DJBX33A and DJBX33X make the attack easy on platforms with a 32-bit hash.
A BTree is too slow for common operations; it's O(log n) instead of O(1) on average. We can't replace our dict with a btree type, and a new btree type is a lot of work, too. The unique structure of CPython's dict implementation makes it harder to get the number of values with an equal hash. The academic hash map (the one I learnt about at university) uses a bucket to store all elements with an equal hash (more precisely: hash mod mask). Python's dict, however, perturbs the hash until it finds a free slot in its array. The second, third ... collision can be caused by a legit and completely different (!) hash.
That's the idea of randomized hash functions as implemented by Ruby 1.8, Perl and others. The random seed is used as the IV, and multiple rounds of multiply, XOR and MOD (with integer overflow) spread its effect across the whole hash. In your other posting you were worried about the performance implications, but a randomized hash function just adds a single ADD operation, that's all.

Downside: with randomization, all hashes are unpredictable and change after every restart of the interpreter. This has some subtle side effects, like a different ordering of {a:1, b:1, c:1}.keys() after a restart of the interpreter.
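A seeded DJBX33A-style hash can be sketched as follows (a pure-Python illustration, not CPython's actual implementation; the 32-bit truncation and the per-process seed-as-IV are the essential points):

```python
import os

def djbx33a(data: bytes, seed: int) -> int:
    """Sketch of a seeded DJBX33A-style hash (h = h*33 + byte),
    truncated to 32 bits. The per-process seed is the IV."""
    h = seed & 0xFFFFFFFF
    for b in data:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h

# One random seed per interpreter process.
seed = int.from_bytes(os.urandom(4), "little")

# Same input and same seed give the same hash within one process...
assert djbx33a(b"spam", seed) == djbx33a(b"spam", seed)
# ...but a different seed (e.g. after an interpreter restart) gives a
# different value, so a precomputed colliding dataset stops working.
assert djbx33a(b"spam", seed) != djbx33a(b"spam", seed ^ 0x5A5A5A5A)
```

The second assertion holds deterministically here because the seed's contribution is multiplied by an odd constant modulo 2^32, which is a bijection, so distinct seeds always yield distinct hashes for the same input.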
I agree with your analysis. Cryptographic hash functions are far too slow for our use case. During my research I found another hash function that claims to be fast and that may not be vulnerable to this kind of attack: http://isthe.com/chongo/tech/comp/fnv/ Christian

Christian Heimes, 31.12.2011 04:59:
Wouldn't Bob Jenkins' "lookup3" hash function fit in here? After all, it's portable, known to provide a very good distribution for different string values and is generally fast on both 32 and 64 bit architectures. http://burtleburtle.net/bob/c/lookup3.c The analysis is here: http://burtleburtle.net/bob/hash/doobs.html It seems that there's also support for generating 64bit hash values (actually 2x32bits) efficiently. Admittedly, this may require some adaptation for the PEP393 unicode memory layout in order to produce identical hashes for all three representations if they represent the same content. So it's not a drop-in replacement. Stefan

Am 07.01.2012 12:02, schrieb Stefan Behnel:
This thread, as well as the ticket, is getting so long that people barely have a chance to catch up ... Guido has stated that he doesn't want a completely new hash algorithm for Python 2.x through 3.2, and a new hash algorithm for 3.3 needs a PEP, too.

I've done some experiments with FNV and Murmur3. With Murmur3 128bit I've seen some minor speed improvements on 64-bit platforms. At first I was surprised, but it makes sense: Murmur3 operates on uint32_t blocks, while Python's hash algorithm iterates over 1-byte (bytes, ASCII), 2-byte (UCS2) or 4-byte (UCS4) units. Since most strings are either ASCII or UCS2, the inner loop of the current algorithm is tighter.
Is this condition required and implemented at the moment? Christian

On 1/7/2012 12:57 PM, Christian Heimes wrote:
Am 07.01.2012 12:02, schrieb Stefan Behnel:
That o1 == o2 implies hash(o1) == hash(o2) is an unstated requirement implied by "They [hash values] are used to quickly compare dictionary keys during a dictionary lookup": hash(o1) != hash(o2) is taken to mean o1 != o2, whereas hash(o1) == hash(o2) is taken to mean that o1 == o2 is possible but must be checked. Hashing should be a coarsening of == as an equivalence relation. -- Terry Jan Reedy
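The invariant Terry describes is what any user-defined hashable type must preserve: derive `__hash__` from the same fields as `__eq__`. A minimal illustration (the `Point` class is hypothetical, not from the thread):

```python
class Point:
    """A hashable value type. __eq__ and __hash__ are derived from the
    same fields, so p == q guarantees hash(p) == hash(q)."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        return hash((self.x, self.y))

p, q = Point(1, 2), Point(1, 2)
assert p == q and hash(p) == hash(q)   # equal objects, equal hashes
assert {p: "a"}[q] == "a"              # so dict lookup works across equal objects
```

Breaking the invariant (e.g. hashing only `x` while comparing both fields) would still satisfy the rule, since it only coarsens equality; hashing a field that `__eq__` ignores is what silently breaks dict lookups.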

Victor Stinner writes:
I don't know the implementation issues well enough to claim it is a solution, but this hasn't been mentioned before AFAICS: While the dictionary probe has to start with a hash for backward compatibility reasons, is there a reason the overflow strategy for insertion has to be buckets containing lists? How about double-hashing, etc?

Am 31.12.2011 13:03, schrieb Stephen J. Turnbull:
Python's dict implementation doesn't use buckets but open addressing (a.k.a. a closed hash table). The conflict-resolution algorithm doesn't use double hashing either. Instead it takes the original (and in most cases cached) hash and perturbs it with a series of add, multiply and bit-shift operations.
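The perturbation Christian mentions can be sketched in Python (a simplified model of the probe loop documented in CPython's dictobject.c, not the C code itself):

```python
def probe_sequence(h, mask, count):
    """Sketch of CPython's open-addressing probe order:
    slot = (5*slot + 1 + perturb) & mask, with perturb >>= 5 each step,
    so the high bits of the hash eventually influence the slot choice."""
    PERTURB_SHIFT = 5
    perturb = h
    i = h & mask
    seq = [i]
    for _ in range(count - 1):
        i = (5 * i + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
        seq.append(i)
    return seq

mask = 7   # toy table of 8 slots
# Two hashes that land in the same initial slot (17 & 7 == 33 & 7 == 1)
# diverge once perturb mixes in their differing high bits.
s1 = probe_sequence(17, mask, 4)
s2 = probe_sequence(33, mask, 4)
assert s1[0] == s2[0] == 1
assert s1 != s2
```

This also illustrates Christian's earlier point about buckets: a slot visited while probing for one hash can be legitimately occupied by a key with a completely different hash, so "number of keys per hash" is not directly observable in the table.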

Christian Heimes writes:
In an attack, this is still O(collisions) per probe (as is any scheme where the address of the nth collision is a function of only the hash), whereas double hashing should be "roughly" O(1) (with double the coefficient). But that evidently imposes too large a performance burden on non-evil users, so it's not worth thinking about whether "roughly O(1)" is close enough to O(1) to deter or exhaust attackers. I withdraw the suggestion.

On Sat, Dec 31, 2011 at 7:03 AM, Stephen J. Turnbull <stephen@xemacs.org>wrote:
This won't help, because the keys still have the same hash value. ANYTHING you do to them after they're generated will result in them still colliding. The *only* thing that works is to change the hash function in such a way that the strings end up with different hashes in the first place. Otherwise, you'll still end up with (deliberate) collisions. (Well, technically, you could use trees or some other O log n data structure as a fallback once you have too many collisions, for some value of "too many". Seems a bit wasteful for the purpose, though.)

I don't think that would be wasteful. You wouldn't use the tree only for the case of too many collisions, but for any collision; you might special-case the single-key situation, i.e. start using the tree only once there is a collision. The issue is not the effort, but the need to support ordering if you want to use trees. So you might restrict this to dicts that have only str keys (which in practice should never see a collision, unless it's a deliberate setup). I'd use the tagged-pointer trick to determine whether a key slot holds an object pointer (tag 0) or an AVL tree (tag 1). So in the common case of interned strings, the comparison for pointer equality will succeed quickly; if pointer comparison fails, check the tag bit. Regards, Martin
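Martin's scheme can be sketched at a high level in Python. This toy `CollisionSafeMap` is an illustration only: it keeps one ordered structure per hash value (a sorted list stands in for the AVL tree, and relies on str keys being totally ordered), so n colliding keys cost O(log n) per lookup instead of O(n):

```python
import bisect

class CollisionSafeMap:
    """Toy sketch of a map that keeps all keys sharing a hash in a
    sorted structure, so deliberate collisions cannot force O(n)
    lookups. Restricted to str keys, which are totally ordered."""
    def __init__(self):
        self._buckets = {}          # hash -> sorted list of (key, value)

    def __setitem__(self, key, value):
        bucket = self._buckets.setdefault(hash(key), [])
        i = bisect.bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            bucket[i] = (key, value)      # replace existing key
        else:
            bucket.insert(i, (key, value))

    def __getitem__(self, key):
        bucket = self._buckets[hash(key)]
        i = bisect.bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            return bucket[i][1]
        raise KeyError(key)

m = CollisionSafeMap()
m["spam"] = 1
m["eggs"] = 2
assert m["spam"] == 1 and m["eggs"] == 2
```

A real implementation would use a balanced tree (insertion into a sorted list is still O(n)) and, as Martin suggests, only fall back to the tree once a slot actually collides, keeping the fast pointer-equality path for the common case.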
participants (43)

- Alex Gaynor
- Alexey Borzenkov
- Anders J. Munch
- Andrew Bennetts
- Antoine Pitrou
- Armin Ronacher
- Barry Warsaw
- Benjamin Peterson
- Bill Janssen
- Brian Curtin
- Christian Heimes
- David Malcolm
- Eric Snow
- Ethan Furman
- Fred Drake
- Georg Brandl
- geremy condra
- Glenn Linderman
- Guido van Rossum
- Hynek Schlawack
- Jeffrey Yasskin
- Jesse Noller
- M.-A. Lemburg
- Maciej Fijalkowski
- Mark Shannon
- martin@v.loewis.de
- Michael Foord
- Ned Batchelder
- Nick Coghlan
- Paul McMillan
- Paul Moore
- PJ Eby
- Raymond Hettinger
- Serhiy Storchaka
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Delaney
- Toshio Kuratomi
- Tres Seaver
- Victor Stinner
- Victor Stinner