Algorithmic Complexity Attack on Python
Hello. We have analyzed this software to determine its vulnerability to a new class of DoS attacks related to a recent paper, ''Denial of Service via Algorithmic Complexity Attacks.'' This paper discusses a new class of denial of service attacks that work by exploiting the difference between average-case and worst-case performance. In an adversarial environment, the data structures used by an application may be forced into their worst-case behavior. For instance, hash tables are usually thought of as offering constant-time operations, but with large numbers of collisions they degrade to linked lists, which may mean a 100-10,000 times performance degradation. Because of the widespread use of hash tables, the potential for attack is extremely widespread. Fortunately, in many cases, other limits on the system reduce the impact of these attacks.

To be attackable, an application must have a deterministic or predictable hash function and accept untrusted input. In general, for the attack to be significant, the application must be willing and able to accept hundreds to tens of thousands of 'attack inputs'. Because of that requirement, it is difficult to judge the impact of these attacks without knowing the source code extremely well, and knowing all the ways in which a program is used.

As part of this project, I have examined Python 2.3b1, and the hash function 'string_hash' is deterministic. Thus any script that hashes untrusted input may be vulnerable to our attack. Furthermore, the structure of the hash function allows our fast collision generation algorithm to work. This means that any script written in Python that hashes a large number of keys from an untrusted source is potentially subject to a severe performance degradation. Depending on the application or script, this could be a critical DoS.

The solution for these attacks on hash tables is to make the hash function unpredictable via a technique known as universal hashing. Universal hashing is a keyed hash function where, based on the key, one of a large set of hash functions is chosen. When benchmarking, we observe that for short or medium length inputs, it is comparable in performance to simple predictable hash functions such as the ones in Python or Perl. Our paper has graphs and charts of our benchmarked performance.

I highly advise using a universal hashing library, either our own or someone else's. As is historically seen, it is very easy to make silly mistakes when attempting to implement your own 'secure' algorithm. The abstract, paper, and a library implementing universal hashing are available at http://www.cs.rice.edu/~scrosby/hash/.

Scott
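P.S. To make the universal hashing idea concrete, here is a minimal sketch of one classical almost-universal construction (polynomial evaluation at a secret random point modulo a prime), keyed once per process. It is illustrative only, written for this note; it is not the UHASH code that the paper benchmarks, and for production use I still advise a vetted library, for the reasons above.

import random

_P = 2**31 - 1                    # a Mersenne prime
_K = random.randint(1, _P - 1)    # secret key, chosen once per process

def keyed_string_hash(s):
    # Evaluate the string's bytes as a polynomial at the secret point _K.
    # Two distinct strings of length <= n collide with probability at most
    # about n/_P over the random choice of _K.
    h = 0
    for ch in s:
        h = (h * _K + ord(ch)) % _P
    return h

An attacker who does not know _K cannot precompute colliding inputs, which is exactly the property the deterministic string_hash lacks.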
For instance, hash tables are usually thought of as offering constant-time operations, but with large numbers of collisions they degrade to linked lists, which may mean a 100-10,000 times performance degradation.
True enough. And it's not hard to create tons of keys that will collide (Uncle Tim even gives an example in the source for those who care to read). Going from O(1) to O(n) for each insertion would be a bit painful during the process of building up a large dictionary. So, did your research show a prevalence of or even existence of online applications that allow someone to submit high volumes of meaningless keys to be saved in a hash table? Raymond Hettinger
On Thu, 29 May 2003 16:55:35 -0400, "Raymond Hettinger" <python@rcn.com> writes:
So, did your research show a prevalence of or even existence of online applications that allow someone to submit high volumes of meaningless keys to be saved in a hash table?
I am not a Python guru, and we weren't looking for specific applications, so I wouldn't know. Scott
Scott,

I just took a minute to look at this. I downloaded the python-attack file from your Web site. I loaded all the strings and then inserted them into a dictionary. I also generated a list of 10,000 random strings and inserted them into a dictionary. The script is below. The results show that inserting the python-attack strings is about 4 times slower than inserting random strings.

slothrop:~/src/python/dist/src/build> ./python ~/attack.py ~/python-attack
time 0.0898009538651
size 10000
slothrop:~/src/python/dist/src/build> ./python ~/attack.py ~/simple
time 0.0229719877243
size 10000

Jeremy

import time

def main(path):
    L = [l.strip() for l in open(path)]
    d = {}
    t0 = time.time()
    for k in L:
        d[k] = 1
    t1 = time.time()
    print "time", t1 - t0
    print "size", len(d)

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
On 29 May 2003 17:01:19 -0400, Jeremy Hylton <jeremy@zope.com> writes:
Scott,
I just took a minute to look at this. I downloaded the python-attack file from your Web site. I loaded all the strings and then inserted them into a dictionary. I also generated a list of 10,000 random strings and inserted them into a dictionary.
Ok. It should have taken almost a minute instead of .08 seconds in the attack version. My file is broken. I'll be constructing a new one later this evening. If you test perl with the perl files, you'll see what should have occurred in this case.
The script is below.
Thank you. Scott
[Jeremy Hylton]
I just took a minute to look at this. I downloaded the python-attack file from your Web site. I loaded all the strings and then inserted them into a dictionary. I also generated a list of 10,000 random strings and inserted them into a dictionary.
[Scott Crosby]
Ok. It should have taken almost a minute instead of .08 seconds in the attack version. My file is broken. I'll be constructing a new one later this evening. If you test perl with the perl files, you'll see what should have occurred in this case.
Note that the 10,000 strings in the file map to 400 distinct 32-bit hash codes under Python's hash. It's not enough to provoke worst-case behavior in Python just to collide on the low-order bits: all 32 bits contribute to the probe sequence (just colliding on the initial bucket slot doesn't have much effect). As is, it's effectively creating 400 collision chains, ranging in length from 7 to 252, with a mean length of 25 and a median of 16.
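A throwaway script along these lines is enough to see that clustering (a sketch only, assuming a file with one key per line, like the python-attack file):

import sys

def chain_stats(path):
    # Group the keys by their full hash code and look at the sizes of the
    # resulting collision chains.
    chains = {}
    for line in open(path):
        key = line.strip()
        chains.setdefault(hash(key), []).append(key)
    lengths = [len(keys) for keys in chains.values()]
    lengths.sort()
    print "distinct hash codes:", len(chains)
    print "longest chain:", lengths[-1]
    print "median chain:", lengths[len(lengths) // 2]

if __name__ == "__main__":
    chain_stats(sys.argv[1])

A genuinely worst-case file would report a single hash code with one chain of length 10,000.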
On Thu, 29 May 2003 18:31:56 -0400, "Tim Peters" <tim@zope.com> writes:
[Jeremy Hylton]
I just took a minute to look at this. I downloaded the python-attack file from your Web site. I loaded all the strings and then inserted them into a dictionary. I also generated a list of 10,000 random strings and inserted them into a dictionary.
[Scott Crosby]
Ok. It should have taken almost a minute instead of .08 seconds in the attack version. My file is broken. I'll be constructing a new one later this evening. If you test perl with the perl files, you'll see what should have occurred in this case.
Note that the 10,000 strings in the file map to 400 distinct 32-bit hash codes under Python's hash. It's not enough to provoke worst-case behavior
Yes. Jeremy has made me aware of this. I appear to have made a mistake when inserting Python's hash code into my program that finds generators. The fact that I find so many collisions, but don't have everything colliding, indicates what the problem is.
in Python just to collide on the low-order bits: all 32 bits contribute to the probe sequence (just colliding on the initial bucket slot doesn't have much effect).
Correct. My program, as per the paper, generates full 32 bit hash collisions. It doesn't generate bucket collisions.
As is, it's effectively creating 400 collision chains, ranging in length from 7 to 252, with a mean length of 25 and a median of 16.
Yes. I don't know Python, so I had no test program to verify that my attack file was correct. It's not. Now that I have one, I'll quash the bug and release a new file later this evening. Scott
I've got one meta-comment here: [Scott A Crosby]
Hello. We have analyzed this software to determine its vulnerability to a new class of DoS attacks related to a recent paper, ''Denial of Service via Algorithmic Complexity Attacks.''
I don't think this is new. For example, a much simpler kind of attack is to exploit the way backtracking regexp engines work -- it's easy to find regexp + target_string combos that take time exponential in the sum of the lengths of the input strings. It's not so easy to recognize such a pair when it's handed to you.

In Python, exploiting unbounded-int arithmetic is another way to soak up eons of CPU with few characters, e.g.

    10**10**10

will suck up all your CPU *and* all your RAM. Another easy way is to study a system's C qsort() implementation, and provoke it into quadratic-time behavior (BTW, McIlroy wrote a cool paper on this in '98: http://www.cs.dartmouth.edu/~doug/mdmspe.pdf ).

I'm uninterested in trying to "do something" about these. If resource-hogging is a serious potential problem in some context, then resource limitation is an operating system's job, and any use of Python (or Perl, etc) in such a context should be under the watchful eyes of OS subsystems that track actual resource usage.
On Thu, 29 May 2003 21:54:20 -0400, Tim Peters <tim.one@comcast.net> writes:
I've got one meta-comment here:
[Scott A Crosby]
Hello. We have analyzed this software to determine its vulnerability to a new class of DoS attacks related to a recent paper, ''Denial of Service via Algorithmic Complexity Attacks.''
I don't think this is new. For example, a much simpler kind of attack is to exploit the way backtracking regexp engines work -- it's easy to find regexp + target_string combos that take time exponential in the sum of the lengths of the input strings. It's not so easy to recognize such a pair when it's handed to you. In Python, exploiting unbounded-int arithmetic is another way to soak up eons of CPU with few characters, e.g.
10**10**10
These ways require that I be able to feed a program, an expression, or a regular expression into the victim's python interpreter. The attack I discuss only requires that it hash some arbitrary input supplied by the attacker, so it applies in many more cases.
will suck up all your CPU *and* all your RAM. Another easy way is to study a system's C qsort() implementation, and provoke it into quadratic-time behavior (BTW, McIlroy wrote a cool paper on this in '98:
This is a very cool paper in exactly the same vein as ours. Thanks.
I'm uninterested in trying to "do something" about these. If resource-hogging is a serious potential problem in some context, then resource limitation is an operating system's job, and any use of Python (or Perl, etc) in such a context should be under the watchful eyes of OS subsystems that track actual resource usage.
I disagree. Changing the hash function eliminates these attacks on hash tables. Scott
[Scott A Crosby]
These ways require that I be able to feed a program, an expression, or a regular expression into the victim's python interpreter.
I think you underestimate the perversity of regular expressions in particular. Many programs use (fixed) regexps to parse input, and it's often possible to construct inputs that take horribly long times to match, or, especially, to fail to match. Blowing the hardware stack (provoking a memory fault) is also usually easy. The art of writing robust regexps for use with a backtracking engine is obscure and difficult (see Friedl's "Mastering Regular Expressions" (O'Reilly) for a practical intro to the topic), and regexps are ubiquitous now.
The attack I discuss only requires that it hash some arbitrary input supplied by the attacker, so it applies in many more cases.
While a regexp attack only requires that the program parse one user-supplied string <wink>
This is a very cool paper in exactly the same vein as ours. Thanks.
It is a cool paper, and you're welcome. I don't think it's in the same vein, though -- McIlroy presented it as an interesting discovery, not as "a reason" for people to get agitated about programs using quicksort. The most likely reason you didn't find references to it before is because nobody in real life cares much about this attack possibility.
I'm uninterested in trying to "do something" about these. If resource-hogging is a serious potential problem in some context, then resource limitation is an operating system's job, and any use of Python (or Perl, etc) in such a context should be under the watchful eyes of OS subsystems that track actual resource usage.
I disagree. Changing the hash function eliminates these attacks on hash tables.
It depends on how much access an attacker has, and, as you said before, you're not aware of any specific application in Python that *can* be attacked this way. I'm not either. In any case, the universe of resource attacks is much larger than just picking on hash functions, so plugging a hole in those alone wouldn't do anything to ease my fears -- provided I had such fears, which is admittedly a stretch <wink>. If I did have such fears, I'd want the OS to alleviate them all at once.
[Tim Peters]
I'm uninterested in trying to "do something" about these. If resource-hogging is a serious potential problem in some context, then resource limitation is an operating system's job, and any use of Python (or Perl, etc) in such a context should be under the watchful eyes of OS subsystems that track actual resource usage.
[Scott Crosby]
I disagree. Changing the hash function eliminates these attacks on hash tables.
At what cost for Python? 99.99% of all Python programs are not vulnerable to this kind of attack, because they don't take huge amounts of arbitrary input from an untrusted source. If the hash function you propose is even a *teensy* bit slower than the one we've got now (and from your description I'm sure it has to be), everybody would be paying for the solution to a problem they don't have.

You keep insisting that you don't know Python. Hashing is used an awful lot in Python -- as an interpreted language, most variable lookups and all method and instance variable lookups use hashing. So this would affect every Python program.

Scott, we thank you for pointing out the issue, but I think you'll be wearing out your welcome here quickly if you keep insisting that we do things your way based on the evidence you've produced so far.

--Guido van Rossum (home page: http://www.python.org/~guido/)
On Fri, 30 May 2003 07:39:18 -0400, Guido van Rossum <guido@python.org> writes:
[Tim Peters]
I'm uninterested in trying to "do something" about these. If resource-hogging is a serious potential problem in some context, then resource limitation is an operating system's job, and any use of Python (or Perl, etc) in such a context should be under the watchful eyes of OS subsystems that track actual resource usage.
[Scott Crosby]
I disagree. Changing the hash function eliminates these attacks on hash tables.
At what cost for Python? 99.99% of all Python programs are not vulnerable to this kind of attack, because they don't take huge amounts of arbitrary input from an untrusted source. If the hash function you propose is even a *teensy* bit slower than the one we've got now (and from your description I'm sure it has to be), everybody
We included several benchmarks in our paper.

On http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/CacheEffect... where we compare hash functions as the working set changes, we notice that a single L2 cache miss far exceeds hashing time for all algorithms.

On http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/LengthEffec... UHASH exceeds the performance of perl's hash function, which is simpler than your own. Even for small strings, UHASH is only about half the speed of perl's hash function, and your function already performs a multiplication per byte:

#define HASH(hi,ho,c)  ho = (1000003*hi) ^ c
#define HASH0(ho,c)    ho = ((c << 7)*1000003) ^ c

The difference between this and CW12 is one 32-bit modulo operation. (Please note that CW12 is currently broken. Fortunately it didn't affect the benchmarking on x86.)
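For reference, as I read those two macros, the loop they define amounts to roughly the following in Python (a sketch only: values are reduced mod 2**32 explicitly to mimic C long overflow, and the final XOR with the string length that the real string_hash also performs is omitted):

def string_hash_sketch(s):
    # HASH0 on the first byte, then HASH folds in each remaining byte.
    if not s:
        return 0
    c = ord(s[0])
    h = (((c << 7) * 1000003) ^ c) & 0xffffffffL
    for ch in s[1:]:
        h = ((1000003 * h) ^ ord(ch)) & 0xffffffffL
    return h

The point is simply that the whole computation is fixed at compile time, so collisions can be precomputed offline.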
would be paying for the solution to a problem they don't have. You keep insisting that you don't know Python. Hashing is used an awful lot in Python -- as an interpreted language, most variable lookups and all method and instance variable lookups use hashing. So this would affect every Python program.
Have you done benchmarking to prove that string_hash is in fact an important hotspot in the python interpreter?

If so, and doing one modulo operation per string is unacceptable, then you may wish to consider Jenkins's hash. The Linux kernel people are switching to using a keyed variant of Jenkins's hash. However, Jenkins's hash, AFAIK, has no proof that it is in fact universal. It is, however, probably safe. It is not unlikely that if you went that route you'd be somewhat safer, and faster, but if you want full safety, you'd need to go with a universal hash.

Scott
On Fri, May 30, 2003 at 11:18:14AM -0500, Scott A Crosby wrote:
On
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/LengthEffec...
UHASH exceeds the performance of perl's hash function, which is simpler than your own.
I notice that you say "with strings longer than around 44-bytes, UHASH dominates all the other hash functions, due in no small part to its extensive performance tuning and *hand-coded assembly routines.*" [emphasis mine]

It's all well and good for people who can run your hand-coded VAX assembly, but when Intel's 80960* comes out and people start running Unix on it, won't they be forced to code your hash function all over again? Since everybody has hopes for Python beyond the VAX (heck, in 20 years VAX might have as little as 5% of the market -- anything could happen) there has been a conscious decision not to hand-code anything in assembly in Python.

Jeff

* The Intel 80960, in case you haven't heard of it, is a superscalar processor that will require highly-tuned compilers and will run like a bat out of hell when the code is tuned right. I think it's capable of one floating-point and two integer instructions per cycle!
On Fri, 30 May 2003 15:00:21 -0500, Jeff Epler <jepler@unpythonic.net> writes:
On Fri, May 30, 2003 at 11:18:14AM -0500, Scott A Crosby wrote:
On
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/LengthEffec...
UHASH exceeds the performance of perl's hash function, which is simpler than your own.
I notice that you say "with strings longer than around 44-bytes, UHASH dominates all the other hash functions, due in no small part to its extensive performance tuning and *hand-coded assembly routines.*" [emphasis mine] It's all well and good for people who can run your
We benchmarked it, and without assembly optimizations, uhash still exceeds perl. Also please note that we did not create uhash. We merely used it as a high performance universal hash which we could cite and benchmark. Freshly computed raw benchmarks on a P2-450 are at the end of this email.

Looking at them now, I think we may have slightly erred and used the non-assembly version of the hash in constructing the graphs, because the crossover point compared to perl looks to be about 20 bytes with assembly, and about 48 without. Roughly, they show that uhash is about half the speed on a P2-450 without assembly. I do not have benchmarks on other platforms to compare it to. However, CW is known to be about 10 times worse, relatively, than Jenkins's hash on a SPARC.

The python community will have to judge whether the performance difference of the current hash is worth the risk of the attack.

Also, I'd like to thank Tim Peters for telling me about the potential of degradation that regular expressions may offer.

Scott

Time benchmarking actual hash (including benchmarking overhead) with a working set size of 12kb.

Time(perl-5.8.0):               12.787 Mbytes/sec with string length 4, buf 12000
Time(uh_cw-1024):                6.010 Mbytes/sec with string length 4, buf 12000
Time(python):                   14.952 Mbytes/sec with string length 4, buf 12000
Time(test32out_uhash):           4.584 Mbytes/sec with string length 4, buf 12000
Time(test32out_assembly_uhash):  6.014 Mbytes/sec with string length 4, buf 12000

Time(perl-5.8.0):               29.125 Mbytes/sec with string length 16, buf 12000
Time(uh_cw-1024):               11.898 Mbytes/sec with string length 16, buf 12000
Time(python):                   36.445 Mbytes/sec with string length 16, buf 12000
Time(test32out_uhash):          19.169 Mbytes/sec with string length 16, buf 12000
Time(test32out_assembly_uhash): 25.660 Mbytes/sec with string length 16, buf 12000

Time(perl-5.8.0):               45.440 Mbytes/sec with string length 64, buf 12000
Time(uh_cw-1024):               16.168 Mbytes/sec with string length 64, buf 12000
Time(python):                   62.213 Mbytes/sec with string length 64, buf 12000
Time(test32out_uhash):          71.396 Mbytes/sec with string length 64, buf 12000
Time(test32out_assembly_uhash): 106.873 Mbytes/sec with string length 64, buf 12000

Time benchmarking actual hash (including benchmarking overhead) with a working set size of 6mb.

Time(perl-5.8.0):                8.099 Mbytes/sec with string length 4, buf 6000000
Time(uh_cw-1024):                4.660 Mbytes/sec with string length 4, buf 6000000
Time(python):                    8.840 Mbytes/sec with string length 4, buf 6000000
Time(test32out_uhash):           3.932 Mbytes/sec with string length 4, buf 6000000
Time(test32out_assembly_uhash):  4.859 Mbytes/sec with string length 4, buf 6000000

Time(perl-5.8.0):               20.878 Mbytes/sec with string length 16, buf 6000000
Time(uh_cw-1024):                9.964 Mbytes/sec with string length 16, buf 6000000
Time(python):                   24.450 Mbytes/sec with string length 16, buf 6000000
Time(test32out_uhash):          16.168 Mbytes/sec with string length 16, buf 6000000
Time(test32out_assembly_uhash): 19.929 Mbytes/sec with string length 16, buf 6000000

Time(perl-5.8.0):               35.265 Mbytes/sec with string length 64, buf 6000000
Time(uh_cw-1024):               14.400 Mbytes/sec with string length 64, buf 6000000
Time(python):                   46.650 Mbytes/sec with string length 64, buf 6000000
Time(test32out_uhash):          48.719 Mbytes/sec with string length 64, buf 6000000
Time(test32out_assembly_uhash): 63.523 Mbytes/sec with string length 64, buf 6000000
At 05:02 PM 5/30/03 -0500, Scott A Crosby wrote:
The python community will have to judge whether the performance difference of the current hash is worth the risk of the attack.
Note that the "community" doesn't really have to judge. An individual developer can, if they have an application they deem vulnerable, do something like this: class SafeString(str): def __hash__(self): # code to return a hash code safe = SafeString(string_from_untrusted_source) and then use only these "safe" strings as keys for a given dictionary. Or, with a little more work, they can subclass 'dict' and make a dictionary that converts its keys to "safe" strings. As far as current vulnerability goes, I'd say that the most commonly available attack point for this would be CGI programs that accept POST operations. A POST can supply an arbitrarily large number of form field keys. If you can show that the Python 'cgi' module is vulnerable to such an attack, in a dramatic disproportion to the size of the data transmitted (since obviously it's as much of a DoS to flood a script with a large quantity of data), then it might be worth making changes to the 'cgi' module, or at least warning the developers of alternatives to CGI (e.g. Zope, Quixote, SkunkWeb, CherryPy, etc.) that alternate hashes might be a good idea. But based on the discussion so far, I'm not sure I see how this attack would produce an effect that was dramatically disproportionate to the amount of data transmitted.
On Sat, 31 May 2003 09:17:16 -0400, "Phillip J. Eby" <pje@telecommunity.com> writes:
At 05:02 PM 5/30/03 -0500, Scott A Crosby wrote:
But based on the discussion so far, I'm not sure I see how this attack would produce an effect that was dramatically disproportionate to the amount of data transmitted.
I apologize for not having this available earlier, but a corrected file of 10,000 inputs is now available and shows the behavior I claimed. (Someone else independently reimplemented the attack and has sent me a corrected set for python.) With 10,000 inputs, python requires 19 seconds to process instead of .2 seconds. A file of half the size requires 4 seconds, showing the quadratic behavior, as with the case of perl. (Benchmarked on a P2-450) I thus predict that twice the inputs would take about 80 seconds. I can only guess what python applications might experience an interesting impact from this, so I'll be silent. However, here are the concrete benchmarks. Scott
On Sat, May 31, 2003 at 10:48:28AM -0500, Scott A Crosby wrote:
On Sat, 31 May 2003 09:17:16 -0400, "Phillip J. Eby" <pje@telecommunity.com> writes:
At 05:02 PM 5/30/03 -0500, Scott A Crosby wrote:
But based on the discussion so far, I'm not sure I see how this attack would produce an effect that was dramatically disproportionate to the amount of data transmitted.
I apologize for not having this available earlier, but a corrected file of 10,000 inputs is now available and shows the behavior I claimed. (Someone else independently reimplemented the attack and has sent me a corrected set for python.) With 10,000 inputs, python requires 19 seconds to process instead of .2 seconds. A file of half the size requires 4 seconds, showing the quadratic behavior, as with the case of perl. (Benchmarked on a P2-450) I thus predict that twice the inputs would take about 80 seconds.
I can only guess what python applications might experience an interesting impact from this, so I'll be silent. However, here are the concrete benchmarks.
Since the cgi module was mentioned earlier as a possible "problem area" for this attack, I wrote a script that demonstrates this, using Scott's list of hash-colliding strings. I do see quadratic growth in runtime. When running the attack on mailman, however, I don't see such a large runtime, and the growth in runtime appears to be linear. This may be because the mailman installation is running on 2.1 (?) and requires a different set of attack strings.

I used the cgi.py "self-test" script (the one you get when you run cgi.py *as* a cgi script) on the CGIHTTPServer.py server, and sent a long URL of the form

    test.cgi?x=1&<colliding key 1>=1&<colliding key 2>=1&...

I looked at the size of the URL, the size of the response, and the time to transfer the response. My system is a mobile Pentium III running at 800MHz, RedHat 9, Python 2.2.2. The mailman testing system is a K6-2 running at 350MHz, RedHat 7.1, Python 2.1.

In the results below, the very fast times and low reply sizes are due to the fact that the execve() call fails for argv+envp>128kb. This limitation might not exist if the CGI was POSTed, or running as fcgi, mod_python, or another system which does not pass the GET form contents in the environment.

Here are the results, for various query sizes:

########################################################################
# Output 1: Running attack in listing 1 on cgi.py

# Parameters in query: 0
Length of URL: 40
Length of contents: 2905
Time for request: 0.537268042564

# Parameters in query: 1
Length of URL: 64
Length of contents: 3001
Time for request: 0.14549601078

# Parameters in query: 10
Length of URL: 307
Length of contents: 5537
Time for request: 0.151428103447

# Parameters in query: 100
Length of URL: 2737
Length of contents: 31817
Time for request: 0.222425937653

# Parameters in query: 1000
Length of URL: 27037
Length of contents: 294617
Time for request: 4.47611808777

# Parameters in query: 2000
Length of URL: 54037
Length of contents: 586617
Time for request: 18.8749380112

# Parameters in query: 4800
Length of URL: 129637
Length of contents: 1404217
Time for request: 106.951847911

# Parameters in query: 5000
Length of URL: 135037
Length of contents: 115
Time for request: 0.516644954681

# Parameters in query: 10000
Length of URL: 270037
Length of contents: 115
Time for request: 1.01809692383

When I attempted to run the attack against Apache 1.3/Mailman, any moderately-long GET requests provoked an Apache error message.

########################################################################
# Listing 1: test_cgi.py
import urllib, time

def time_url(url):
    t = time.time()
    u = urllib.urlopen(url)
    contents = u.read()
    t1 = time.time()
    print "Length of URL:", len(url)
    print "Length of contents:", len(contents)
    print contents[:200]
    print "Time for request:", t1-t
    print

#URL="http://www.example.com/mailman/subscribe/test"
URL="http://localhost:8000/cgi-bin/test.cgi"

items = [line.strip() for line in open("python-attack").readlines()]

for i in (0, 1, 10, 100, 1000, 2000, 4800, 5000, 10000):
    print "# Parameters in query:", i
    url = URL+"?x"
    url = url + "=1&".join(items[:i])
    time_url(url)
########################################################################

I re-wrote the script to use POST instead of GET, and again ran it on cgi.py and mailman. For some reason, using 0 or 1 items against CGIHTTPServer.py seemed to hang.
########################################################################
# Output 2: Running attack in listing 2 on cgi.py

# Parameters in query: 10
Length of URL: 38
Length of data: 272
Length of contents: 3543
Time for request: 0.314235925674

# Parameters in query: 100
Length of URL: 38
Length of data: 2702
Length of contents: 13894
Time for request: 0.218624949455

# Parameters in query: 1000
Length of URL: 38
Length of data: 27002
Length of contents: 117395
Time for request: 2.20617306232

# Parameters in query: 2000
Length of URL: 38
Length of data: 54002
Length of contents: 232395
Time for request: 9.92248606682

# Parameters in query: 5000
Length of URL: 38
Length of data: 135002
Length of contents: 577396
Time for request: 57.3930220604

# Parameters in query: 10000
Length of URL: 38
Length of data: 270002
Length of contents: 1152396
Time for request: 238.318212986

########################################################################
# Output 3: Running attack in listing 2 on mailman

# Parameters in query: 10
Length of URL: 44
Length of data: 272
Length of contents: 852
Time for request: 0.938691973686

# Parameters in query: 100
Length of URL: 44
Length of data: 2702
Length of contents: 852
Time for request: 0.819067001343

# Parameters in query: 1000
Length of URL: 44
Length of data: 27002
Length of contents: 852
Time for request: 1.13541901112

# Parameters in query: 2000
Length of URL: 44
Length of data: 54002
Length of contents: 852
Time for request: 1.59714698792

# Parameters in query: 5000
Length of URL: 44
Length of data: 135002
Length of contents: 852
Time for request: 3.12452697754

# Parameters in query: 10000
Length of URL: 44
Length of data: 270002
Length of contents: 852
Time for request: 5.72900700569

########################################################################
# Listing 2: attack program using POST for longer URLs
import urllib2, time

def time_url(url, data):
    t = time.time()
    u = urllib2.urlopen(url, data)
    contents = u.read()
    t1 = time.time()
    print "Length of URL:", len(url)
    print "Length of data:", len(data)
    print "Length of contents:", len(contents)
    print "Time for request:", t1-t
    print

#URL="http://www.example.com/mailman/subscribe/test"
URL="http://localhost:8000/cgi-bin/test.cgi"

items = [line.strip() for line in open("python-attack").readlines()]

for i in (10, 100, 1000, 2000, 5000, 10000):
    print "# Parameters in query:", i
    data = "x" + "=1&".join(items[:i]) + "\r\n\r\n"
    time_url(URL, data)
[Phillip J. Eby]
... or at least warning the developers of alternatives to CGI (e.g. Zope, Quixote, SkunkWeb, CherryPy, etc.) that alternate hashes might be a good idea.
Don't know about SkunkWeb or CherryPy etc, but Zope and Quixote apps can use ZODB's BTrees for mappings. Insertion and lookup in a BTree have worst-case log-time behavior, and no "bad" sets of keys exist for them. The Buckets forming the leaves of BTrees are vulnerable, though: provoking quadratic-time behavior in a Bucket only requires inserting keys in reverse-sorted order, and sometimes apps use Buckets directly when they should be using BTrees. Using a data structure appropriate for the job at hand is usually a good idea <wink>.
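For anyone who hasn't tried them, the BTree types have a dict-like feel. A small usage sketch (assuming the ZODB BTrees package is importable; OOBTree maps arbitrary ordered keys to values):

from BTrees.OOBTree import OOBTree

t = OOBTree()
for key in ["alpha", "bravo", "charlie"]:
    t[key] = 1                    # O(log n) insertion, no hashing involved

print "bravo" in t                # O(log n) lookup
print list(t.keys("b", "c"))      # range search by key, which a dict can't do

No set of adversarial keys changes those bounds, which is the point here.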
[Scott Crosby]
... Also, I'd like to thank Tim Peters for telling me about the potential of degradation that regular expressions may offer.
I'm acutely aware of that one because it burns people regularly. These aren't cases of hostile input, they're cases of innocently "erroneous" input. After maybe a year of experience, people using a backtracking regexp engine usually figure out how to write a regexp that doesn't go resource-crazy when parsing strings that *do* match. Those are the inputs the program expects. But all inputs can suffer errors, and a regexp that works well when the input matches can still go nuts trying to match a non-matching string, consuming an exponential amount of time trying an exponential number of futile backtracking possibilities.

Here's an unrealistic but tiny example, to get the flavor across:

"""
import re
pat = re.compile('(x+x+)+y')

from time import clock as now

for n in range(10, 100):
    print n,

    start = now()
    pat.search('x' * n + 'y')
    print now() - start,

    start = now()
    pat.search('x' * n)
    print now() - start
"""

The fixed regexp here is

    (x+x+)+y

and we search strings of the form

    xxx...xxxy   which do match
    xxx...xxx    which don't match

The matching cases take time linear in the length of the string, but it's so fast it's hard to see the time going up at all until the string gets very large. The failing cases take time exponential in the length of the string. Here's sample output:

10 0.000155885951826 0.00068891533549
11 1.59238337887e-005 0.0013736401884
12 1.76000268191e-005 0.00268777552423
13 2.43047989406e-005 0.00609379976198
14 2.51428954558e-005 0.0109438642954
15 3.4361957123e-005 0.0219815954005
16 3.10095710622e-005 0.0673058549423
17 3.26857640926e-005 0.108308050755
18 3.35238606078e-005 0.251965336328
19 3.68762466686e-005 0.334131480581
20 3.68762466685e-005 0.671073936875
21 3.60381501534e-005 1.33723327578
22 3.60381501534e-005 2.68076149449
23 3.6038150153e-005 5.37420757974
24 3.6038150153e-005 10.7601803584
25 3.52000536381e-005

I killed the program then, as I didn't want to wait 20+ seconds for the 25-character string to fail to match.

The horrid problem here is that it takes a highly educated eye to look at that regexp and see in advance that it's going to have "fast when it matches, possibly horrid when it doesn't match" behavior -- and this is a dead easy case to analyze. In a regexp that slobbers on across multiple lines, with 5 levels of group nesting, my guess is that no more than 1 programmer in 1000 has even a vague idea how to start looking for such problems.
We benchmarked it, and without assembly optimizations, uhash still exceeds perl.
It's not just the speed of the hashing itself that you have to consider, but also how it interacts with the rest of the dictionary implementation, and how it behaves under access patterns found in a typical Python program. The core Python developers have put a LOT of effort over the years into tuning the Python dict implementation specially for Python. It's not reasonable to expect them to suddenly throw all that away and replace it with something that's completely untried in this context!

Greg Ewing, Computer Science Dept,  +--------------------------------------+
University of Canterbury,           | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand           | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz          +--------------------------------------+
[Scott Crosby]
... Have you done benchmarking to prove that string_hash is in fact an important hotspot in the python interpreter?
It depends on the specific application, of course. The overall speed of dicts is crucial, but in many "namespace" dict apps the strings are interned, implying (among other things) that a hash code is computed only once per string. In those apps the speed of the string hash function isn't so important. Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.

OTOH, it's just a couple lines of C that run fine on everything from Palm Pilots to Crays, and have created no portability or maintenance headaches. Browsing your site, you've got over 100KB of snaky C code to implement hashing functions, some with known bugs, others with cautions about open questions wrt platforms with different endianness and word sizes than the code was initially tested on. Compared to what Python is using now, that's a maintenance nightmare.

Note that Python's hash API doesn't return 32 bits, it returns a hash code of the same size as the native C long. The multiplication gimmick doesn't require any pain to do that.

Other points that arise in practical deployment:

+ Python dicts can be indexed by many kinds of immutable objects. Strings are just one kind of key, and Python has many hash functions.

+ If I understand what you're selling, the hash code of a given string will almost certainly change across program runs. That's a very visible change in semantics, since hash() is a builtin Python function available to user code. Some programs use hash codes to index into persistent (file- or database- based) data structures, and such code would plain break if the hash code of a string changed from one run to the next. I expect the user-visible hash() would have to continue using a predictable function.

+ Some Python apps run for months, and universal hashing doesn't remove the possibility of quadratic-time behavior. If I can poke at a long-running app and observe its behavior, over time I can deduce a set of keys that collide badly for any hashing scheme fixed when the program starts. In that sense I don't believe this gimmick wholly plugs the hole it's trying to plug.
If so, and doing one modulo operation per string is unacceptable,
If it's mod by a prime, probably. Some architectures Python runs on require hundreds of cycles to do an integer mod, and we don't have the resources to construct custom mod-by-an-int shortcut code for dozens of obscure architectures.
then you may wish to consider Jenkins's hash. The Linux kernel people are switching to using a keyed variant of Jenkins's hash. However, Jenkins's hash, AFAIK, has no proof that it is in fact universal. It is, however, probably safe.
Nobody writing a Python program *has* to use a dict. That dicts have quadratic-time worst-case behavior isn't hidden, and there's no cure for that short of switching to a data structure with better worst-case bounds. I certainly agree it's something for programmers to be aware of. I still don't see any reason for the core language to care about this, though.
On Fri, 30 May 2003 19:07:38 -0400, Tim Peters <tim.one@comcast.net> writes:
[Scott Crosby]
... Have you done benchmarking to prove that string_hash is in fact an important hotspot in the python interpreter?
It depends on the specific application, of course. The overall speed of dicts is crucial, but in many "namespace" dict apps the strings are interned, implying (among other things) that a hash code is computed only once per string. In those apps the speed of the string hash function isn't so important. Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.
Actually, at least on x86, it is faster than perl. On other platforms, it may be somewhat slower.
OTOH, it's just a couple lines of C that run fine on everything from Palm Pilots to Crays, and have created no portability or maintenance headaches. Browsing your site, you've got over 100KB of snaky C code to implement hashing functions, some with known bugs, others with cautions about open questions wrt platforms with different endianness and word sizes than the code was initially tested on. Compared to what Python is using now, that's a maintenance nightmare.
Yes, I am aware of the problems with the UHASH code. Unfortunately, I am not a hash function designer, that code is not mine, and I only use it as a black box. I also consider all code, until verified otherwise, to potentially suffer from endianness, alignment, and 32/64 bit issues. Excluding alignment issues (where I'm not sure whether it's OK to fail on strange alignments or not), it has passed *my* self-tests on big-endian and 64-bit platforms.
+ Python dicts can be indexed by many kinds of immutable objects. Strings are just one kind of key, and Python has many hash functions.
+ If I understand what you're selling, the hash code of a given string will almost certainly change across program runs. That's a very visible change in semantics, since hash() is a builtin Python function available to user code. Some programs use hash codes to index into persistent (file- or database- based) data structures, and such code would plain break if the hash code of a string changed from one run to the next. I expect the user-visible hash() would have to continue using a predictable function.
The hash has to be keyed upon something. It is possible to store the key in a file and reuse the same one across all runs. However, depending on the universal hash function used, leaking pairs of (input, hash(input)) may allow an attacker to determine the secret key and mount the attack again. But yeah, preserving these semantics becomes very messy. The hash key becomes part of the system state that must be migrated along with other data that depends on it.
+ Some Python apps run for months, and universal hashing doesn't remove the possibility of quadratic-time behavior. If I can poke at a long-running app and observe its behavior, over time I can deduce a
I argued on linux-kernel with someone else that this was extremely unlikely. It requires the latency of a collision/non-collision being noticeable over a noisy network stack and system. In almost all cases, for short inputs, the cost of a single L2 cache miss far exceeds that of hashing. A more serious danger is an application that leaks actual hash values.
If it's mod by a prime, probably.
I'd benchmark it in practice; microbenchmarking on a SPARC says that it is rather expensive. However, on x86, the cost of an L2 cache miss exceeds the cost of hashing a small string. You'd have a better idea what impact this might have on the total runtime of the system in the worst case.
then you may wish to consider Jenkins's hash. The Linux kernel people are switching to using a keyed variant of Jenkins's hash. However, Jenkins's hash, AFAIK, has no proof that it is in fact universal. It is, however, probably safe.
Nobody writing a Python program *has* to use a dict. That dicts have quadratic-time worst-case behavior isn't hidden, and there's no cure for
Agreed, many have realized over the years that hash tables can have quadratic behavior in an adversarial environment. It isn't hidden. Cormen, Leiserson, and Rivest even warn about this in their seminal algorithms textbook in 1991. It *is* obvious when thought of, but the reason I was able to ship out so many vulnerability reports yesterday is that few actually *have* thought of that deterministic worst case when writing their programs. I predict this trend will continue.

I like hash tables a lot; with UH, their time bounds are randomized but pretty tight, and their constant factors are far better than those of balanced binary trees.

Scott
[Tim]
... Overall, though, I'd welcome a faster string hash, and I agree that Python's isn't particularly zippy.
[Scott Crosby]
Actually, at least on x86, it is faster than perl. On other platforms, it may be somewhat slower.
Perl's isn't particularly zippy either. I believe that, given heroic coding effort, a good universal hash designed for speed can get under 1 cycle per byte hashed on a modern Pentium. Python's and Perl's string hashes aren't really in the same ballpark.
... Yes, I am aware of the problems with the UHASH code. Unfortunately, I am not a hash function designer, that code is not mine, and I only use it as a black box.
I also consider all code, until verified otherwise, to potentially suffer from endianness, alignment, and 32/64 bit issues. Excluding alignment issues (where I'm not sure whether it's OK to fail on strange alignments or not), it has passed *my* self-tests on big-endian and 64-bit platforms.
Then who's going to vet the code on a Cray T3 (etc, etc, etc, etc)? This isn't a nag, it cuts to the heart of what a language like Python can do: the x-platform behavior of Python's current string hash is easy to understand, relying only on what the C standard demands. It's doing (only) one evil thing, relying on the unspecified (by C) semantics of what happens when a signed long multiply overflows. Python runs on just about every platform on Earth, and that hasn't been a problem so far. If it becomes a problem, we can change the accumulator to unsigned long, and then C would specify what happens. There ends the exhaustive discussion of all portability questions about our current code <wink>.
...
+ Some Python apps run for months, and universal hashing doesn't remove the possibility of quadratic-time behavior. If I can poke at a long-running app and observe its behavior, over time I can deduce a
I argued on linux-kernel with someone else that this was extremely unlikely. It requires the latency of a collision/non-collision being noticeable over a noisy network stack and system.
So you have in mind only apps accessed across a network? And then, for example, discounting the possibility that a bitter sysadmin opens an interactive Python shell on the box, prints out a gazillion (i, hash(i)) pairs and mails them to himself for future mischief?
In almost all cases, for short inputs, the cost of a single L2 cache miss far exceeds that of hashing.
If the user is restricted to short inputs, provoking quadratic-time behavior doesn't matter.
A more serious danger is an application that leaks actual hash values.
So ever-more operations become "dangerous". Programmers will never keep this straight, and Python isn't a straitjacket language. I still vote that apps for which this matters use an *appropriate* data structure instead -- Python isn't an application, it's a language for programming applications.
... Agreed, many have realized over the years that hash tables can have quadratic behavior in an adversarial environment.
People from real-time backgrounds are much more paranoid than that, and perhaps the security-conscious could learn a lot from them. For example, you're not going to find a dynamic hash table in a real-time app, because they can't take a chance on bad luck either. To them, "an adversary" is just an unlucky roll of the dice, and they can't tolerate it.
It isn't hidden. Cormen, Leiserson, and Rivest even warn about this in their seminal algorithms textbook in 1991.
Knuth warned about it a lot earlier than that <wink> -- but see his summary of hashing in TAoCP, Vol 3; it gives strong warnings.
It *is* obvious when thought of, but the reason I was able to ship out so many vulnerability reports yesterday is that few actually *have* thought of that deterministic worst case when writing their programs. I predict this trend will continue.
I appreciate that, and I expect it to continue too. I expect a better solution would be for more languages to offer a choice of containers with different O() behaviors. In C it's hard to make progress because the standard language comes with so little, and so many "portable" C libraries aren't. The C++ world is in better shape, constrained more by the portability of standard C++ itself. There's less excuse for Python or Perl programmers to screw up in these areas, because libraries written in Python and Perl are very portable, and there are a lot of 'em to choose from.
I like hash tables a lot; with UH, their time bounds are randomized but pretty tight, and their constant factors are far better than those of balanced binary trees.
Probably, but have you used a tree implementation into which the same heroic level of analysis and coding effort has been poured? The typical portable-C balanced tree implementation should be viewed as a worst-case bound on how fast balanced trees can actually work. Recently, heroic efforts have been poured into Judy tries, which may be both faster and more memory-efficient than hash tables in many kinds of apps: http://judy.sourceforge.net/ The code for Judy tries makes UHASH look trivial, though. OTOH, provided the Judy code actually works on your box, and there aren't bugs hiding in its thousands of lines of difficult code, relying on a Judy "array" for good worst-case behavior isn't a matter of luck.
+ If I understand what you're selling, the hash code of a given string will almost certainly change across program runs. That's a very visible change in semantics, since hash() is a builtin Python function available to user code. Some programs use hash codes to index into persistent (file- or database- based) data structures, and such code would plain break if the hash code of a string changed from one run to the next. I expect the user-visible hash() would have to continue using a predictable function.
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before). --Guido van Rossum (home page: http://www.python.org/~guido/)
On Fri, May 30, 2003 at 08:41:54PM -0400, Guido van Rossum wrote:
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before).
Is there at least a guarantee that the hashing algorithm won't change in a bugfix release? For instance, can I depend that

    python222 -c 'print hash(1), hash("a")'
    python223 -c 'print hash(1), hash("a")'

will both output the same thing, even if

    python23 -c 'print hash(1), hash("a")'

and

    python3000 -c 'print hash(1), hash("a")'

may print something different?

Jeff
On Fri, May 30, 2003 at 08:41:54PM -0400, Guido van Rossum wrote:
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before).
Is there at least a guarantee that the hashing algorithm won't change in a bugfix release? For instance, can I depend that

    python222 -c 'print hash(1), hash("a")'
    python223 -c 'print hash(1), hash("a")'

will both output the same thing, even if

    python23 -c 'print hash(1), hash("a")'

and

    python3000 -c 'print hash(1), hash("a")'

may print something different?
That's a reasonable assumption, yes. We realize that changing the hash algorithm is a feature change, even if it is a very subtle one. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Fri, May 30, 2003 at 08:41:54PM -0400, Guido van Rossum wrote:
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before).
Is there at least a guarantee that the hashing algorithm won't change in a bugfix release? For instance, can I depend that

    python222 -c 'print hash(1), hash("a")'
    python223 -c 'print hash(1), hash("a")'

will both output the same thing, even if

    python23 -c 'print hash(1), hash("a")'

and

    python3000 -c 'print hash(1), hash("a")'

may print something different?
That's a reasonable assumption, yes. We realize that changing the hash algorithm is a feature change, even if it is a very subtle one.
For Scott's proposal to work, it would have to change the hash value on every invocation of Python. If not, colliding keys can be found with a Monte Carlo method. Raymond Hettinger
[Jeff Epler]
Is there at least a guarantee that the hashing algorithm won't change in a bugfix release?
Guido said "yes" weakly, but the issue hasn't come up in recent times. In the past we've changed the hash functions for at least strings, tuples, and floats, based on systematic weaknesses uncovered by real-life ordinary data. OTOH, there's a requirement that, for objects of types that can be used as dict keys, two objects that compare equal must deliver equal hash codes, so people creating (mostly) number-like or (not sure if anyone does this) string-like types have to duplicate the hash codes Python delivers for builtin numbers and strings that compare equal to objects of their types. For example, the author of a Rational class should arrange for hash(Rational(42, 1)) to deliver the same result as hash(42) == hash(42L) == hash(42.0) == hash(complex(42.0, 0.0)) Such code would break if we changed the int/long/float/complex hashes for inputs that compare equal to integers. Tedious exercise for the reader: find a set of bad datetime objects in 2.3 ("bad" in the sense of their hash codes colliding; on a box where hash() returns a 32-bit int, there must be collisions, since datetime objects have way more than 32 independent bits of state).
[Guido]
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before).
Last Thursday, Jim and I speculated about adding a persistent set type to Zope, implemented via an IOBTree under the covers, mapping int hash codes to lists of elements. I'm sure someone else has already done something "like that"; I know I have in the past, but only to run monstrously large hash-function evaluation tests, where I needed to accumulate results across multiple runs.

It's curious that persistence schemes are fragile in areas Python doesn't really care about, like how objects of different types compare to each other, how class instances that don't define __cmp__ compare to each other, and in exactly which bit pattern hash(whatever) returns. In Python 3 it may be nice to define the results of such things more carefully in platform- and release-independent ways -- or maybe to raise exceptions instead of making results up, forcing users to confront the issues explicitly. Lots of tradeoffs here ...
[Guido]
Of course, such programs are already vulnerable to changes in the hash implementation between Python versions (which has happened before).
[Tim]
Last Thursday, Jim and I speculated about adding a persistent set type to Zope, implemented via an IOBTree under the covers, mapping int hash codes to lists of elements. I'm sure someone else has already done something "like that"; I know I have in the past, but only to run monstrously large hash-function evaluation tests, where I needed to accumulate results across multiple runs.
I hope I can dissuade you from using Python's default hash codes; they differ across platforms with different word lengths, too. I don't know enough about the requirements for this proposed persistent set type; perhaps it would be acceptable to require that the set elements define a method that returns a unique identifier which is an immutable object using only Python built-in types?
It's curious that persistence schemes are fragile in areas Python doesn't really care about, like how objects of different types compare to each other, how class instances that don't define __cmp__ compare to each other, and in exactly which bit pattern hash(whatever) returns.
I doubt this is a coincidence. When designing Python's run time, I was very well aware of process boundaries and lifetime, and used them to limit the scope of universal truths. I wasn't really thinking of persistence.
In Python 3 it may be nice to define the results of such things more carefully in platform- and release-independent ways -- or maybe to raise exceptions instead of making results up, forcing users to confront the issues explicitly. Lots of tradeoffs here ...
For comparisons, I've long considered the current "everything can be ordered with respect to everything else" paradigm broken, and for 3.0 I plan to break this, allowing only == and != comparisons between objects of different types that don't specifically cater to cross-type comparisons (the default being unequal if the types differ). Ordering comparisons will default to an exception.

But this won't affect hashing; I don't think I'd like to see the hash function fixed by the language standard, so the hash function may still vary between releases (or between runs, for Python implementations that follow Scott's recommendation).

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido]
I hope I can dissuade you from using Python's default hash codes;
It's easy to dissuade me; it's Jim we have to worry about <wink>.
they differ across platforms with different word lengths, too. I don't know enough about the requirements for this proposed persistent set type; perhaps it would be acceptable to require that the set elements define a method that returns a unique identifier which is an immutable object using only Python built-in types?
I hope so, but don't know the use cases well enough to guess whether that would be considered good enough.
... I doubt this is a coincidence. When designing Python's run time, I was very well aware of process boundaries and lifetime, and used them to limit the scope of universal truths. I wasn't really thinking of persistence.
Neither was I <wink>.
... For comparisons, I've long considered the current "everything can be ordered with respect to everything else" paradigm broken, and for 3.0 I plan to break this, allowing only == and != comparisons between objects of different types that don't specifically cater to cross-type comparisons (the default being unequal if the types differ). Ordering comparisons will default to an exception.
That would go a long way toward keeping people out of trouble with persistent BTree-based structures.
But this won't affect hashing; I don't think I'd like to see the hash function fixed by the language standard, so the hash function may still vary between releases (or between runs, for Python implementations that follow Scott's recommendation).
For Python 3 I hope we (you) can consider another line of flexibility too: sometimes when I build a hash table, I want an __eq__ that isn't "the natural" __eq__ for the key objects. For example, using a dict to partition objects by equivalence class wants to use the equivalence relation of interest at the moment. This implies a specific dict may want to use a "non-standard" __hash__ too. Hiding the real objects in wrapper objects works for this purpose, of course, but sometimes it would be a lot more convenient to pass hash and eq functions to a dict's constructor.
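For instance, the wrapper approach might look something like this (a rough sketch with made-up names; the equivalence-class example at the end is the kind of use I have in mind):

class _Key(object):
    # Wraps a key so the supplied hash and eq functions are used instead
    # of the key's own __hash__/__eq__.
    def __init__(self, obj, hashfunc, eqfunc):
        self.obj = obj
        self.hashfunc = hashfunc
        self.eqfunc = eqfunc
    def __hash__(self):
        return self.hashfunc(self.obj)
    def __eq__(self, other):
        return self.eqfunc(self.obj, other.obj)

class CustomDict(object):
    # A dict-like class whose constructor takes hash and eq functions.
    def __init__(self, hashfunc, eqfunc):
        self.hashfunc = hashfunc
        self.eqfunc = eqfunc
        self.data = {}
    def _wrap(self, key):
        return _Key(key, self.hashfunc, self.eqfunc)
    def __setitem__(self, key, value):
        self.data[self._wrap(key)] = value
    def __getitem__(self, key):
        return self.data[self._wrap(key)]
    def __contains__(self, key):
        return self._wrap(key) in self.data

# Partition strings into equivalence classes by length.
d = CustomDict(hashfunc=len, eqfunc=lambda a, b: len(a) == len(b))
d["spam"] = 1
print "eggs" in d   # True: same class as "spam" under this equivalence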
For Python 3 I hope we (you) can consider another line of flexibility too: sometimes when I build a hash table, I want an __eq__ that isn't "the natural" __eq__ for the key objects. For example, using a dict to partition objects by equivalence class wants to use the equivalence relation of interest at the moment. This implies a specific dict may want to use a "non-standard" __hash__ too. Hiding the real objects in wrapper objects works for this purpose, of course, but sometimes it would be a lot more convenient to pass hash and eq functions to a dict's constructor.
Couldn't you write a dict-like class that does the wrapping for you? I'm worried that adding such flexibility to dict will slow down the common case. --Guido van Rossum (home page: http://www.python.org/~guido/)