Make str/bytes hash algorithm pluggable?
Hi, some of you may have seen that I'm working on a PEP for a new hash API and new algorithms for hashing of bytes and str. The PEP has three major aspects. It introduces DJB's SipHash as a secure hash algorithm, changes the hash API to process blocks of data instead of characters, and it adds an API to make the algorithm pluggable. A draft is already available [1].

Now I got some negative feedback on the 'pluggable' aspect of the new PEP on Twitter [2]. I'd like to get feedback from you before I finalize the PEP.

The PEP proposes a pluggable hash API for a couple of reasons. I'd like to give users of Python a chance to replace a secure hash algorithm with a faster hash algorithm. SipHash is about as fast as FNV for common cases, as our implementation of FNV processes only 8 to 32 bits per cycle instead of 32 or 64. I haven't actually benchmarked how a faster hash algorithm affects a real program, though ...

I'd also like to make it easier to replace the hash algorithm with a different one in case a vulnerability is found. With the new API, vendors and embedders have an easy and clean way to use their own hash implementation or an optimized version that is more suitable for their platform, too. For example a mobile phone vendor could provide an optimized implementation with ARM NEON intrinsics.

On which level should Python support a pluggable hash algorithm?

1) Compile time option: The hash code is compiled into Python's core. Embedders have to recompile Python with different options to replace the function.

2) Library option: A hash algorithm can be added, and one available hash algorithm can be set before Py_Initialize() is called for the first time. This approach gives embedders the chance to set their own algorithm without recompiling Python.

3) Startup option: Like 2), plus an additional environment variable and command line argument to select an algorithm. With a startup option, users can select a different algorithm themselves.
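For readers who want a concrete picture of the per-character algorithm being replaced, here is a pure-Python sketch in the shape of the pre-3.4 FNV-style string hash. This is a simplified model, not the exact C code; the 64-bit mask, the function name, and the `prefix`/`suffix` parameters (standing in for the randomization seed) are illustrative:

```python
def fnv_style_hash(s: str, prefix: int = 0, suffix: int = 0) -> int:
    """Toy model of a per-character multiplicative string hash.

    It consumes one code point per loop iteration -- the property
    criticized above, since it cannot exploit 64-bit wide loads.
    """
    MASK = 2**64 - 1                       # simulate 64-bit wraparound
    if not s:
        return 0                           # empty string hashes to 0
    x = (prefix ^ (ord(s[0]) << 7)) & MASK
    for ch in s:                           # one character at a time
        x = ((1000003 * x) ^ ord(ch)) & MASK
    return (x ^ len(s) ^ suffix) & MASK
```

Because each step multiplies by an odd constant (a bijection modulo 2**64), different seeds or inputs propagate through to different outputs.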
Christian [1] http://www.python.org/dev/peps/pep-0456/ [2] https://twitter.com/EDEADLK/status/385572395777818624
On Thu, 03 Oct 2013 20:42:28 +0200
Christian Heimes
I haven't actually benchmarked how a faster hash algorithm affects a real program, though ...
Chances are it doesn't. Only a "slow enough" hash algorithm might have an impact, IMHO.
On which level should Python support a pluggable hash algorithm?
1) Compile time option: The hash code is compiled into Python's core. Embedders have to recompile Python with different options to replace the function.
Not much point IMHO. Embedders can patch Python if they really need this.
2) Library option: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
Too complicated. The library option should only offer the option to replace the hash algorithm, not "add an available algorithm".
3) Startup options: Like 2) plus an additional environment variable and command line argument to select an algorithm. With a startup option users can select a different algorithm themselves.
-0.9. I think it's overkill. Regards Antoine.
Hm. I would like to stick to the philosophy that Python's hash should be as
fast as it possibly can be, and should not be mistaken for a cryptographic
hash. The point is to optimize dict lookups, nothing more, given typical
(or even atypical) key distribution, not to thwart deliberate attacks. We
already have adopted a feature that plugged most viable attacks on web
apps, I think that's enough. I also agree with Antoine's response.
On Thu, Oct 3, 2013 at 11:42 AM, Christian Heimes
Hi,
some of you may have seen that I'm working on a PEP for a new hash API and new algorithms for hashing of bytes and str. The PEP has three major aspects. It introduces DJB's SipHash as a secure hash algorithm, changes the hash API to process blocks of data instead of characters, and it adds an API to make the algorithm pluggable. A draft is already available [1].
Now I got some negative feedback on the 'pluggable' aspect of the new PEP on Twitter [2]. I'd like to get feedback from you before I finalize the PEP.
The PEP proposes a pluggable hash API for a couple of reasons. I'd like to give users of Python a chance to replace a secure hash algorithm with a faster hash algorithm. SipHash is about as fast as FNV for common cases, as our implementation of FNV processes only 8 to 32 bits per cycle instead of 32 or 64. I haven't actually benchmarked how a faster hash algorithm affects a real program, though ...
I'd also like to make it easier to replace the hash algorithm with a different one in case a vulnerability is found. With the new API, vendors and embedders have an easy and clean way to use their own hash implementation or an optimized version that is more suitable for their platform, too. For example a mobile phone vendor could provide an optimized implementation with ARM NEON intrinsics.
On which level should Python support a pluggable hash algorithm?
1) Compile time option: The hash code is compiled into Python's core. Embedders have to recompile Python with different options to replace the function.
2) Library option: A hash algorithm can be added, and one available hash algorithm can be set before Py_Initialize() is called for the first time. This approach gives embedders the chance to set their own algorithm without recompiling Python.
3) Startup options: Like 2) plus an additional environment variable and command line argument to select an algorithm. With a startup option users can select a different algorithm themselves.
Christian
[1] http://www.python.org/dev/peps/pep-0456/ [2] https://twitter.com/EDEADLK/status/385572395777818624
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
-- --Guido van Rossum (python.org/~guido)
Am 03.10.2013 21:05, schrieb Guido van Rossum:
Hm. I would like to stick to the philosophy that Python's hash should be as fast as it possibly can be, and should not be mistaken for a cryptographic hash. The point is to optimize dict lookups, nothing more, given typical (or even atypical) key distribution, not to thwart deliberate attacks. We already have adopted a feature that plugged most viable attacks on web apps, I think that's enough. I also agree with Antoine's response.
Python's hash is neither as fast nor as secure as it can possibly be. It's not as fast because it doesn't use the full power of modern CPUs. In most cases the code processes only 1 or 2 bytes per cycle instead of 8 bytes on 64-bit architectures. Jean-Philippe Aumasson and Daniel J. Bernstein (who are coincidentally the authors of SipHash) have shown how to recover Python's randomization keys.

SipHash: more secure and about the same speed on most systems
optimized FNV: faster but with a known issue

Christian
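To illustrate what "process blocks of data instead of characters" means in practice, here is a hedged sketch that consumes 8 bytes per round. It is illustrative only: an FNV-1a-style toy with a made-up name, not SipHash, and it has no security properties:

```python
import struct

def block_hash(data: bytes, seed: int = 0) -> int:
    """Toy block-oriented hash: mixes 8 bytes per round instead of 1."""
    MASK = 2**64 - 1
    x = seed ^ len(data)
    n = len(data) // 8 * 8                        # length of the full-word part
    for (word,) in struct.iter_unpack('<Q', data[:n]):
        x = ((x ^ word) * 0x100000001B3) & MASK   # mix with the 64-bit FNV prime
    tail = data[n:]                               # trailing 1..7 bytes
    if tail:
        (word,) = struct.unpack('<Q', tail.ljust(8, b'\0'))
        x = ((x ^ word) * 0x100000001B3) & MASK
    return x
```

A real implementation would read aligned machine words in C; the point is only that the inner loop runs len/8 times rather than len times.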
On Thu, Oct 3, 2013 at 12:23 PM, Christian Heimes
Am 03.10.2013 21:05, schrieb Guido van Rossum:
Hm. I would like to stick to the philosophy that Python's hash should be as fast as it possibly can be, and should not be mistaken for a cryptographic hash. The point is to optimize dict lookups, nothing more, given typical (or even atypical) key distribution, not to thwart deliberate attacks. We already have adopted a feature that plugged most viable attacks on web apps, I think that's enough. I also agree with Antoine's response.
Python's hash is neither as fast nor as secure as it can possibly be.
But fixing that shouldn't need all the extra stuff you're proposing. It's not as fast because it doesn't use the full power of modern CPUs.
In most cases the code processes only 1 or 2 bytes per cycle instead of 8 bytes on 64-bit architectures. Jean-Philippe Aumasson and Daniel J. Bernstein (who are coincidentally the authors of SipHash) have shown how to recover Python randomization keys.
What's a Python randomization key?
SipHash: more secure and about the same speed on most systems
Same speed as what?
optimized FNV: faster but with a known issue
What issue? -- --Guido van Rossum (python.org/~guido)
Am 03.10.2013 21:45, schrieb Guido van Rossum:
But fixing that shouldn't need all the extra stuff you're proposing.
I have proposed some of the extra stuff for more flexibility, the rest is for testing and debugging.
What's a Python randomization key?
Python's hash randomization key, the seed to randomize the output of hash() for bytes and str.
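The seed's effect is easy to observe from the outside by pinning PYTHONHASHSEED in a child interpreter. A small demonstration (`hash_of` is a helper name made up here):

```python
import os
import subprocess
import sys

def hash_of(literal: str, seed: str) -> int:
    """Hash a string in a fresh interpreter with a fixed PYTHONHASHSEED."""
    out = subprocess.run(
        [sys.executable, '-c', f'print(hash({literal!r}))'],
        env=dict(os.environ, PYTHONHASHSEED=seed),
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# Same seed -> identical hashes across interpreter runs; different
# seeds -> (almost certainly) different hashes for the same string.
```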
SipHash: more secure and about the same speed on most systems
Same speed as what?
Same speed as the current algorithm in Python 3.3 and earlier.
optimized FNV: faster but with a known issue
What issue?
Quote from https://131002.net/siphash/#at --- Jointly with Martin Boßlet, we demonstrated weaknesses in MurmurHash (used in Ruby, Java, etc.), CityHash (used in Google), and in Python's hash. Some of the technologies affected have switched to SipHash. See this oCERT advisory, and the following resources: [...] - Python script https://131002.net/siphash/poc.py to recover the secret seed of the hash randomization in Python 2.7.3 and 3.2.3 --- It's all documented in my PEP draft, too. Christian
On Thu, Oct 3, 2013 at 12:55 PM, Christian Heimes
Am 03.10.2013 21:45, schrieb Guido van Rossum:
But fixing that shouldn't need all the extra stuff you're proposing.
I have proposed some of the extra stuff for more flexibility, the rest is for testing and debugging.
Hm, I don't think we need more infrastructure for this. As Antoine said, if you're hacking on this you might as well edit the source.
What's a Python randomization key?
Python's hash randomization key, the seed to randomize the output of hash() for bytes and str.
Is the seed itself crypto-safe? (I.e. is it derived carefully from urandom?)
SipHash: more secure and about the same speed on most systems
Same speed as what?
Same speed as the current algorithm in Python 3.3 and earlier.
OK, then I have no objection to switching to it, *if* the security issue is really worth fixing. Otherwise it would be better to look for a hash that is *faster*, given your assertion that the current hash is inefficient.
optimized FNV: faster but with a known issue
What issue?
Quote from https://131002.net/siphash/#at --- Jointly with Martin Boßlet, we demonstrated weaknesses in MurmurHash (used in Ruby, Java, etc.), CityHash (used in Google), and in Python's hash. Some of the technologies affected have switched to SipHash. See this oCERT advisory, and the following resources:
[...]
- Python script https://131002.net/siphash/poc.py to recover the secret seed of the hash randomization in Python 2.7.3 and 3.2.3
Sounds a bit like some security researchers drumming up business. If you can run the binary, presumably you can also recover the seed by looking in /proc, right? Or use ctypes or something. This demonstration seems of academic interest only.
---
It's all documented in my PEP draft, too.
Yeah, there's lots of stuff there. I'm looking for the TL;DR version. :-) -- --Guido van Rossum (python.org/~guido)
03.10.13 23:47, Guido van Rossum написав(ла):
On Thu, Oct 3, 2013 at 12:55 PM, Christian Heimes <christian@python.org> wrote:
Am 03.10.2013 21:45, schrieb Guido van Rossum:
> But fixing that shouldn't need all the extra stuff you're proposing.
I have proposed some of the extra stuff for more flexibility, the rest is for testing and debugging.
Hm, I don't think we need more infrastructure for this. As Antoine said, if you're hacking on this you might as well edit the source.
What we could do is to move all hash-related stuff into separate .c and .h files.
> SipHash: more secure and about the same speed on most systems
>
> Same speed as what?
Same speed as the current algorithm in Python 3.3 and earlier.
OK, then I have no objection to switching to it, *if* the security issue is really worth fixing. Otherwise it would be better to look for a hash that is *faster*, given your assertion that the current hash is inefficient.
Actually it's the same speed only for UCS1 strings. For UCS2 and UCS4 strings it can be 5x to 10x slower [1]. But I don't know how it affects real programs. [1] http://bugs.python.org/issue14621#msg175048
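Serhiy's point follows from PEP 393: a string's internal width is 1, 2 or 4 bytes per code point, so a hash that walks the raw representation must consume len(s) * width bytes. A small sketch of the width rule (`ucs_width` is a made-up helper; the real "kind" is an internal C-level detail):

```python
def ucs_width(s: str) -> int:
    """Bytes per code point in CPython's flexible string representation,
    inferred from the largest code point (PEP 393)."""
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 1        # latin-1 range -> UCS1
    if m < 0x10000:
        return 2        # BMP -> UCS2
    return 4            # astral code points -> UCS4

# A block hash over the raw buffer processes len(s) * ucs_width(s) bytes,
# so a UCS4 string feeds it 4x the data of an ASCII string of equal length.
```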
Hi Guido,
On Thu, Oct 3, 2013 at 10:47 PM, Guido van Rossum
Sounds a bit like some security researchers drumming up business. If you can run the binary, presumably you can also recover the seed by looking in /proc, right? Or use ctypes or something. This demonstration seems of academic interest only.
I'll not try to defend the opposite point of view very actively, but let me just say that, in my opinion, your objection is not valid. It is broken the same way as a different objection, which would claim that Python can be made sandbox-safe without caring about the numerous segfault cases. They are all very obscure, for sure; I tried at some point to list them in Lib/test/crashers. I gave up when people started deleting the files because they no longer crashed on newer versions, just because details changed --- but not because the general crash they explained was in any way fixed... Anyway, my point is that most segfaults can, given enough effort, be transformed into a single, well-documented tool to conduct a large class of attacks.

The hash issue is similar. It should IMHO either be ignored (which is fine for a huge fraction of users), or seriously fixed by people with the correctly pessimistic approach. The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server. The only benefit of this hash randomization option (-R) was to say to the press that Python fixed the problem very quickly when it was mediatized :-/

This kind of security issue should never be classified as "academic interest only". Instead it can be classified as "it will take weeks / months / years before some crazy man manages to put together a general attack script, but likely, someone will eventually".

From this point of view I'm saluting Christian's effort, even if I prefer to stay far away from this kind of issue myself :-)

A bientôt,

Armin.
2013/10/4 Armin Rigo
The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server.
Oh interesting, is it public? If yes, could you please share the URL of the exploit? I'm more motivated to fix an issue if it is proven to be exploitable.

I still fail to understand the real impact of a hash DoS compared to other kinds of DoS. It's like the XML bomb: the vulnerability has also been known for many years, but Christian only fixed the issue recently (and the fix was implemented in a package on the Cheeseshop, not in the stdlib! Is that correct?).
The only benefit of this hash randomization option (-R) was to say to the press that Python fixed very quickly the problem when it was mediatized :-/
The real benefit is to warn users that they should not rely on the dictionary or set order/representation (in their unit tests), and that the hash function is not deterministic :-) (So now it is much easier to replace the hash function with SipHash or anything else, without breaking new applications.) Victor
Am 04.10.2013 11:15, schrieb Victor Stinner:
2013/10/4 Armin Rigo
The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server.
Oh interesting, is it public? If yes, could you please share the URL of the exploit? I'm more motivated to fix an issue if it is proven to be exploitable.
I'm intrigued, too!
I still fail to understand the real impact of a hash DoS compared to other kinds of DoS. It's like the XML bomb: the vulnerability has also been known for many years, but Christian only fixed the issue recently (and the fix was implemented in a package on the Cheeseshop, not in the stdlib! Is that correct?).
About the XML bomb and other issues ... I kinda lost my motivation to push the fixes into the stdlib. :( The code is ready. It just needs a proper configuration interface / API.

The hash DoS and XML DoS vulnerabilities have one thing in common: both multiply the effectiveness of an attack by several orders of magnitude. You don't need 100 GBit/sec to kick a service out of existence. A simple DSL line or a mobile phone with 3G/HSDPA does the same job (if done right). Nowadays Python is important; for example, major parts of the Brazilian Government run on Python, Zope and Plone. There are Dropbox, Google App Engine ...
The real benefit is to warn users that they should not rely on the dictionary or set order/representation (in their unit tests), and that the hash function is not deterministic :-)
(So now it is much easier to replace the hash function with SipHash or anything else, without breaking new applications.)
Thanks for your groundwork and groundbreaking work, Victor! :) Christian
Le Fri, 4 Oct 2013 11:15:17 +0200,
Victor Stinner
2013/10/4 Armin Rigo
The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server.
Oh interesting, is it public? If yes, could you please share the URL of the exploit? I'm more motivated to fix an issue if it is proven to be exploitable.
I still fail to understand the real impact of a hash DoS compared to other kinds of DoS. It's like the XML bomb: the vulnerability has also been known for many years, but Christian only fixed the issue recently (and the fix was implemented in a package on the Cheeseshop, not in the stdlib! Is that correct?).
The only benefit of this hash randomization option (-R) was to say to the press that Python fixed very quickly the problem when it was mediatized :-/
The real benefit is to warn users that they should not rely on the dictionary or set order/representation (in their unit tests), and that the hash function is not deterministic :-)
I agree it probably had educational value. Regards Antoine.
Quoting Victor Stinner
I still fail to understand the real impact of a hash DoS compared to other kinds of DoS.
I think the key question is: how many attacking nodes do you need to control to effectively make some system deny service. A threat is bigger if you can do it with 10 requests/s from a single host, instead of needing 10,000 hosts, each making 1000 requests/s.

With the hash DoS, the threat is that if you manage to fill some dictionary with colliding keys, then each lookup will take a very long time, and you might arrange to put many lookups into a single HTTP request. So a single HTTP request might get very costly CPU-wise.

Whether this is a serious threat or not depends on what other threats the system being attacked is vulnerable to. Maybe there is something even simpler, or maybe the hash attack is the only hope of bringing the system to its knees.

IMO, the hash attack is particularly tricky since it is very easy to argue about and very difficult to demonstrate. So it can result in fear and uncertainty very easily, causing people to overreact just so that they won't be accused of inactivity.

Regards, Martin
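The quadratic blow-up Martin describes can be observed directly with a toy key type that forces every insertion into the same bucket (the class and helper names are made up for the demonstration):

```python
class CollidingKey:
    """Every instance hashes alike, so dict probes must fall back on
    __eq__; the class counts how many comparisons occur."""
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42                       # all keys collide

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        return isinstance(other, CollidingKey) and self.n == other.n

def insert_n(n):
    """Insert n all-colliding keys and report the comparison count."""
    CollidingKey.eq_calls = 0
    d = {}
    for i in range(n):
        d[CollidingKey(i)] = i
    return CollidingKey.eq_calls
```

Doubling the number of keys roughly quadruples the comparisons (about n*(n-1)/2 in total), which is exactly the amplification a hash-flooding request exploits.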
Le Fri, 04 Oct 2013 17:13:32 +0200, martin@v.loewis.de a écrit :
Whether this is a serious threat or not depends on what other threats the system being attacked is vulnerable to. Maybe there is something even simpler, or maybe the hash attack is the only hope of bringing the system to its knees.
IMO, the hash attack is particularly tricky since it is very easy to argue and very difficult to demonstrate.
If you know how to generate colliding hashes, it's actually relatively easy to demonstrate, assuming you know how a particular Web application processes its incoming requests (which you do if it's a standard Web application such as hgweb). Regards Antoine.
2013/10/4
Quoting Victor Stinner
: I still fail to understand the real impact of a hash DoS compared to other kinds of DoS.
I think the key question is: how many attacking nodes do you need to control to effectively make some system deny service. A threat is bigger if you can do it with 10 requests/s from a single host, instead of needing 10,000 hosts, each making 1000 requests/s.
Correct. I know that there are some other "cheap" attacks directly at the network layer. For example, the "Spamhaus/CloudFlare" attack which made a lot of noise ("300 Gbit/sec") used a DNS trick: "The traffic is being generated primarily from DNS amplification attacks. Small requests are sent to DNS servers, generating responses from those servers that are about 50-100 times larger." http://arstechnica.com/security/2013/03/spamhaus-ddos-grows-to-internet-thre... In this case, you still need many computers to DoS a server (=> DDoS).
With the hash DoS, the threat is that if you manage to fill some dictionary with colliding keys, then each lookup will take a very long time, and you might arrange to put many lookups into a single HTTP request. So a single HTTP request might get very costly CPU-wise.
Ok, but why should we invest time to fix this specific DoS whereas there are other DoS attacks like the XML bomb? Why not set a limit on the CPU time in your favorite web framework instead? I don't know the complexity of adding sandbox-like features to a web framework. (It's probably complex, because we are discussing how to fix the issue directly in Python :-))
Whether this is a serious threat or not depends on what other threats the system being attacked is vulnerable to. Maybe there is something even simpler, or maybe the hash attack is the only hope of bringing the system to its knees.
Popular DDoS attacks are usually the simplest ones, like flooding the server with ping requests, flooding the DNS server, or flooding with HTTP requests which take a lot of time to process, etc. Using a botnet, you don't care about using an inefficient DoS attack, because your power is the number of zombies. I have no idea of the price of renting a botnet; it's probably expensive (and illegal as well).
IMO, the hash attack is particularly tricky since it is very easy to argue and very difficult to demonstrate. So it can result in fear and uncertainty very easily, causing people to overreact just so that they won't be accused of inactivity.
It would be easy to evaluate the risk with a public exploit on a real world application :-) Victor
On Sat, Oct 05, 2013 at 01:27:37AM +0200, Victor Stinner wrote:
I have no idea of the price of renting a botnet, it's probably expensive (and illegal as well).
Twelve cents per machine. Cheaper in bulk, and cheaper still for machines outside of the US. For those on a budget, you can get ten thousand zombie machines scattered all over the world for two cents each. http://threatpost.com/how-much-does-botnet-cost-022813/77573 I believe you can also rent a botnet for $2 an hour. -- Steven
Am 05.10.13 01:27, schrieb Victor Stinner:
Ok, but why should we invest time to fix this specific DoS whereas there are other DoS attacks like the XML bomb?
That is a question about the very mechanics of free software. "We" don't need to invest time into anything (and you may have noticed that I lately actually don't :-) If you think this is a waste of time, just sit back and watch it evolve - it's Christian's time that may get wasted (and the time of anybody who chooses to respond). He is writing a PEP, and the same question can be asked about any feature that goes into Python: why this feature, and not a different one? FWIW, I personally think that a lot of effort was wasted in micro-optimizing the Unicode implementation :-)

If you actually think that changing this aspect of Python is a bad idea, then you do need to get involved in actively opposing the PEP. I personally think that this "pluggable hash function" stuff is a bad idea. Changing the hash function is ok as long as it doesn't get dramatically slower.
Why not setting a limit on the CPU time in your favorite web framework instead?
Because that is not implementable in general, and might harm the service. If you disagree about the non-implementability, please propose a specific technology to limit the CPU consumption *per HTTP request*. It might harm the service because /some/ requests might legitimately incur a high CPU cost. So put in your sandbox technology a mechanism to white-list specific URLs, or to have the CPU limit depend on the URL that is being requested.
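For what it's worth, the closest Python-level approximation I can think of is a crude sketch with exactly the problems Martin names: it is process-global, Unix-only, signal-based, and cannot interrupt code stuck inside a single C function. It uses ITIMER_VIRTUAL, which counts CPU time rather than wall time (the function and exception names are made up here):

```python
import signal

class CpuBudgetExceeded(Exception):
    pass

def run_with_cpu_budget(func, seconds):
    """Run func() but raise if it burns more than `seconds` of CPU time.

    ITIMER_VIRTUAL delivers SIGVTALRM after the process consumes the
    given CPU time, and the handler raises into the running Python
    code between bytecodes.
    """
    def on_timeout(signum, frame):
        raise CpuBudgetExceeded
    old_handler = signal.signal(signal.SIGVTALRM, on_timeout)
    signal.setitimer(signal.ITIMER_VIRTUAL, seconds)
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_VIRTUAL, 0)   # disarm the timer
        signal.signal(signal.SIGVTALRM, old_handler)
```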
Popular DDoS attacks are usually the simplest ones, like flooding the server with ping requests, flooding the DNS server, or flooding with HTTP requests which take a lot of time to process, etc. Using a botnet, you don't care about using an inefficient DoS attack, because your power is the number of zombies.
I have no idea of the price of renting a botnet, it's probably expensive (and illegal as well).
Talking about actual attackers, I think the concern here is script kiddies: people who don't want to invest a lot of money in some illegal activity, but who just learned that they can kill service XYZ if they run this-or-that script - and want to try out whether it actually works. I believe that professional criminals aren't too interested in DDoS; they buy botnets to distribute spam. Regards, Martin
On 10/04/2013 11:15 AM, Victor Stinner wrote:
2013/10/4 Armin Rigo
The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server.

Oh interesting, is it public?
http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html Quoting the synopsis: We also describe a vulnerability of Python's new randomized hash, allowing an attacker to easily recover the 128-bit secret seed. I found all that while reading this interesting, yet moribund, bug report: http://bugs.python.org/issue14621 I guess there was enough bike shedding that people ran out of steam, or something. It happens. //arry/
2013/10/5 Larry Hastings
On 10/04/2013 11:15 AM, Victor Stinner wrote:
2013/10/4 Armin Rigo
The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server.
Oh interesting, is it public?
http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
Quoting the synopsis:
We also describe a vulnerability of Python's new randomized hash, allowing an attacker to easily recover the 128-bit secret seed.
The SipHash homepage contains a "proof of concept" to compute the secret: https://131002.net/siphash/poc.py But the script is not an exploit against a web server; it runs locally. It requires, for example, knowing the hash of the strings "\0" and "\0\0". I would like to know if it's possible to retrieve such information in practice.

And how do you retrieve the whole hash value from an HTTP page? You may retrieve some bits using specific HTTP requests, but not directly the whole hash value. I don't know of any web page directly displaying the hash value of a string coming from the user request!?

I'm not saying that the hash DoS does not exist; I'm just trying to estimate the risk (compared to other DoS attacks). Changing the default hash function is also risky and has a (well, minor) impact on performance. Victor
On 10/05/2013 01:14 AM, Victor Stinner wrote:
And how do you retrieve the whole hash value from an HTTP page? You may retrieve some bits using specific HTTP requests, but not directly the whole hash value. I don't know any web page displaying directly the hash value of a string coming from the user request!?
Armin Rigo handwaves his way through an approach here: http://bugs.python.org/issue14621#msg173455 You use a "timing attack" to get the algorithm to "leak" a bit at a time. I have no idea how that actually works, I don't have a background in security, nor a sufficiently devious mindset to work it out for myself. //arry/
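A deliberately tiny model may make the idea less mysterious. Everything here is made up for illustration: an 8-bit toy "seed", a weak toy hash, and an oracle standing in for the timing channel; the real attack targets a 128-bit seed remotely and is far more sophisticated:

```python
def toy_hash(data: bytes, seed: int) -> int:
    """Deliberately weak 8-bit keyed hash -- NOT Python's algorithm."""
    x = seed & 0xFF
    for b in data:
        x = ((x * 33) ^ (x >> 4) ^ b) & 0xFF
    return x

def bucket(data: bytes, seed: int) -> int:
    """What a timing side channel plausibly leaks: which of 16 hash
    buckets a chosen key lands in (a long bucket means a slow lookup)."""
    return toy_hash(data, seed) & 0x0F

def consistent_seeds(oracle):
    """Offline search: keep every seed that explains the observed
    buckets of a handful of chosen keys."""
    probes = [bytes([i]) for i in range(8)] + [bytes([i, 200]) for i in range(8)]
    observed = [oracle(p) for p in probes]
    return [s for s in range(256)
            if all(bucket(p, s) == o for p, o in zip(probes, observed))]
```

A few bucket observations shrink 256 possible seeds to a handful of candidates; scaled up, that is the flavor of attack that recovered Python's randomization seed bit by bit.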
On Thu, Oct 3, 2013 at 11:20 PM, Armin Rigo
Hi Guido,
On Thu, Oct 3, 2013 at 10:47 PM, Guido van Rossum
wrote: Sounds a bit like some security researchers drumming up business. If you can run the binary, presumably you can also recover the seed by looking in /proc, right? Or use ctypes or something. This demonstration seems of academic interest only.
I'll not try to defend the opposite point of view very actively, but let me just say that, in my opinion, your objection is not valid. It is broken the same way as a different objection, which would claim that Python can be made sandbox-safe without caring about the numerous segfault cases. They are all very obscure for sure; I tried at some point to list them in Lib/test/crashers. I gave up when people started deleting the files because they no longer crashed on newer versions, just because details changed --- but not because the general crash they explained was in any way fixed... Anyway, my point is that most segfaults can, given enough effort, be transformed into a single, well-documented tool to conduct a large class of attacks.
The hash issue is similar. It should IMHO either be ignored (which is fine for a huge fraction of users), or seriously fixed by people with the correctly pessimistic approach. The current hash randomization is simply not preventing anything; someone posted long ago a way to recover bit-by-bit the hash randomization used by a remote web program in Python running on a server. The only benefit of this hash randomization option (-R) was to say to the press that Python fixed the problem very quickly when it was mediatized :-/
This kind of security issues should never be classified as "academic interest only". Instead they can be classified as "it will take weeks / months / years before some crazy man manages to put together a general attack script, but likely, someone will eventually".
From this point of view I'm saluting Christian's effort, even if I prefer to stay far away from this kind of issues myself :-)
I don't want to defend my position too actively either: I'm fine with a new hash function as long as it's either faster, or safer and not slower, and I think that being able to change it at compile time is good enough.

However, I do want to object to the view (common among security "purists") that any vulnerability, no matter how hard to exploit (or fix), needs to be treated the same. I consider security work an arms race, where the effort spent on defense must be commensurate with the risk of attack. Just because your bike lock can still be busted by a professional bike thief (or picked by an ingenious MIT student), that doesn't mean it's worthless -- especially if the unpickable lock might be worth (or weigh!) more than the bike. A good defense combines multiple techniques -- e.g. common sense about when and where you park your bike, together with a mediocre lock, might make the theft risk acceptably low. -- --Guido van Rossum (python.org/~guido)
On Sat, 5 Oct 2013 08:25:01 -0700
Guido van Rossum
A good defense combines multiple techniques -- e.g. common sense about when and where you park your bike together with a mediocre lock might make the theft risk acceptably low.
(just remember to park it under a shed with an appropriate colour) Regards Antoine.
On Thu, Oct 3, 2013 at 12:05 PM, Guido van Rossum
We already have adopted a feature that plugged most viable attacks on web apps, I think that's enough.
Actually... we did not do a very good job on that: http://bugs.python.org/issue14621 The point of allowing alternates is to let people with needs choose something else if they want without having to jump through hoops of modifying the guts of Python to do it. I don't expect python as shipped by most OS distros to use anything other than our default. -gps
Just some comments.
the first time time with a bit shift of 7
Double "time".
with a 128bit seed and 64-bit output
Inconsistency in hyphenation. There are similar issues in other places.
bytes_hash provides the tp_hash slot function for unicode.
Typo. Should be "unicode_hash".
len = PyUnicode_GET_LENGTH(self);
switch (PyUnicode_KIND(self)) {
case PyUnicode_1BYTE_KIND: {
    const Py_UCS1 *c = PyUnicode_1BYTE_DATA(self);
    x = _PyHash_Func->hashfunc(c, len * sizeof(Py_UCS1));
    break;
}
case PyUnicode_2BYTE_KIND: {
    ...
x = _PyHash_Func->hashfunc(PyUnicode_BYTE_DATA(self), PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));
Equal hash values result in a hash collision and therefore cause a minor speed penalty for dicts and sets with mixed keys. The cause of the
collision could be removed
I have doubts about this. If one collects bytes and strings in one dictionary, this equality will at most double the number of collisions (a DoS attack needs to increase it by a factor of thousands or millions), so it doesn't matter. On the other hand, if one deliberately uses bytes and str subclasses with overridden equality, the same hash for ASCII bytes and strings may be needed.
For very short strings the setup costs for SipHash dominate its speed, but it is still in the same order of magnitude as the current FNV code.
We could use another algorithm for very short strings if it matters.
The summarized total runtime of the benchmark is within 1% of the runtime of an unmodified Python 3.4 binary.
What about deviations of individual tests?
Am 03.10.2013 21:53, schrieb Serhiy Storchaka:
the first time time with a bit shift of 7
Double "time".
thx, fixed
with a 128bit seed and 64-bit output
Inconsistent hyphenation. There are similar issues in other places.
I have unified the use of hyphens, thx!
bytes_hash provides the tp_hash slot function for unicode.
Typo. Should be "unicode_hash".
Fixed
x = _PyHash_Func->hashfunc(PyUnicode_BYTE_DATA(self), PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));
Oh nice, that's easier to read. It's PyUnicode_DATA().
I have doubts about this. If one collects bytes and strings in one dictionary, this equality will at most double the number of collisions (a DoS attack needs to increase it by a factor of thousands or millions), so it doesn't matter. On the other hand, if one deliberately uses bytes and str subclasses with overridden equality, the same hash for ASCII bytes and strings may be needed.
It's not a big problem. I merely wanted to point out that there is a simple possibility for a minor optimization. That's all. :)
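For illustration, a quick check of the CPython-specific behaviour under discussion: ASCII str and equal bytes currently hash over the same raw bytes, so their hashes collide, but the keys never compare equal, so a dict keeps both and only pays the extra probing mentioned in the PEP:

```python
# CPython detail: an ASCII str and the equal bytes object hash over the
# same underlying byte data, so their hashes are equal -- but since
# bytes and str never compare equal in Python 3, a dict stores both
# keys and only pays a minor probing cost for the collision.
s, b = "spam", b"spam"
print(hash(s) == hash(b))   # True on CPython
d = {s: 1, b: 2}
print(len(d), d[s], d[b])   # 2 1 2
```

This is an implementation detail, not a language guarantee, which is why removing the collision would be a legitimate (if minor) optimization.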
For very short strings the setup costs for SipHash dominate its speed, but it is still in the same order of magnitude as the current FNV code.
We could use another algorithm for very short strings if it matters.
I thought of that, too. The threshold is rather small, though. As far as I remember, an effective hash collision DoS works with as few as 7 or 8 chars.
The summarized total runtime of the benchmark is within 1% of the runtime of an unmodified Python 3.4 binary.
What about deviations of individual tests?
Here you go. http://pastebin.com/dKdnBCgb http://pastebin.com/wtfUS5Zz Christian
2013/10/3 Christian Heimes
A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
"Py_Initialize" is not a good guard. Try for example "python3 -X faulthandler": PyObject_Hash() is called before Py_Initialize() to add the "faulthandler" key to the sys._xoptions dictionary. Today many Python internal functions are used before Python is initialized... See PEP 432, which proposes to improve the situation: http://www.python.org/dev/peps/pep-0432/ Victor
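Victor's example is easy to reproduce: the -X options end up in a dict (sys._xoptions), and building that dict already requires hashing str keys while the interpreter is still starting up:

```shell
# The -X options are collected into the sys._xoptions dict; inserting
# the "faulthandler" key hashes a str before Py_Initialize() completes
# in CPython's own main(), which is exactly the ordering problem here.
python3 -X faulthandler -c "import sys; print(sys._xoptions)"
# {'faulthandler': True}
```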
On 4 Oct 2013 06:08, "Victor Stinner"
2013/10/3 Christian Heimes
: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
"Py_Initialize" is not a good guard. Try for example "python3 -X faulthandler": PyObject_Hash() is called before Py_Initialize() to add the "faulthandler" key to the sys._xoptions dictionary.
Today many Python internal functions are used before Python is initialized... See PEP 432, which proposes to improve the situation: http://www.python.org/dev/peps/pep-0432/
That problem exists because our main function doesn't follow the C API usage rules, though. We require other embedding applications to be better behaved than that if they want support :) That said, while I'm mostly in favour of the PEP, I think setting the algorithm should be a private API for 3.4. I do agree that since the platform support for SipHash is slightly narrower, we need to keep the existing hash algorithm around, make it relatively easy to enable and ensure we continue to test it on the build bots. I believe that last requirement for buildbot testing is the one that should drive the design of the private configuration API. Cheers, Nick.
Victor _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
On Thu, Oct 3, 2013 at 2:13 PM, Nick Coghlan
On 4 Oct 2013 06:08, "Victor Stinner"
wrote: 2013/10/3 Christian Heimes
: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
"Py_Initialize" is not a good guard. Try for example "python3 -X faulthandler": PyObject_Hash() is called before Py_Initialize() to add the "faulthandler" key to the sys._xoptions dictionary.
Today many Python internal functions are used before Python is initialized... See PEP 432, which proposes to improve the situation: http://www.python.org/dev/peps/pep-0432/
That problem exists because our main function doesn't follow the C API usage rules, though. We require other embedding applications to be better behaved than that if they want support :)
That said, while I'm mostly in favour of the PEP, I think setting the algorithm should be a private API for 3.4.
I do agree that since the platform support for SipHash is slightly narrower, we need to keep the existing hash algorithm around, make it relatively easy to enable and ensure we continue to test it on the build bots.
I believe that last requirement for buildbot testing is the one that should drive the design of the private configuration API.
I'll defer to Nick for approval of this PEP. -- --Guido van Rossum (python.org/~guido)
On 4 Oct 2013 07:17, "Guido van Rossum"
On Thu, Oct 3, 2013 at 2:13 PM, Nick Coghlan
wrote: On 4 Oct 2013 06:08, "Victor Stinner"
wrote: 2013/10/3 Christian Heimes
: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
"Py_Initialize" is not a good guard. Try for example "python3 -X faulthandler": PyObject_Hash() is called before Py_Initialize() to add the "faulthandler" key to the sys._xoptions dictionary.
Today many Python internal functions are used before Python is initialized... See PEP 432, which proposes to improve the situation: http://www.python.org/dev/peps/pep-0432/
That problem exists because our main function doesn't follow the C API usage rules, though. We require other embedding applications to be better behaved than that if they want support :)
That said, while I'm mostly in favour of the PEP, I think setting the algorithm should be a private API for 3.4.
I do agree that since the platform support for SipHash is slightly narrower, we need to keep the existing hash algorithm around, make it relatively easy to enable and ensure we continue to test it on the build bots.
I believe that last requirement for buildbot testing is the one that should drive the design of the private configuration API.
I'll defer to Nick for approval of this PEP.
Sure, I'm happy to be BDFL delegate for this one. Christian, go ahead and add the appropriate header to the PEP. Cheers, Nick.
-- --Guido van Rossum (python.org/~guido)
Hi, I have found a link to a speed comparison of hash functions: http://code.google.com/p/xxhash/ Regards, Wolfgang
On Thu, Oct 3, 2013 at 1:06 PM, Victor Stinner
2013/10/3 Christian Heimes
: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time.
"Py_Initialize" is not a good guard. Try for example "python3 -X faulthandler": PyObject_Hash() is called before Py_Initialize() to add the "faulthandler" key to the sys._xoptions dictionary.
Today many Python internal functions are used before Python is initialized... See PEP 432, which proposes to improve the situation: http://www.python.org/dev/peps/pep-0432/
Then I withdraw my desire for setting it before that; compile time is fine. But how would you make that usefully easier than the existing method of replacing the two functions in our bytes and str implementations? If you want a compile time flag, perhaps just call it --enable-sip-hash and get it over with, since that's what we really want. ;) -gps
On Thu, Oct 3, 2013 at 11:42 AM, Christian Heimes
Hi,
some of you may have seen that I'm working on a PEP for a new hash API and new algorithms for hashing of bytes and str. The PEP has three major aspects. It introduces DJB's SipHash as a secure hash algorithm, changes the hash API to process blocks of data instead of characters, and it adds an API to make the algorithm pluggable. A draft is already available [1].
Now I've got some negative feedback on the 'pluggable' aspect of the new PEP on Twitter [2]. I'd like to get feedback from you before I finalize the PEP.
The PEP proposes a pluggable hash API for a couple of reasons. I'd like to give users of Python a chance to replace a secure hash algorithm with a faster hash algorithm. SipHash is about as fast as FNV for common cases, as our implementation of FNV processes only 8 to 32 bits per cycle instead of 32 or 64. I haven't actually benchmarked how a faster hash algorithm affects a real program, though ...
I'd also like to make it easier to replace the hash algorithm with a different one in case a vulnerability is found. With the new API, vendors and embedders have an easy and clean way to use their own hash implementation, or an optimized version that is more suitable for their platform. For example, a mobile phone vendor could provide an optimized implementation with ARM NEON intrinsics.
On which level should Python support a pluggable hash algorithm?
1) Compile time option: The hash code is compiled into Python's core. Embedders have to recompile Python with different options to replace the function.
This would be fine with me.
2) Library option: A hash algorithm can be added and one available hash algorithm can be set before Py_Initialize() is called for the first time. The approach gives embedders the chance to set their own algorithm without recompiling Python.
This would be more convenient. But I only want to replace the algorithm in my code embedding CPython, not add one. So long as the performance impact of supporting this is negligible, do that (let me supply the algorithm to be used for each of bytes and str before Py_Initialize is called).
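To make the "library option" concrete, here is a minimal sketch of what such a registry could look like. All names here are invented for illustration (the draft PEP's actual C API may differ); the point is the one rule everyone is debating: the algorithm is swappable only until initialization, then frozen.

```python
# Hypothetical sketch of the "library option": a process-wide hash
# function that an embedder may replace, but only before an
# initialization step that mirrors the proposed Py_Initialize() rule.
# All names here are invented for illustration.

def fnv1a_64(data: bytes) -> int:
    """Plain 64-bit FNV-1a, standing in for the current string hash."""
    h = 0xcbf29ce484222325                      # FNV-1a offset basis
    for byte in data:
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

class HashRegistry:
    def __init__(self):
        self._func = fnv1a_64        # default algorithm
        self._initialized = False

    def set_algorithm(self, func):
        """Install a replacement; only legal before initialize()."""
        if self._initialized:
            raise RuntimeError("hash algorithm is frozen after init")
        self._func = func

    def initialize(self):            # stands in for Py_Initialize()
        self._initialized = True

    def hash_bytes(self, data: bytes) -> int:
        return self._func(data)

registry = HashRegistry()
registry.set_algorithm(fnv1a_64)     # allowed: before initialize()
registry.initialize()
print(hex(registry.hash_bytes(b"spam")))
try:
    registry.set_algorithm(fnv1a_64) # rejected: frozen after init
except RuntimeError as exc:
    print(exc)
```

The freeze-after-init design is what Victor's objection targets: in CPython's real main(), hashing happens before Py_Initialize() finishes, so the "before first call" window is murkier than this sketch suggests.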
3) Startup options: Like 2) plus an additional environment variable and command line argument to select an algorithm. With a startup option users can select a different algorithm themselves.
I can't imagine any reason I or anyone else would ever want this. Side note: in Python 2 and earlier the hash algorithm went to great lengths to make unicode and bytes values that were the same (in at least ascii, possibly latin-1 or utf-8 as well) hash to the same value. Is it safe to assume that very annoying, performance-sapping invariant is no longer required in Python 3, given that the whole default encoding for bytes-to-unicode comparisons is gone (and thus the need for them to land in the same dict hash bucket)? -gps
participants (13)
-
"Martin v. Löwis"
-
Antoine Pitrou
-
Armin Rigo
-
Christian Heimes
-
Gregory P. Smith
-
Guido van Rossum
-
Larry Hastings
-
martin@v.loewis.de
-
Nick Coghlan
-
Serhiy Storchaka
-
Steven D'Aprano
-
Victor Stinner
-
Wolfgang Langner