Hashes in Python 3.5 for tuples and frozensets
This may be known but I wanted to ask this esteemed body first. I understand that from Python 3.3 there was a security fix to ensure that different Python processes would generate different hash values for the same input - to prevent denial of service based on crafted hash conflicts.

I opened two Python REPLs on my 64-bit Linux PC and did the following:

Terminal 1:
>>> hash('Hello World')
-1010252950208276719
>>> hash(frozenset({1, 9}))
-7625378979602737914
>>> hash(frozenset({300, 301}))
-8571255922896611313
>>> hash((1, 9))
3713081631926832981
>>> hash((875, 932))
3712694086932196356

Terminal 2:
>>> hash('Hello World')
-8267767374510285039
>>> hash(frozenset({1, 9}))
-7625378979602737914
>>> hash(frozenset({300, 301}))
-8571255922896611313
>>> hash((1, 9))
3713081631926832981
>>> hash((875, 932))
3712694086932196356

As can be seen, taking the hash of a string does indeed create a different value between the two processes (as expected). However the frozenset hash is the same in both cases, as is the hash of the tuples - suggesting that the vulnerability resolved in Python 3.3 wasn't resolved across all potentially hashable values. I even used different large numbers to ensure that the integers weren't being interned.

I can imagine that frozensets aren't used frequently as hash keys - but I would think that tuples are regularly used. Since their hashes are not salted, does the vulnerability still exist in some form?

--
Anthony Flury
email: Anthony.flury@btinternet.com
Twitter: @TonyFlury (https://twitter.com/TonyFlury/)
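The two-terminal experiment can be reproduced deterministically by pinning PYTHONHASHSEED in child interpreters. This is a sketch (the fresh_hash helper is my own illustration, not something from the thread):

```python
import os
import subprocess
import sys

def fresh_hash(expr: str, seed: str) -> int:
    """Evaluate hash(<expr>) in a brand-new interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", f"print(hash({expr}))"], env=env)
    return int(out)

# String hashes depend on the per-process seed...
assert fresh_hash("'Hello World'", "1") != fresh_hash("'Hello World'", "2")
# ...but tuple and frozenset hashes built from ints do not.
assert fresh_hash("(1, 9)", "1") == fresh_hash("(1, 9)", "2")
assert fresh_hash("frozenset({300, 301})", "1") == fresh_hash("frozenset({300, 301})", "2")
```

Each call starts a fresh process, so this shows the same behaviour as the two terminals above without relying on whatever seed the OS happened to pick.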
On May 16, 2018, at 5:48 PM, Anthony Flury via Python-Dev
wrote: However the frozenset hash is the same in both cases, as is the hash of the tuples - suggesting that the vulnerability resolved in Python 3.3 wasn't resolved across all potentially hashable values.
You are correct. The hash randomization only applies to strings. None of the other object hashes were altered. Whether this is a vulnerability or not depends greatly on what is exposed to users (generally strings) and how it is used.

For the most part, it is considered a feature that integers hash to themselves. That is very fast to compute :-) Also, it tends to prevent hash collisions for consecutive integers.

Raymond
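Raymond's point is easy to check. A small sketch of CPython's integer hashing behaviour (this is an implementation detail of CPython, not a language guarantee):

```python
import sys

# Small non-negative integers hash to themselves in CPython...
assert hash(42) == 42
# ...so consecutive integers get consecutive hashes and never collide.
assert [hash(n) for n in range(5)] == [0, 1, 2, 3, 4]

# One quirk: -1 is reserved as an error code in the C-level hashing API,
# so hash(-1) is nudged to -2.
assert hash(-1) == -2

# Larger ints are reduced modulo the platform's hash modulus
# (2**61 - 1 on 64-bit builds, exposed via sys.hash_info).
assert hash(sys.hash_info.modulus) == 0
```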
On 2018-05-16 18:10, Raymond Hettinger wrote:
On May 16, 2018, at 5:48 PM, Anthony Flury via Python-Dev
wrote: However the frozenset hash is the same in both cases, as is the hash of the tuples - suggesting that the vulnerability resolved in Python 3.3 wasn't resolved across all potentially hashable values.
You are correct. The hash randomization only applies to strings. None of the other object hashes were altered. Whether this is a vulnerability or not depends greatly on what is exposed to users (generally strings) and how it is used.
For the most part, it is considered a feature that integers hash to themselves. That is very fast to compute :-) Also, it tends to prevent hash collisions for consecutive integers.
Raymond is 100% correct. Just one small nitpick: randomization applies to both str and bytes.

Christian
Hi,
String hash is randomized, but not the integer hash:
$ python3.5 -c 'print(hash("abc"))'
-8844814677999896014
$ python3.5 -c 'print(hash("abc"))'
-7757160699952389646
$ python3.5 -c 'print(hash(1))'
1
$ python3.5 -c 'print(hash(1))'
1
The frozenset hash is combined from the hashes of the set's values, so it's only randomized if the values' hashes are randomized.
The denial of service is more likely to occur with strings as keys than with integers.
See the following link for more information:
http://python-security.readthedocs.io/vuln/cve-2012-1150_hash_dos.html
Victor
2018-05-16 17:48 GMT-04:00 Anthony Flury via Python-Dev
This may be known but I wanted to ask this esteemed body first.
I understand that from Python3.3 there was a security fix to ensure that different python processes would generate different hash value for the same input - to prevent denial of service based on crafted hash conflicts.
I opened two python REPLs on my Linux 64bit PC and did the following
Terminal 1:
>>> hash('Hello World')
-1010252950208276719
>>> hash(frozenset({1, 9}))
-7625378979602737914
>>> hash(frozenset({300, 301}))
-8571255922896611313
>>> hash((1, 9))
3713081631926832981
>>> hash((875, 932))
3712694086932196356
Terminal 2:
>>> hash('Hello World')
-8267767374510285039
>>> hash(frozenset({1, 9}))
-7625378979602737914
>>> hash(frozenset({300, 301}))
-8571255922896611313
>>> hash((1, 9))
3713081631926832981
>>> hash((875, 932))
3712694086932196356
As can be seen, taking the hash of a string does indeed create a different value between the two processes (as expected).
However the frozenset hash is the same in both cases, as is the hash of the tuples - suggesting that the vulnerability resolved in Python 3.3 wasn't resolved across all potentially hashable values. I even used different large numbers to ensure that the integers weren't being interned.
I can imagine that frozensets aren't used frequently as hash keys - but I would think that tuples are regularly used. Since their hashes are not salted, does the vulnerability still exist in some form?
--
Anthony Flury
email: Anthony.flury@btinternet.com
Twitter: @TonyFlury (https://twitter.com/TonyFlury/)
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/vstinner%40redhat.com
Victor,

Thanks for the link, but to be honest it will just confuse people - neither the link nor the related bpo entries state that the fix is limited to strings. They simply talk about hash randomization - which in my opinion implies ALL hash algorithms; which is why I asked the question. I am not sure how much should be exposed about the scope of security fixes, but you can understand my (and others') confusion.

I am aware that applications shouldn't make assumptions about any given hash value - apart from some simple assumptions based on hash value equality (i.e. if two objects have different hash values they can't be equal).

BTW: this question was prompted by a question on a social media platform about whether hash values are transferable across platforms. Everything I could find stated that after Python 3.3 ALL hash values were randomized - but that clearly isn't the case; and the original questioner identified that some hash values are randomized and others aren't.

I did suggest strongly to the original questioner that relying on the same hash value across different platforms wasn't a clever solution - their original plan was to store hash values in a cross-system database to enable quick retrieval of data (!!!). I did remind the OP that a hash value wasn't guaranteed to be unique anyway - they might come across two different values with the same hash, and have no way to distinguish between them if all they have is the hash. Hopefully their revised design will store the key, not the hash.

On 17/05/18 07:38, Victor Stinner wrote:
Hi,
String hash is randomized, but not the integer hash:
$ python3.5 -c 'print(hash("abc"))'
-8844814677999896014
$ python3.5 -c 'print(hash("abc"))'
-7757160699952389646
$ python3.5 -c 'print(hash(1))'
1
$ python3.5 -c 'print(hash(1))'
1
frozenset hash is combined from values of the set. So it's only randomized if values hashes are randomized.
The denial of service is more likely to occur with strings as keys, than with integers.
See the following link for more information: http://python-security.readthedocs.io/vuln/cve-2012-1150_hash_dos.html
Victor
--
Anthony Flury
email: Anthony.flury@btinternet.com
Twitter: @TonyFlury (https://twitter.com/TonyFlury/)
Anthony Flury via Python-Dev wrote:
I did suggest strongly to the original questioner that relying on the same hash value across different platforms wasn't a clever solution
Even without randomisation, I wouldn't rely on hash values staying the same between different Python versions. Storing them in a database sounds like a really bad idea. -- Greg
On Thu, May 17, 2018 at 5:21 PM, Anthony Flury via Python-Dev
Victor, Thanks for the link, but to be honest it will just confuse people - neither the link nor the related bpo entries state that the fix is limited to strings. They simply talk about hash randomization - which in my opinion implies ALL hash algorithms; which is why I asked the question.
I am not sure how much should be exposed about the scope of security fixes but you can understand my (and others') confusion.
I am aware that applications shouldn't make assumptions about any given hash value - apart from some simple assumptions based on hash value equality (i.e. if two objects have different hash values they can't be equal).
The hash values of Python objects are calculated by the __hash__ method, so arbitrary objects can do what they like, including degenerate algorithms such as:

class X:
    def __hash__(self):
        return 7

So it's impossible to randomize ALL hashes at the language level. Only str and bytes hashes are randomized, because they're the ones most likely to be exploitable - for instance, a web server will receive a query like "http://spam.example/target?a=1&b=2&c=3" and provide a dictionary {"a":1, "b":2, "c":3}. Similarly, a JSON decoder is always going to create string keys in its dictionaries (JSON objects). Do you know of any situation in which an attacker can provide the keys for a dict/set as integers?
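Chris's degenerate example is exactly the attack shape behind the hash DoS: when every key collides, dict insertion degrades from roughly O(n) to O(n**2). A quick illustrative sketch (the Colliding class and timing helper are mine, not from the thread):

```python
import time

class Colliding:
    """Every instance hashes to the same value, forcing worst-case probing."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 7  # the degenerate hash from the example above
    def __eq__(self, other):
        return isinstance(other, Colliding) and self.n == other.n

def insert_time(keys):
    """Time building a dict from the given keys."""
    start = time.perf_counter()
    {k: None for k in keys}
    return time.perf_counter() - start

n = 2000
t_colliding = insert_time([Colliding(i) for i in range(n)])
t_distinct = insert_time([str(i) for i in range(n)])
# All-colliding keys force ~n**2/2 equality checks instead of ~n probes.
assert t_colliding > t_distinct
print(f"colliding: {t_colliding:.4f}s  distinct: {t_distinct:.4f}s")
```

With crafted string keys an attacker could trigger the same quadratic behaviour remotely, which is what the string-hash randomization closes off.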
BTW: This question was prompted by a question on a social media platform about whether hash values are transferable across platforms. Everything I could find stated that after Python 3.3 ALL hash values were randomized - but that clearly isn't the case; and the original questioner identified that some hash values are randomized and others aren't.
That's actually immaterial. Even if the hashes weren't actually randomized, you shouldn't be making assumptions about anything specific in the hash, save that *within one Python process*, two equal values will have equal hashes (and therefore two objects with unequal hashes will not be equal).
I did suggest strongly to the original questioner that relying on the same hash value across different platforms wasn't a clever solution - their original plan was to store hash values in a cross-system database to enable quick retrieval of data (!!!). I did remind the OP that a hash value wasn't guaranteed to be unique anyway - they might come across two different values with the same hash, and have no way to distinguish between them if all they have is the hash. Hopefully their revised design will store the key, not the hash.
Uhh.... if you're using a database, let the database do the work of being a database. I don't know what this "cross system database" would be implemented in, but if it's a proper multi-user relational database engine like PostgreSQL, it's already going to have way better indexing than anything you'd do manually. I think there are WAY better solutions than worrying about Python's inbuilt hashing. If you MUST hash your data for sharing and storage, the easiest solution is to just use a cryptographic hash straight out of hashlib.py. ChrisA
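For reference, Chris's hashlib suggestion might look like this; unlike the built-in hash(), a cryptographic digest is stable across processes, platforms and Python versions (the stable_key helper name is my own):

```python
import hashlib

def stable_key(text: str) -> str:
    """A digest that is stable across processes, platforms and Python versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# This value never changes between runs or machines:
assert stable_key("Hello World") == (
    "a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e")
```

Note that even SHA-256 digests are not guaranteed unique in principle, so the stored key should still be the source of truth, as suggested above.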
Chris,

I entirely agree. The same questioner also asked about the fastest data type to use as a key in a dictionary, and which data structure is fastest. I get the impression the person is very into micro-optimization, without profiling their application. It seems every choice is made based on the speed of that operation, without consideration of how often that operation is used.

On 17/05/18 09:16, Chris Angelico wrote:
On Thu, May 17, 2018 at 5:21 PM, Anthony Flury via Python-Dev
wrote: Victor, Thanks for the link, but to be honest it will just confuse people - neither the link nor the related bpo entries state that the fix is limited to strings. They simply talk about hash randomization - which in my opinion implies ALL hash algorithms; which is why I asked the question.
I am not sure how much should be exposed about the scope of security fixes but you can understand my (and others') confusion.
I am aware that applications shouldn't make assumptions about any given hash value - apart from some simple assumptions based on hash value equality (i.e. if two objects have different hash values they can't be equal).

The hash values of Python objects are calculated by the __hash__ method, so arbitrary objects can do what they like, including degenerate algorithms such as:

class X:
    def __hash__(self):
        return 7

Agreed - I should have said the default hash algorithm. Hashes for custom objects are entirely application dependent.
So it's impossible to randomize ALL hashes at the language level. Only str and bytes hashes are randomized, because they're the ones most likely to be exploitable - for instance, a web server will receive a query like "http://spam.example/target?a=1&b=2&c=3" and provide a dictionary {"a":1, "b":2, "c":3}. Similarly, a JSON decoder is always going to create string keys in its dictionaries (JSON objects). Do you know of any situation in which an attacker can provide the keys for a dict/set as integers?

I was just asking the question, rather than critiquing the fault-fix. I am actually more concerned that the documentation relating to the fix doesn't make it clear that only strings have their hashes randomised.
That's actually immaterial. Even if the hashes weren't actually randomized, you shouldn't be making assumptions about anything specific in the hash, save that *within one Python process*, two equal values will have equal hashes (and therefore two objects with unequal hashes will not be equal).

Entirely agree - I was just trying to get to the bottom of the difference, especially considering that the documentation I could find implied that all hash algorithms had been randomized.

Uhh.... if you're using a database, let the database do the work of being a database. I don't know what this "cross system database" would be implemented in, but if it's a proper multi-user relational database engine like PostgreSQL, it's already going to have way better indexing than anything you'd do manually. I think there are WAY better solutions than worrying about Python's inbuilt hashing.

Agreed.

If you MUST hash your data for sharing and storage, the easiest solution is to just use a cryptographic hash straight out of hashlib.py.
As stated before, I think the original questioner was intent on micro-optimizations - they had hit on the idea that storing an integer would be quicker than storing a string, entirely ignoring both the impracticality of trying to encode all strings into a single value (since hashes aren't guaranteed not to collide) and the difficulty of reversing that translation once the stored key had been retrieved.
Thanks for your comments :-)

--
Anthony Flury
email: Anthony.flury@btinternet.com
Twitter: @TonyFlury (https://twitter.com/TonyFlury/)
On Fri, May 18, 2018 at 12:15 AM, Anthony Flury via Python-Dev
Chris, I entirely agree. The same questioner also asked about the fastest data type to use as a key in a dictionary; and which data structure is fastest. I get the impression the person is very into micro-optimization, without profiling their application. It seems every choice is made based on the speed of that operation; without consideration of how often that operation is used.
Sounds like we're on the same page here.
On 17/05/18 09:16, Chris Angelico wrote:
The hash values of Python objects are calculated by the __hash__ method, so arbitrary objects can do what they like, including degenerate algorithms such as:
class X:
    def __hash__(self):
        return 7
Agreed - I should have said the default hash algorithm. Hashes for custom objects are entirely application dependent.
There isn't a single "default hash algorithm"; in fact, I'm not sure that there's even a single algorithm used for all strings. Certainly the algorithm used for integers is completely different from the one(s) used for strings; we have a guarantee that ints and floats representing the same real number are going to have the same hash (even if that hash isn't equal to the number - hash(1e22) == hash(10**22) != 10**22 is True), since they compare equal.

The algorithms used and the resulting hashes may change between Python versions, when you change interpreters (PyPy vs Jython vs CPython vs Brython...), or even, I believe, when you change word sizes (32-bit vs 64-bit).

So, this is (a) a premature optimization, (b) dependent on something that's not guaranteed, and (c) a great way to paint yourself into a corner. Perfect! :)

ChrisA
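Chris's numeric examples can be checked directly; the cross-type guarantee comes from the unified numeric hash, whose parameters are exposed in sys.hash_info (the modulus value below assumes a 64-bit CPython build):

```python
import sys
from fractions import Fraction

# Equal numbers must hash equal across types, because they compare equal
# (10**22 is exactly representable as a float, so 1e22 == 10**22).
assert 1e22 == 10**22
assert hash(1e22) == hash(10**22)

# ...but the shared hash is not the number itself: 10**22 exceeds the
# hash modulus, so it wraps around modulo sys.hash_info.modulus.
assert hash(10**22) != 10**22
assert hash(10**22) == 10**22 % sys.hash_info.modulus

# The same invariant ties together bool, int, float and Fraction:
assert hash(True) == hash(1) == hash(1.0) == hash(Fraction(1, 1))
```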
participants (6)
- Anthony Flury
- Chris Angelico
- Christian Heimes
- Greg Ewing
- Raymond Hettinger
- Victor Stinner