Hash randomization for which types?

Hello, Recent Python versions randomize the hashes of str, bytes and datetime objects. I suppose that the choice of these three types is the result of a compromise. Has this been discussed somewhere publicly? I'm not a web programmer, but don't web applications also use dictionaries that are indexed by, say, tuples of integers? Just curious... Thanks, Christoph

On 2/16/2016 1:48 AM, Christoph Groth wrote:
Hello,
Recent Python versions randomize the hashes of str, bytes and datetime objects. I suppose that the choice of these three types is the result of a compromise. Has this been discussed somewhere publicly?
Search archives of this list... it was discussed at length.
I'm not a web programmer, but don't web applications also use dictionaries that are indexed by, say, tuples of integers?
Sure, and that is the biggest part of the reason they were randomized. I think hashes of all types have been randomized, not _just_ the list you mentioned.

On Tue, Feb 16, 2016 at 11:56:55AM -0800, Glenn Linderman wrote:
On 2/16/2016 1:48 AM, Christoph Groth wrote:
Hello,
Recent Python versions randomize the hashes of str, bytes and datetime objects. I suppose that the choice of these three types is the result of a compromise. Has this been discussed somewhere publicly?
Search archives of this list... it was discussed at length.
There's a lot of discussion on the mailing list. I think that this is the very start of it, in Dec 2011:
https://mail.python.org/pipermail/python-dev/2011-December/115116.html
and continuing into 2012, for example:
https://mail.python.org/pipermail/python-dev/2012-January/115577.html
https://mail.python.org/pipermail/python-dev/2012-January/115690.html
and a LOT more, spread over many different threads and subject lines.
You should also read the issue on the bug tracker:
http://bugs.python.org/issue13703
My recollection is that it was decided that only strings and bytes need to have their hashes randomized, because only strings and bytes can be used directly from user input without first having a conversion step with likely input range validation. In addition, changing the hash for ints would break too much code for too little benefit: unlike strings, where hash collision attacks on web apps are proven and easy, hash collision attacks based on ints are more difficult and rare.
See also the comment here:
http://bugs.python.org/issue13703#msg151847
I'm not a web programmer, but don't web applications also use dictionaries that are indexed by, say, tuples of integers?
Sure, and that is the biggest part of the reason they were randomized.
But they aren't, as far as I can see:
[steve@ando 3.6]$ ./python -c "print(hash((23, 42, 99, 100)))"
1071302475
[steve@ando 3.6]$ ./python -c "print(hash((23, 42, 99, 100)))"
1071302475
Web apps can use dicts indexed by anything that they like, but unless there is an actual attack, what does it matter? Guido makes a good point about security here:
https://mail.python.org/pipermail/python-dev/2013-October/129181.html
I think hashes of all types have been randomized, not _just_ the list you mentioned.
I'm pretty sure that's not actually the case. Using 3.6 from the repo (admittedly not fully up to date though), I can see hash randomization working for strings:
[steve@ando 3.6]$ ./python -c "print(hash('abc'))"
11601873
[steve@ando 3.6]$ ./python -c "print(hash('abc'))"
-2009889747
but not for ints:
[steve@ando 3.6]$ ./python -c "print(hash(42))"
42
[steve@ando 3.6]$ ./python -c "print(hash(42))"
42
which agrees with my recollection that only strings and bytes would be randomized.
-- Steve
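The manual check above can be automated. The following is a minimal sketch, assuming Python 3 and the documented PYTHONHASHSEED environment variable; the helper names and the two seed values are made up for illustration. It starts a fresh interpreter per seed and reports whether the hash of an expression changes with the seed.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(expr, seed):
    """Evaluate hash(<expr>) in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash({}))".format(expr)], env=env
    )
    return int(out)


def varies_with_seed(expr):
    """True if the hash differs between two arbitrary, distinct seeds (1 and 2)."""
    return hash_in_fresh_interpreter(expr, 1) != hash_in_fresh_interpreter(expr, 2)


# Strings and bytes are expected to vary; ints and tuples of ints are not.
for expr in ("'abc'", "b'abc'", "42", "(23, 42, 99, 100)"):
    print("{:20} varies with seed: {}".format(expr, varies_with_seed(expr)))
```

Fixing the seed rather than relying on the default random one makes the comparison repeatable; a tuple that contains a string would vary, because a tuple's hash is derived from the hashes of its items.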

I think you are right. Here is the source code in Python 2.7.11:

    long PyObject_Hash(PyObject *v) {
        PyTypeObject *tp = v->ob_type;
        if (tp->tp_hash != NULL)
            return (*tp->tp_hash)(v);
        /* To keep to the general practice that inheriting
         * solely from object in C code should work without
         * an explicit call to PyType_Ready, we implicitly call
         * PyType_Ready here and then check the tp_hash slot again
         */
        if (tp->tp_dict == NULL) {
            if (PyType_Ready(tp) < 0)
                return -1;
            if (tp->tp_hash != NULL)
                return (*tp->tp_hash)(v);
        }
        if (tp->tp_compare == NULL && RICHCOMPARE(tp) == NULL) {
            return _Py_HashPointer(v); /* Use address as hash value */
        }
        /* If there's a cmp but no hash defined, the object can't be hashed */
        return PyObject_HashNotImplemented(v);
    }

If the object's type has a hash function, it is used. If not, and the type defines no comparison, _Py_HashPointer is used, which does not use _Py_HashSecret. I also checked the references to _Py_HashSecret: only bufferobject, unicodeobject and stringobject use it.

-- There are gaps between the joints, and the blade has no thickness; put what has no thickness into those gaps, and there is more than enough room for the blade to move.
blog: http://shell909090.org/blog/ twitter: @shell909090 <https://twitter.com/shell909090> about.me: http://about.me/shell909090
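The three outcomes of that dispatch can be observed from Python code. This is a rough Python 3 analogue rather than a model of the 2.7 source: the class names are invented for illustration, and in Python 3 the "comparison but no hash" case shows up as a class that defines `__eq__` without `__hash__`.

```python
class CustomHash:
    """Defines __hash__, so hash() calls it (the tp_hash branch)."""

    def __hash__(self):
        return 1234


class Plain:
    """Defines neither __eq__ nor __hash__; inherits object's identity-based hash."""


class EqOnly:
    """Defines __eq__ but not __hash__; Python 3 then sets __hash__ to None."""

    def __eq__(self, other):
        return isinstance(other, EqOnly)


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


plain = Plain()
print(hash(CustomHash()))          # value returned by the custom __hash__
print(hash(plain) == hash(plain))  # identity-based hash is stable for one object
print(is_hashable(EqOnly()))       # comparison without hash: not hashable
```

None of these paths involves the hash secret, which matches the observation that only the string-like types refer to it.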

Note that hashing in Python 2.7 and in 3.x prior to 3.4 is simply broken, and the randomization does not do nearly enough; see https://bugs.python.org/issue14621
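To see which string hash algorithm a given interpreter uses, `sys.hash_info` can be inspected. A small sketch, assuming Python 3.4 or later, where the `algorithm`, `hash_bits` and `seed_bits` fields exist; the reported values depend on the version and on how the interpreter was built.

```python
import sys

info = sys.hash_info

# Name of the algorithm used for str, bytes and memoryview hashing.
print("algorithm:", info.algorithm)
# Internal output size of that algorithm and size of its seed key, in bits.
print("hash_bits:", info.hash_bits, "seed_bits:", info.seed_bits)
# Width in bits of the hash values returned by hash().
print("width:", info.width)
# Non-zero when hash randomization is enabled for this process.
print("hash_randomization flag:", sys.flags.hash_randomization)
```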

Steven D'Aprano wrote:
On Tue, Feb 16, 2016 at 11:56:55AM -0800, Glenn Linderman wrote:
On 2/16/2016 1:48 AM, Christoph Groth wrote:
Recent Python versions randomize the hashes of str, bytes and datetime objects. I suppose that the choice of these three types is the result of a compromise. Has this been discussed somewhere publicly?
Search archives of this list... it was discussed at length.
There's a lot of discussion on the mailing list. I think that this is the very start of it, in Dec 2011: (...)
I tried searching myself for an hour or so, but though I found many discussions, I didn't see any discussion about whether hashes of other types should be randomized as well. The relevant PEP also doesn't touch this issue.
My recollection is that it was decided that only strings and bytes need to have their hashes randomized, because only strings and bytes can be used directly from user-input without first having a conversion step with likely input range validation. In addition, changing the hash for ints would break too much code for too little benefit: unlike strings, where hash collision attacks on web apps are proven and easy, hash collision attacks based on ints are more difficult and rare.
See also the comment here:
Perfect, that's exactly what I was looking for. I am reassured that this has been thought through. Thanks a lot! Christoph

Glenn Linderman writes:
I think hashes of all types have been randomized, not _just_ the list you mentioned.
Yes. There's only one hash function used, which operates on byte streams IIRC. That function now has a random offset. The details of hashing each type are in the serializations to byte streams.

Stephen J. Turnbull wrote:
Glenn Linderman writes:
I think hashes of all types have been randomized, not _just_ the list you mentioned.
Yes. There's only one hash function used, which operates on byte streams IIRC. That function now has a random offset. The details of hashing each type are in the serializations to byte streams.
Could you please elaborate? Numbers are not hashed as byte streams, at least not up to Python 3.5. I am quite familiar with the way hashing of numbers is done in Python 2 & 3. (I had to re-implement this for a project of mine: https://pypi.python.org/pypi/tinyarray/)
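One way to see that numbers cannot simply be hashed as byte streams: numeric values that compare equal must hash equal, whatever their type and in-memory representation. A short sketch, assuming Python 3 and standard-library types only; the helper name is invented for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Different types and representations, same numeric value.
ones = [1, 1.0, True, Fraction(1, 1), Decimal(1), complex(1, 0)]
halves = [0.5, Fraction(1, 2), Decimal("0.5")]


def all_same_hash(values):
    """True if every value in the list has the same hash."""
    return len({hash(v) for v in values}) == 1


print(all_same_hash(ones))
print(all_same_hash(halves))
# Equal keys with equal hashes collapse into a single dict entry.
print(len({1: "int", 1.0: "float", True: "bool"}))
```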

Christoph Groth writes:
Stephen J. Turnbull wrote:
Yes. There's only one hash function used, which operates on byte streams IIRC. That function now has a random offset. The details of hashing each type are in the serializations to byte streams.
Could you please elaborate? Numbers are not hashed as byte streams,
Just a stupid mistake on my part. Should have reviewed the code first. I'll shut up now, take my fly meds, get some sleep, drink coffee in the morning, and then take an axe to my keyboard. :-( Steve

On 02/16/2016 09:22 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
I think hashes of all types have been randomized, not _just_ the list you mentioned.
Yes. There's only one hash function used, which operates on byte streams IIRC. That function now has a random offset. The details of hashing each type are in the serializations to byte streams.
Both these statements are wrong. int objects have their own hash algorithm, built in to long_hash() in Objects/longobject.c. The hash of an int is the value of the int, unless it's -1 or doesn't fit into the native type. And ints don't participate in hash randomization. //arry/
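The int behaviour can be checked against a small model. This sketch assumes CPython 3, where ints that do not fit are reduced modulo `sys.hash_info.modulus`; `expected_int_hash` is a hypothetical helper written for this illustration, not a function from the standard library.

```python
import sys

M = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds


def expected_int_hash(n):
    """Model of the int hash: reduce |n| modulo M, restore the sign, never return -1."""
    h = abs(n) % M
    if n < 0:
        h = -h
    return -2 if h == -1 else h


samples = [0, 42, -42, M - 1, M, M + 1, 10**30, -(10**30)]
for n in samples:
    print(n, hash(n), hash(n) == expected_int_hash(n))
```

No seed appears anywhere in the model, which is consistent with ints not taking part in hash randomization.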

On Thu, Feb 18, 2016 at 12:29 AM, Larry Hastings <larry@hastings.org> wrote:
int objects have their own hash algorithm, built in to long_hash() in Objects/longobject.c. The hash of an int is the value of the int, unless it's -1 or doesn't fit into the native type.
Can someone elaborate on this special case, please? I can see the code there, but there's no comment. Is there some value in not hashing to -1? ChrisA

On 02/17/2016 08:49 AM, Chris Angelico wrote:
On Thu, Feb 18, 2016 at 12:29 AM, Larry Hastings <larry@hastings.org> wrote:
int objects have their own hash algorithm, built in to long_hash() in Objects/longobject.c. The hash of an int is the value of the int, unless it's -1 or doesn't fit into the native type.
Can someone elaborate on this special case, please? I can see the code there, but there's no comment. Is there some value in not hashing to -1?
Returning -1 indicates an error / exception. So hash functions never return -1 as a hash value. //arry/
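The reserved value is visible from Python as well. A brief sketch, assuming CPython; the class name is made up for illustration. A value that would hash to -1 comes back as -2 instead, including when a user-defined `__hash__` returns -1.

```python
class ReturnsMinusOne:
    """A __hash__ that returns the value reserved for errors at the C level."""

    def __hash__(self):
        return -1


# -1 and -2 share a hash, because -1 is never returned as a hash value.
print(hash(-1), hash(-2))
print(hash(-1.0))
# CPython replaces the -1 returned by __hash__ with -2.
print(hash(ReturnsMinusOne()))
```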
participants (8): Chris Angelico, Christoph Groth, Glenn Linderman, Larry Hastings, Maciej Fijalkowski, Shell Xu, Stephen J. Turnbull, Steven D'Aprano