set arbitrary hash random seed to ensure reproducible results
Hi, I am wondering if it would be good to add an additional keyword `seed` to the builtin function *hash* to allow us to set arbitrary seed to ensure reproducible results. As far as I know, there exists already the environment variable PYTHONHASHSEED that allows us to set arbitrary seed or disable the seed globally for the python interpreter. However, it looks like that it would be too bold to use that environment variable to change the default behavior because the random seed generation helps improve the security my reducing the risk of hash flooding. In parallel, we have identified a couple of real use cases that require that an arbitrary seed is used for a limited scope. For instance, if we create a caching programming interface that relies on a distributed kv store, it would be very important to make sure that the hash key stays the same when the application is rebooted or replicated. It is generally more cautious to use the above capability to limit the scope to the caching library itself instead of applying the same for all the hash functions of all the python interpreters. WDYT? Kind regards, Hao
On Sat, Dec 18, 2021 at 1:21 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
Hi,
I am wondering if it would be good to add an additional keyword `seed` to the builtin function *hash* to allow us to set arbitrary seed to ensure reproducible results.
The built-in hash() function is extremely generic, so it can't really work that way. Adding a parameter to it would require (a) adding the parameter to every __hash__ method of every object, including user-defined objects; and (b) defining what that would mean when multiple objects' hashes are combined (eg hashing a tuple).
As far as I know, there exists already the environment variable PYTHONHASHSEED that allows us to set arbitrary seed or disable the seed globally for the python interpreter. However, it looks like that it would be too bold to use that environment variable to change the default behavior because the random seed generation helps improve the security my reducing the risk of hash flooding.
In parallel, we have identified a couple of real use cases that require that an arbitrary seed is used for a limited scope. For instance, if we create a caching programming interface that relies on a distributed kv store, it would be very important to make sure that the hash key stays the same when the application is rebooted or replicated. It is generally more cautious to use the above capability to limit the scope to the caching library itself instead of applying the same for all the hash functions of all the python interpreters.
For that sort of thing, it may be more practical to use your own hashing function, possibly a cryptographically secure one. The precise hashing function used by Python isn't guaranteed, so if you need it to be stable across different runs, and especially if you need to seed it in a specific way, I'd recommend hashlib: https://docs.python.org/3/library/hashlib.html ChrisA
On 17 Dec 2021, at 15:28, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Dec 18, 2021 at 1:21 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
Hi,
I am wondering if it would be good to add an additional keyword `seed` to the builtin function *hash* to allow us to set arbitrary seed to ensure reproducible results.
The built-in hash() function is extremely generic, so it can't really work that way. Adding a parameter to it would require (a) adding the parameter to every __hash__ method of every object, including user-defined objects; and (b) defining what that would mean when multiple objects' hashes are combined (eg hashing a tuple).
I would not say the opposite, however maybe it appears to be more complicated than it is really is. Probably it is worth a small analysis?
As far as I know, there exists already the environment variable PYTHONHASHSEED that allows us to set arbitrary seed or disable the seed globally for the python interpreter. However, it looks like that it would be too bold to use that environment variable to change the default behavior because the random seed generation helps improve the security my reducing the risk of hash flooding.
In parallel, we have identified a couple of real use cases that require that an arbitrary seed is used for a limited scope. For instance, if we create a caching programming interface that relies on a distributed kv store, it would be very important to make sure that the hash key stays the same when the application is rebooted or replicated. It is generally more cautious to use the above capability to limit the scope to the caching library itself instead of applying the same for all the hash functions of all the python interpreters.
For that sort of thing, it may be more practical to use your own hashing function, possibly a cryptographically secure one. The precise hashing function used by Python isn't guaranteed, so if you need it to be stable across different runs, and especially if you need to seed it in a specific way, I'd recommend hashlib:
I’ve explored that option, however the siphash24 or fnv under the hood of *hash* seems to be more adapted for this type of use cases in terms of *performance*. Otherwise, would that be useful to add siphash24 or fnv in the hashlib as well? There are obviously also other third party libraries such as *mmh*, however that’ll introduce unnecessary immature dependencies. WDYT? Thank you.
ChrisA _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/CU4TI3... Code of Conduct: http://python.org/psf/codeofconduct/
On Sat, Dec 18, 2021 at 1:44 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
For that sort of thing, it may be more practical to use your own hashing function, possibly a cryptographically secure one. The precise hashing function used by Python isn't guaranteed, so if you need it to be stable across different runs, and especially if you need to seed it in a specific way, I'd recommend hashlib:
I’ve explored that option, however the siphash24 or fnv under the hood of *hash* seems to be more adapted for this type of use cases in terms of *performance*. Otherwise, would that be useful to add siphash24 or fnv in the hashlib as well?
That's a more viable option, although maybe it wouldn't even matter. How does hashlib.sha1() performance stack up, and what about a handrolled simple string hashing function in Python? Is performance actually going to be a problem with one of those? ChrisA
On 17 Dec 2021, at 15:49, Chris Angelico <rosuav@gmail.com> wrote:
On Sat, Dec 18, 2021 at 1:44 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
For that sort of thing, it may be more practical to use your own hashing function, possibly a cryptographically secure one. The precise hashing function used by Python isn't guaranteed, so if you need it to be stable across different runs, and especially if you need to seed it in a specific way, I'd recommend hashlib:
I’ve explored that option, however the siphash24 or fnv under the hood of *hash* seems to be more adapted for this type of use cases in terms of *performance*. Otherwise, would that be useful to add siphash24 or fnv in the hashlib as well?
That's a more viable option, although maybe it wouldn't even matter. How does hashlib.sha1() performance stack up, and what about a handrolled simple string hashing function in Python? Is performance actually going to be a problem with one of those?
Great question. I agree that there could be other factors which slow things down much more than the hash function. I assume that this is a function that’ll be potentially called a lot of times, and the cumulated cost won’t be negligible. Maybe for the similar reason, some high performance networking tools adopt the same algorithm <https://www.wireguard.com/protocol/>.
ChrisA _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/LKXP3P... Code of Conduct: http://python.org/psf/codeofconduct/
On Sat, Dec 18, 2021 at 2:21 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
Great question. I agree that there could be other factors which slow things down much more than the hash function. I assume that this is a function that’ll be potentially called a lot of times, and the cumulated cost won’t be negligible. Maybe for the similar reason, some high performance networking tools adopt the same algorithm.
Yup. It's always better to measure than to assume, though :) ChrisA
Hao Hu writes:
On 17 Dec 2021, at 15:28, Chris Angelico <rosuav@gmail.com> wrote:
The built-in hash() function is extremely generic, so it can't really work that way. Adding a parameter to it would require (a) adding the parameter to every __hash__ method of every object, including user-defined objects;
I would not say the opposite, however maybe it appears to be more complicated than it is really is. Probably it is worth a small analysis?
It's the user-defined objects that are the killer here. We don't want to go wrecking dozens of projects' objects.
For instance, if we create a caching programming interface that relies on a distributed kv store,
I would be very suspicious of using Python's hash builtin for such a purpose. The Python hash functions are very carefully tuned for high performance in one application only: equality testing in Python, especially for dicts. Many __hash__ methods omit much of the object being hashed; if the variation in your keys depends only on those attributes, you'll get a lot of collisions. Others are extremely predictable. E.g., most integers and other numbers equal to integers hash to themselves mod 2**61 - 1, I believe -1 is only exception. Being predictable as such may not be a problem for your kv store cache, but predictable == pattern, and if your application happens to match that pattern, you could again end up with a massive collision problem. I imagine this is much less likely to be a problem than the case where keys depend on omitted attributes, since presumably the __hash__ method is designed to cover the whole range. And numbers are the only case I know of offhand.
I'd recommend hashlib:
+1
Otherwise, would that be useful to add siphash24 or fnv in the hashlib as well?
I think that is a good idea. To me, it seems relatively likely to be accepted quickly. However, many cryptographic algorithms are delicate (eg, to avoid timing attacks), so I could be wrong about that. Folks like Christian Heimes might be very concerned about the implementation as well as the algorithm. Note that Python/pyhash.c seems to have implementations of both of these algorithms, although I don't know if these implementations satisfy cryptographic needs. Steve
On 12/18/21 08:44, Stephen J. Turnbull wrote:
Hao Hu writes:
On 17 Dec 2021, at 15:28, Chris Angelico <rosuav@gmail.com> wrote:
The built-in hash() function is extremely generic, so it can't really work that way. Adding a parameter to it would require (a) adding the parameter to every __hash__ method of every object, including user-defined objects;
I would not say the opposite, however maybe it appears to be more complicated than it is really is. Probably it is worth a small analysis?
It's the user-defined objects that are the killer here. We don't want to go wrecking dozens of projects' objects.
For instance, if we create a caching programming interface that relies on a distributed kv store,
I would be very suspicious of using Python's hash builtin for such a purpose. The Python hash functions are very carefully tuned for high performance in one application only: equality testing in Python, especially for dicts. Many __hash__ methods omit much of the object being hashed; if the variation in your keys depends only on those attributes, you'll get a lot of collisions. Others are extremely predictable. E.g., most integers and other numbers equal to integers hash to themselves mod 2**61 - 1, I believe -1 is only exception. Being predictable as such may not be a problem for your kv store cache, but predictable == pattern, and if your application happens to match that pattern, you could again end up with a massive collision problem. I imagine this is much less likely to be a problem than the case where keys depend on omitted attributes, since presumably the __hash__ method is designed to cover the whole range. And numbers are the only case I know of offhand.
It is pretty much the same use case as python's dictionary though, the goal is just to generalize it to use with a distributed kv store. Another big advantage is that it is more user friendly to apply *hash* directly on a type.
I'd recommend hashlib:
+1
Otherwise, would that be useful to add siphash24 or fnv in the hashlib as well?
I think that is a good idea. To me, it seems relatively likely to be accepted quickly. However, many cryptographic algorithms are delicate (eg, to avoid timing attacks), so I could be wrong about that. Folks like Christian Heimes might be very concerned about the implementation as well as the algorithm.
Note that Python/pyhash.c seems to have implementations of both of these algorithms, although I don't know if these implementations satisfy cryptographic needs.
According to the doc, there seems to be 2 categories of hash function. One is for cryptographic purpose, another one is for message authentication code. The algorithms mentioned above could be mostly put into the second category.
Steve
Hao Hu writes:
On 12/18/21 08:44, Stephen J. Turnbull wrote:
Hao Hu writes:
For instance, if we create a caching programming interface that relies on a distributed kv store,
I would be very suspicious of using Python's hash builtin for such a purpose. The Python hash functions are very carefully tuned for high performance in one application only: equality testing in Python, especially for dicts. [...]
It is pretty much the same use case as python's dictionary though, the goal is just to generalize it to use with a distributed kv store.
Sure, you know that because it's your application. But I don't know that, and it's only an example you give to justify a change to Python. The burden on you is not to argue that it works in one application; it's to argue that it works broadly enough to be worth changing a lot stuff in Python, imposing a change burden on any project that implements __hash__ for any of its classes, and for anybody who supports both pre- and post-change version of Python, they need to support both __hash__(object) and __hash__(object, salt) (probably trivial, just def __hash__(self, salt=None):, but I haven't thought about it).
Another big advantage is that it is more user friendly to apply *hash* directly on a type.
Sure, that was the whole point of proposing it and nobody denies it.
On Thu, Dec 23, 2021 at 5:40 PM Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Hao Hu writes:
On 12/18/21 08:44, Stephen J. Turnbull wrote:
Hao Hu writes:
For instance, if we create a caching programming interface that relies on a distributed kv store,
I would be very suspicious of using Python's hash builtin for such a purpose. The Python hash functions are very carefully tuned for high performance in one application only: equality testing in Python, especially for dicts. [...]
It is pretty much the same use case as python's dictionary though, the goal is just to generalize it to use with a distributed kv store.
Sure, you know that because it's your application. But I don't know that, and it's only an example you give to justify a change to Python. The burden on you is not to argue that it works in one application; it's to argue that it works broadly enough to be worth changing a lot stuff in Python, imposing a change burden on any project that implements __hash__ for any of its classes, and for anybody who supports both pre- and post-change version of Python, they need to support both __hash__(object) and __hash__(object, salt) (probably trivial, just def __hash__(self, salt=None):, but I haven't thought about it).
A bit more complicated for anything that builds its hash out of other objects' hashes (eg a tuple), since it would have to avoid calling hash(object, salt) if it was called as __hash__(self, None). Changing the signature of a dunder is generally a pain. Python's hashing function is designed with some extremely specific use-cases in mind. For example, small integers hash to themselves, because this gives good results for dictionaries whose keys are all small integers. That won't be as beneficial if the keyvalue store is distributed (since each node will only have part of the full dictionary), and it also means that the application would be vulnerable to hash collision attacks. As soon as something is networked, the rules change, and I do not see this as a safe choice for a distributed kv store. Exposing the string hashing algorithm *only*, as a convenient and fast way to hash strings, would have some value. Trying to expose, but also control, the overall hash function? Not something I would recommend. ChrisA
On 12/23/21 07:39, Stephen J. Turnbull wrote:
Hao Hu writes:
On 12/18/21 08:44, Stephen J. Turnbull wrote:
Hao Hu writes:
For instance, if we create a caching programming interface that relies on a distributed kv store,
I would be very suspicious of using Python's hash builtin for such a purpose. The Python hash functions are very carefully tuned for high performance in one application only: equality testing in Python, especially for dicts. [...]
It is pretty much the same use case as python's dictionary though, the goal is just to generalize it to use with a distributed kv store.
Sure, you know that because it's your application. But I don't know that, and it's only an example you give to justify a change to Python. The burden on you is not to argue that it works in one application; it's to argue that it works broadly enough to be worth changing a lot stuff in Python, imposing a change burden on any project that implements __hash__ for any of its classes, and for anybody who supports both pre- and post-change version of Python, they need to support both __hash__(object) and __hash__(object, salt) (probably trivial, just def __hash__(self, salt=None):, but I haven't thought about it).
Thanks for thinking about all this. Those are reasonable concerns. - The motivation of having such a discussion is indeed to figure out whether this small requirement could be crafted well enough for python's best interest. - For the example mentioned above, I am wondering whether we could work around it like this: - generalize the signature of __hash__(object) to __hash__(object, *args, **kwargs). In the default implementation, we discard the keyword "salt" and use the default fallback salt if the keyword is not present, otherwise we use the salt specified by the keyword.
Another big advantage is that it is more user friendly to apply *hash* directly on a type.
Sure, that was the whole point of proposing it and nobody denies it. 👍
On Fri, Dec 24, 2021 at 12:14 AM Hao Hu <hao.hu.fr@gmail.com> wrote:
- generalize the signature of __hash__(object) to __hash__(object, *args, **kwargs). In the default implementation, we discard the keyword "salt" and use the default fallback salt if the keyword is not present, otherwise we use the salt specified by the keyword.
That means that EVERY __hash__ implementation that calls hash() MUST pass along all spare arguments. This is a breaking change that will affect a lot of third-party code. Changing a dunder's signature is a very big deal and needs a lot more justification than this. How important *is* it to be able to lock in the seed/salt (they're not the same thing) when hashing arbitrary objects? ChrisA
On Fri, Dec 17, 2021 at 02:07:38PM -0000, Hao Hu wrote:
Hi,
I am wondering if it would be good to add an additional keyword `seed` to the builtin function *hash* to allow us to set arbitrary seed to ensure reproducible results.
I assume you are talking about hashing strings, I believe they are the only objects that are effected by hash randomization.
In parallel, we have identified a couple of real use cases that require that an arbitrary seed is used for a limited scope. For instance, if we create a caching programming interface that relies on a distributed kv store, it would be very important to make sure that the hash key stays the same when the application is rebooted or replicated.
Does your distributed key/value store have to use the built-in hash function? -- Steve
On 17 Dec 2021, at 15:42, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Dec 17, 2021 at 02:07:38PM -0000, Hao Hu wrote:
Hi,
I am wondering if it would be good to add an additional keyword `seed` to the builtin function *hash* to allow us to set arbitrary seed to ensure reproducible results.
I assume you are talking about hashing strings, I believe they are the only objects that are effected by hash randomization.
I am not sure about this, though I agree with you that strings are affected by this.
In parallel, we have identified a couple of real use cases that require that an arbitrary seed is used for a limited scope. For instance, if we create a caching programming interface that relies on a distributed kv store, it would be very important to make sure that the hash key stays the same when the application is rebooted or replicated.
Does your distributed key/value store have to use the built-in hash function?
If you limit the use case strictly to a string and a value, I agree with you that I can rely on the hashing capability of the kv store. However, sometimes I expect something more generic, such as it might be a tuple that mixes numbers, strings and other hashable types at least. Another option might be using serialisation to produce key, but there are 2 things to tackle - serialisation vs hash in terms of speed - a key of dynamic size vs key of fixed size (I tend to believe that fixed size key is more friendly for random access on kv storage side)
-- Steve _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/ZEKR2Y... Code of Conduct: http://python.org/psf/codeofconduct/
participants (4)
-
Chris Angelico
-
Hao Hu
-
Stephen J. Turnbull
-
Steven D'Aprano