A proposal to modify `None` so that it hashes to a constant
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
I wrote a doc stating my case here: https://docs.google.com/document/d/1et5x5HckTJhUQsz2lcC1avQrgDufXFnHMin7GlI5... Briefly, 1. The main motivation for it is to allow users to get a predictable result on a given input (for programs that are doing pure compute, in domains like operations research / compilation), any time they run their program. Having stable repro is important for debugging. Notebooks with statistical analysis are another similar case where this is needed: you might want other people to run your notebook and get the same result you did. 2. The reason the hash non-determinism of None matters in practice is that it can infect commonly used mapping key types, such as frozen dataclasses containing `Optional[int]` fields. 3. Non-determinism emerging from other value types like `str` can be disabled by the user using `PYTHONHASHSEED`, but there's no such protection against `None`. All it takes is for your program to compute a set somewhere with affected keys, and iterate on it - and determinism is lost. The need to modify None itself is caused by two factors - `Optional` being implemented effectively as `T | None` in Python as a strongly established practice - The fact that `__hash__` is an intrinsic property of a type in Python, the hashing function cannot be externally supplied to its builtin container types. So we have to modify the type None itself, rather than write some alternative hasher that we could use if we care about deterministic behavior across runs. This was debated at length over the forum and in discord. I also posted a PR for it, and it was closed, see: https://github.com/python/cpython/issues/99540 https://github.com/python/cpython/pull/99541 Asking for opinions, and to re-open the PR, provided there is enough support for such a change to take place.
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Sun, Nov 27, 2022 at 11:36 AM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
But the hash of an object is not guaranteed to be stable by the language, so I would argue someone expecting that is expected to convert random-access data structures to ones that are consistent when necessary (e.g. sorted lists).
I don't see why the hashing within a dict needs to be consistent as that's not a guarantee we make with Python.
If I remember correctly, PYTHONHASHSEED was added to help folks migrate when we added randomness to hashing as they had accidentally come to expect a consistent iteration order on dictionary keys. I wouldn't take its existence to suggest that PYTHONHASHSEED is meant to make **all** hashing consistent (e.g. people who implement their own __hash__ don't have to follow that expectation).
All it takes is for your program to compute a set somewhere with affected keys, and iterate on it - and determinism is lost.
That's actually by design. Sets are not meant to be deterministic conceptually as they are essentially a bag of stuff. If you want deterministic ordering you should convert it to a list and sort the list.
I personally agree with the arguments made in the issue, so I'm afraid I don't' support making the change as we worked hard to stop people from relying on consistent hashing/iteration from random-access data structures like dict and set. -Brett
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, 29 Nov 2022 at 09:51, Brett Cannon <brett@python.org> wrote:
... we worked hard to stop people from relying on consistent hashing/iteration from random-access data structures like dict and set.
Say what? Who's been working hard to stop people from relying on consistent iteration order for a dict? ChrisA
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
What does "sets are not meant to be deterministic" even mean? Mathematically speaking sets are not meant to be ordered in any particular way but a computational implementation has to have some order and there is no reason to prefer non-deterministic order in general. Actually determinism in a computational context is usually a very valuable feature. I find it hard to see why non-determinism is "by design". Also it isn't usually possible to sort a list containing None: In [9]: sorted([None, 1, 2]) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-9-344383189210> in <module> ----> 1 sorted([None, 1, 2]) TypeError: '<' not supported between instances of 'int' and 'NoneType' It would be useful to have a straight-forward way to sort a set into a deterministic ordering but no such feature exists after the Py3K changes (sorted used to do this in Python 2.x). -- Oscar
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
To stir up some more fire, I would personally be fine with sets having the same ordering guarantees as dicts, *IF* it can be done without performance degradations. So far nobody has come up with a way to ensure that. "Sets weren't meant to be deterministic" sounds like a remnant of the old philosophy, where we said the same about dicts -- until they became deterministic without slowing down, and then everybody loved it. Regarding hash(None), I think that there is something to be said for making that stable, and the arguments against it feel like rationalizations for FUD. We've survived way larger controversies. I also note that hash(()) is apparently stable. On Mon, Nov 28, 2022 at 3:16 PM Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/834c5/834c5b8c3073c47a0888f391ecdd8a1c8e3859b6" alt=""
On 29/11/22 12:51 pm, Guido van Rossum wrote:
I got the impression that there were some internal language reasons to want stable dicts, e.g. so that the class dict passed to __prepare__ preserves the order in which names are assigned in the class body. Are there any such use cases for stable sets? -- Greg
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
Nah, `__prepare__` very much predates stable dicts and that problem was solved differently. On Mon, Nov 28, 2022 at 4:46 PM Greg Ewing <gcewing@snap.net.nz> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 01:34:54PM +1300, Greg Ewing wrote:
Some people wanted order preserving kwargs, I think for web frameworks. There was even a discussion for a while about using OrderedDict for kwargs and leaving dicts unordered. For me, the biggest advantage of preserving input order in dicts is that it is now easier to write doctests using the dict's repr. It would be nice to be able to do the same for sets, but not nice enough to justify making them bigger or slower. -- Steve
data:image/s3,"s3://crabby-images/f3aca/f3aca73bf3f35ba204b73202269569bd49cd2b1e" alt=""
On Mon, Nov 28, 2022 at 6:45 PM Steven D'Aprano <steve@pearwood.info> wrote:
See https://peps.python.org/pep-0468/ (kwargs) and https://peps.python.org/pep-0520/ (class definition body). I re-implemented OrderedDict in C for this purpose. Literally right after I had finished that, Inada-san showed up with his compact dict implementation. Many of us were at the first core sprint at the time and there was a lot of excitement about compact dict. It was merged right away (for 3.6) and there was quick agreement that we could depend on dict insertion ordering internally (for a variety of use cases, IIRC). Thus, suddenly both my PEPs were effectively implemented, so we marked them as approved and moved on. FWIW, making the insertion ordering an official part of the language happened relatively soon afterward, though for 3.7, not 3.6. [1] I'm pretty sure there's a python-dev thread about that. The stdtypes docs were updated [2] soon after, and we finally got around to updating the language [3] a couple years later. -eric [1] https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights [2] https://bugs.python.org/issue33609 [3] https://bugs.python.org/issue39879
data:image/s3,"s3://crabby-images/a5852/a58528fd9ca9a821e9124b95ff002f6460f0f518" alt=""
For what it's worth, as a user of the language I would like sets to behave as much as possible as-if they were basically dicts that map all elements to `()`. That way I'd have to keep one less mental model in my head. I deliberately say 'as-if' because when I'm a user of the language, I don't care how it's implemented. (Just like I don't have to care as a user that we have (at least) two different ways dicts are represented internally.)
data:image/s3,"s3://crabby-images/e1c23/e1c23d411ff6b304943e192120a5724cbc381a3e" alt=""
On 29/11/2022 00:51, Guido van Rossum wrote:
Hi all, hi Guido, Just as a data point, PyPy's sets have been using the same stable ordering like dicts since our introduction of insertion ordered dicts. We have a number of other set optimizations so it's not an apple-to-apple comparison, but still. Cheers, Carl Friedrich
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Mon, Nov 28, 2022 at 11:13:34PM +0000, Oscar Benjamin wrote:
On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
I'm not Brett, so I'm not answering for him, but many people (sometimes including me) misuse "not deterministic" to refer to set order and the old dict order. Of course set order is deterministic, it is determined by the combination of the set implementation, the hashing algorithm, and the history of the set -- all the items that have ever appeared in the set (both those removed and those that remain). Set order is deterministic in the same way that roulette wheels are deterministic. We don't have a good term for this: the items don't appear in random order. But it is not predictable either, except in very special cases. "Arbitrary" is not right either, since that implies we can easily choose whatever set order we want. I think that physicists call this "deterministic chaos" https://www.quora.com/Theoretical-Physics-What-is-deterministic-but-unpredic... (sorry for the quora link) so I guess we might say that set iteration order is "deterministically chaotic" if you want to be precise, but life is too short for that level of pedantry (and that's coming from me, a Grade A Pedant *wink*) so people often describe it as random, arbitrary, or non-deterministic when it's none of those things :-). Maybe we could call set order "pseudo-random". Getting back to the design part, I think that what Brett is trying to get across is not that the chaotic set order is in and of itself a requirement, but that given the other requirements, chaotic set order is currently considered a necessary condition. As I understand it, we could make sets ordered, but only at the cost of space (much more memory) or time (slower) or both. I am sure that Guido is correct that **if** somebody comes up with a fast, efficient ordered set implementation that doesn't perform worse than the current implementation, we will happily swap to giving sets a predictable order, as we did with dicts. (Practicality beats purity -- even if sets are *philosophically* unordered, preserving input order is too useful to give up unless we gain something in return.) But I don't think it is fair or kind to call Brett's argument FUD. At the very least it is uncharitable interpretation of Brett's position.
`sorted()` works fine on homogeneous sets. It is only heterogeneous sets that are a problem, and in practice, that usually means None mixed in with some other type. -- Steve
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Tue, 29 Nov 2022 at 01:33, Steven D'Aprano <steve@pearwood.info> wrote:
Let's split this into two separate questions: 1. Is it *innately* good that set order is non-deterministic? 2. Are there some other reasons why it is good to choose a model that implies *non-deterministic* set order? The answer to 1. is emphatically NO. In fact the question itself is badly posed: why are we even asking about "set order" rather than the benefits of determinism in general? If I want my code to be deterministic then that's just something that I want regardless of whether sets, dicts, floats etc are involved. As for point 2. the fact that sets are currently non-deterministic is actually a relatively new thing in Python. Before hash-randomisation set and dict order *was* deterministic but with an arbitrary order. That was only changed because of a supposed security issue with hash collisions. Prior to that it was well understood that determinism was beneficial (honestly I don't understand why I have to state this point explicitly: determinism is almost always best in our context). Please everyone don't confuse arbitrary order, implementation defined order and non-deterministic order. There is no reason why sets in Python need to have a *non-deterministic* order or at least why there shouldn't be a way to control that. There is no performance penalty in making the order *deterministic*. (If you think that there might be a performance penalty then you haven't understood the suggestion!)
That is of course precisely the context for this thread! -- Oscar
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, 29 Nov 2022 at 13:12, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
To clarify: The hash collision attack is a very real one, but specific to dictionaries of string keys, since there are quite a few ways for an attacker to send a string that gets automatically parsed into such a dictionary (eg web app frameworks where the request parameters are made available as a dictionary). But since that attack surface is *so* specific, randomization of non-string hashes is unimportant. ChrisA
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 02:07:34AM +0000, Oscar Benjamin wrote:
Let's split this into two separate questions:
Let's not. Your first question about non-deterministic set order being "innately good" is a straw man: as we've already discussed, set order is not non-deterministic (except in the informal sense of "hard to predict") and I don't think anyone is arguing in favour of keeping set order unpredictable even if there are faster, more compact, simpler implementations which preserve order. Talking about determinism is just muddying the waters. Sets are deterministic: their order is determinied by the implementation, the set history, and potentially any environmental sources of entropy used in address randomisation. Deterministic does not mean predictable. If we want to get this discussion onto a more useful path, we should start with a security question: The hash of None changes because of address randomisation. Address randomisation is enabled as a security measure. Are there any security consequences of giving None a constant hash? I can't see any, but then I couldn't see the security consequences of predictable string hashes until they were pointed out to me. So it would be really good to have some security experts comment on whether this is safe or not.
why are we even asking about "set order" rather than the benefits of determinism in general?
Because this entire discussion is motivated by the OP who wants consistent set order across multiple runs of his Python application. That's what he needs; having None hash to a constant value is just a means to that end. His sets contain objects whose hashes depend on `Optional[int]`, which means sometimes they include None, and when he runs under an interpreter built with address randomisation, the hash of None can change. Even if we grant None a constant hash, that still does not guarantee consistent set order across runs. At best, we might get such consistent order as an undocumented and changeable implementation detail, until we change the implementation of hashing, or of sets, or of something seemingly unrelated like address randomisation. -- Steve
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
I can't either. I can point out that the complexity attack via hash collisions is not possible on None specifically because it is a singleton, so: 1. There may only be one None in a dict at most, 2. For composite types that depend on None and other things, like say a tuple of string and an optional int, they become as resistant to attack as the same types without None, in this case string and int. 3. Python only explicitly added this security (hash randomization) feature to string and bytes hashes, and if your composite key type depends on those types, it will still be protected regardless of what None does
Not entirely. I do explain in my doc why there is a foundational reason why, once the choice was made to have None represent the `None` of Optional[T], it entered the family of types that you would expect to have deterministic behavior. And as the hashing function is intrinsic to types in Python, it is included in that.
ASLR will not cause any trouble so long as I keep objects with identity based hashing out of my sets. Or at least, any sets I later iterate on and run imperative code with side-effects under the loop. The possibility of disabling ASLR is not a reason to dismiss this change. No one guarantees a user of Python is in a position to make infosec related decisions on the computer system they are working on. Regarding the possibilities of hashes changing for the worse (in this regard) - sure. Anything is possible. Regarding sets - the set implementation is deterministic, and always has been (for decades). Yes, in theory it is possible that a new version or implementation of Python will make sets non-deterministic - shuffle themselves in the background independently from their history of operations, or what have you, and then my change loses all of its value. Like I said, anything is possible. But I think these last two points are essentially FUD. I made my proposal because I believe the FUD scenarios are strongly unlikely. (and even then, at worst we end up with a "practically useless" behavior on None, that can also be freely reverted along with such other changes anyway)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
[Oscar Benjamin]
(If you think that there might be a performance penalty then you haven't understood the suggestion!)
Then I don't understand the question, and I will refrain from participating further in this discussion. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
Looks like it's just miscommunication. There is the original proposal I made, strictly about how None is ought to be hashed, and then there is the separate topic of changing the stability properties of iteration on sets, and whether that can be made with/without a performance regression...
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
It does make your argument invalid though, since it's based on this assumption that I was asking for a requirement on iteration order (e.g. like dict's iteration order = insertion order guarantee), which is not the case. Again, determinism means that given all input data and commands fed to a data structure is the same, it will arrive at the same observable state, any time you start from scratch and replay this workload. In the context of sets, "all input data" includes the hashing function itself, and "observable state" also includes the order in which items will be returned if iterated. Note that there is NO requirement here on what that order might be. Under this definition, sets in Python are deterministic, and _always_ have been. And even outside of Python, there are aren't many cases where people willingly want to use data structures with non deterministic behavior. It usually involves concurrency (in the form of multithreading) and extreme performance requirements. And it's never the "standard" choice even in languages that do offer this. Determinism is generally considered as a valuable property in computation, at least when it is feasible to maintain it.
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
Yoni, I think this answer is disingenious. Over on the Discuss threads, you have made it clear that the primary reason why you want hash(None) to return a constant value is so that set iteration order will be consistent from one run to another. Sure, you may *also* have some philosophical preference for None having a fixed hash, but you have barely mentioned that the same applies to Ellipsis and NotImplemented (and then only in passing), and that's not your driving motivation for this proposal. Let's consider a thought-experiment: suppose we agree to your proposal to make hash(None) return a constant, but at the same time modify the set iteration algorithm so that it starts from a different position each time you iterate, making set iteration order even more unpredictable and deterministically chaotic than it is now. Would that satisfy you? From your posts on Discuss, I don't expect that you will be happy with this outcome. Personally, and I don't expect that this will be a popular opinion, I wish that data structures that don't guarantee their iteration order would actively mix up their iteration order so that people wouldn't be tempted to rely on whatever accidental order they happen to get. Sure, it would cost a little bit of code to implement, but think of the savings in unproductive arguments and discussions like this one :-( It would also break the invariant that `repr(data) == repr(data)` but it is times like this that I feel that it would be worth it. -- Steve
data:image/s3,"s3://crabby-images/a3b9e/a3b9e3c01ce9004917ad5e7689530187eb3ae21c" alt=""
But it wouldn't -- equality of sets doesn't depend on iteration order. And even if you are talking about other types (this *is* a thought experiment), any type in which equality depends on order certainly shouldn't be jiggled in this way. And I think this is relevant -- I understand that it's handy that order be deterministic -- but even if it is made so in the current implementation, it still may not be consistent across different Python versions and implementations. Which is why I would think that relying on determinism of order is a bad idea anyway -- yes, the same computation should yield the same result -- but what does "same" mean? As far as sets go -- "same" is equality, and if you are checking anything else about them (such as iteration order), I think it's a mistake. Which I think is why Guido (or maybe someone else introduced the idea) was talking about order-preserving sets -- THAT would provide a larger benefit, and would change the specification of sets so that you *could* count on it across versions and implementations, like we can for the modern dict. So yes, the OP wasn't asking for that, but I think the point is that what the OP is asking for is not useful enough to do, whereas a larger proposal might be -- but only if it can be done without loss of performance, which, alas, no one knows how to do. Final confusion -- IIUC, even if the hash of None is made constant, you'd still need to turn off string hash munging to get deterministic order for sets with strings in them, yes? Which really seems to make this even less useful. -CHB On Tue, Nov 29, 2022 at 3:48 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Wed, 30 Nov 2022 at 10:48, Steven D'Aprano <steve@pearwood.info> wrote:
Let's consider an equally-likely thought experiment: suppose that every call to set.__iter__() causes a short busy-wait, spinning the CPU for a little while for absolutely no purpose. There's nothing in the docs that mandates performance, so it would be perfectly reasonable for a future version of Python to do this, right? But it's not something that I would consider *likely*. And, in fact, modifying the iteration algorithm to be completely unpredictable would have the same cost: an RNG lookup for no value whatsoever. It's notable that set() is about the only data type where this is possible. You can't randomize dict iteration order (couldn't do that even before it was locked in, because iterating over keys and values was always required to return them in the same order), and a lot of arbitrary-order collections are built on top of dictionaries. So your thought experiment would be special-casing sets and making them unnecessarily difficult to work with, just so you can point to that arbitrariness and say "hah! GOTCHA! You silly fools shouldn't have been depending on iteration order!!".
Which data structures are those? Please enumerate.
Sure, it would cost a little bit of code to implement, but think of the savings in unproductive arguments and discussions like this one :-(
So basically, it's a bit of useless code, a bit of useless CPU usage, all just so you can win an argument. Makes sense. ChrisA
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
I stand by what I said, there is absolutely nothing disingenious about it.
Again, for the n'th time, read what I wrote above: Under this definition, sets in Python are deterministic, and _always_ have been. I was just stating a fact, same as it is in all other languages I know. You are welcome to verify it by the way, if you can. That's why I don't request any change from set. It works fine and always has.
I'd also change other sentinels to hash to a constant if it were up to me to decide, but there is no practical significance to that so I don't bother dying on that hill (especially now, given the mob reaction). That doesn't make my argument inconsistent in any way. Sorry.
I address this already. I said it is a strongly unlikely FUD scenario. Sets have been behaving deterministically for decades now, across all languages. There's a pretty good reason for that, even if this reason doesn't happen to be codified into the requirements of Python. So as I said, I will take my chances. Again it's something I already addressed here:
_Please_ stop reiterating the same argument over and over and pretending I didn't already make a case against it. It doesn't help, just turns this discussion into a shouting match.
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
I don't think it is disingenuous. There are just a lot of people talking past each other and not quite understanding what each person means because there is confusion about even the intended meaning of terms like "deterministic". I will expand here with enough detail that we should hopefully be able to avoid misunderstanding each other. There are probably other places where you could find mentions of this in the docs but I just took a quick look in the Python 3.5 docs (before hash randomisation) to find this mention of dictionary iteration order: https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects What it says is """ Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. """ The key point is the use of the term "non-random" which here is intended to mean that although no particular ordering is guaranteed you can expect to rerun the same program and get the same result deterministically. A different version or implementation of Python might give a different order but rerunning the same program twice without changing anything should give the same result even if that result depends in some way on the iteration order of some dictionaries. I can't immediately find a similar statement about sets but in practice the same behaviour applied to sets as well. Note carefully that it is this *narrow* form of determinism that Yoni is interested in. Of course there are some caveats to this and the obvious one is that this statement does not apply if there are some objects that use identity based hashing so this is not deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) Running this gives: $ python3.5 t.py A(2) A(1) $ python3.5 t.py A(1) A(2) On the other hand if all of the hashes themselves are deterministic then the program as a whole will be as well so this is deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data def __hash__(self): return hash(self.data) def __eq__(self): return self.data == other.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) $ python3.5 t.py A(1) A(2) $ python3.5 t.py A(1) A(2) So we have two classes of hashable objects: 1. Those with deterministic hash 2. Those with non-deterministic hash A program that avoids depending on the iteration order of sets or dicts containing objects with non-deterministic hash could be deterministic. It is not the case that the program would depend on the iteration order for its *correctness* but just that the behaviour of the program is *reproducible* which is useful in various ways e.g.: - You could say to someone else "run this code with CPython 3.5 and you should be able to reproduce exactly what I see when I run the program". It is common practice e.g. in scientific programming to record things like random seeds so that someone else can precisely reproduce the results shown in a paper or some other work and this in general requires that it is at least possible to make everything deterministic. - When debugging it is useful to be able to reproduce an error condition precisely. Debugging non-deterministic failures can be extremely difficult. In the same way that you might want to reproduce correctly functioning code it is also very useful to be able to reproduce bugs. I can list more examples but really it shouldn't be necessary to justify from first principles why determinism in programming is usually a good thing. There can be reasons sometimes why determinism is undesired or cannot or should not be guaranteed. It should not be controversial though to say that all things being equal determinism is usually a desirable feature and should be preferred by default. I don't think that the 3.5 docs I quoted above used the words "non-random" casually: it was an intended feature and people were aware that breaking that behaviour would be problematic in many situations. Of course in Python 3.6 this determinism was broken with the introduction of hash randomisation for strings. It was considered that for security purposes it would be better to have some internal non-deterministic behaviour to guard against attackers. Specifically the hashes of three types (str, bytes and datetime) were made non-deterministic between subsequent CPython processes. The effect was not only to change purely internal state though but also the observable iteration order of dicts and sets which became non-deterministic from one run of CPython to another. It was anticipated at the time that this might be problematic in some situations (it certainly was!) and so an environment variable PYTHONHASHSEED was introduced in order to restore determinism for cases that needed it. So now if you want to have reproducible behaviour with Python 3.6+ you also need to fix PYTHONHASHSEED as well as avoiding the use of other types of non-deterministically hashable objects in sets and dicts. This is something that I have personally used mainly for the reproducibility of rare bugs e.g. "to reproduce this you should run the following Python code using commit abc123 under CPython 3.6 and with PYTHONHASHSEED=1234". Subsequently in Python 3.7 dict iteration order was changed to make it always deterministic by having it not depend on the hash values at all, with the order depending on the order of insertions into the dict instead. This introduced a *stronger* guarantee of determinism: now the ordering could be expected to be reproducible even with different versions and implementations of Python. For many this seemed to have resolved the problems of undefined, implementation-defined etc ordering. However this only applied to dicts and not sets and as of Python 3.12 any issues about deterministic ordering still remain wherever sets are used. The introduction of hash randomisation means that since Python 3.6 there are now three classes of hashable objects: 1. Objects with deterministic hash (int etc) 2. Objects with non-deterministic hash that can be controlled with PYTHONHASHSEED (bytes, str and datetime). 3. Objects with non-deterministic hash that cannot be controlled (id-based hashing). The question in this thread and others is which of these three classes None should belong to. Although None is just one particular value it is a very commonly used value and its non-deterministic hash can creep through to affect the hash of larger data structures that use recursive hash calls (e.g. a tuple of tuples of ... that somewhere contains None). Also certain situations such as a type hint like Optional[T] as referred to by the OP necessarily use None: $ python -c 'from typing import Optional; print(hash(Optional[int]))' -7631587549837930667 $ python -c 'from typing import Optional; print(hash(Optional[int]))' -6488475102642892491 Somehow this affects frozen dataclasses but I haven't used those myself so I won't demonstrate how to reproduce the problem with them. Here is a survey of types from builtins: NoneType bool bytearray bytes complex dict ellipsis enumerate filter float frozenset int list map memoryview object range reversed set slice str tuple type zip We can divide these into the three classes described above plus the non-hashable types (I haven't checked this in the code, but just experimenting with calling hash): 1. Deterministic hash: bool, complex, float, frozenset, int, range, tuple 2. Hash depends on hash seed: str, bytes 3. Hash depends on id: NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip 4. Non-hashable: bytearray, dict, list, set, slice The question here is whether None belongs in class 3 or class 1. To me it seems clear that there is no advantage in having None be in class 3 except perhaps to save a few simple lines of code: https://github.com/python/cpython/pull/99541/files There is however a worthwhile advantage in having None be in class 1. If None had a deterministic hash then tuples, frozensets etc consisting of objects with None as well as other objects with deterministic hash could all have deterministic hash. The behaviour of iteration order for sets would then be deterministic in the *narrow* sense that is referred to by the Python 3.5 docs above. Some have argued that the fact that some types have a seed dependent hash implies that None should not have a deterministic hash but this does not follow. It was known at the time that str and bytes were moved from class 1 to class 2 that it would be problematic to do so which is precisely why PYTHONHASHSEED was introduced. However PYTHONHASHSEED does not help here because None is not even in class 2 but rather class 3. If I was going to label any claim made by anyone in these threads as disingenuous then it would be the claim that None is somehow not "special". Firstly many types have deterministic hash so it isn't really that much of a special property. Secondly None is *clearly* a special value that is used everywhere! When I open the CPython interpreter there are already thousands of references to None before I have even done anything:
The motivation for this and other threads is to bring determinism in the *narrow* sense. Others (including me) have made references to other kinds of determinism that have derailed the threads by misunderstanding exactly what Yoni is referring to. The *stronger* sense of determinism would be useful if possible but it is not the intended topic of these threads. -- Oscar
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Tue, Nov 29, 2022 at 12:58 PM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
It does make your argument invalid though,
It makes that single sentence invalid, but the rest of my points still hold, e.g. the language makes no guarantee about hash consistency between executions, set order is not guaranteed, etc. are all still valid points. And as I said earlier, I also agree with the points made in the issue you linked to, so I'm still -0 on this. -Brett
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
the language makes no guarantee about hash consistency between executions
because it's futile in the general case, even if objects were to get a serial `id` and hash by it for example, any change in the number of objects created across all of Python (including its builtin modules and various libraries unrelated to the user code) would make these hashes move. So it's not like it's even possible to require this generally for all objects. None of that makes deterministic structural hashing any less useful in practice, though. Besides, do other languages require it? Is it required for the language to behave in a manner that makes sense? Or maybe you think it's by pure accident that such an overwhelming majority of languages and software libraries implement/use deterministic hashing functions for primitive types or aggregates that consist of such types? I can't figure out if you think it's actually a bad property for the language to have, or really just arguing that it's bad for the sake of it.
set order is not guaranteed Maybe not. In practice it has fully deterministic behavior, always has across all versions of Python since its inception. I don't care about what the order is, only that it's deterministic, and it is.
Rejecting my change because someone can technically get away with breaking this, after 30+ years seems highly suspect. Imagine I came in with 0.5% perf improvement, you would reject it citing that Python's requirements do not mandate that sets have good performance, and also that since they don't, someone else might come in with a change that slows down sets by an arbitrary amount, so there's no reason to believe my change will help at all Yes, if we tried really hard, we could always make the language worse. That's a pretty awful reason to reject the change though.
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Thu, 1 Dec 2022 at 17:26, Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
For the record, Jython DOES use sequential numbers for ids. And it doesn't reuse them even if the objects are disposed of.
IIRC an id is not assigned to an object until one is requested.
So it's not like it's even possible to require this generally for all objects.
Well, I mean, in theory you could require that objects whose hash isn't otherwise defined get given the hash of zero. That doesn't violate any of the actual rules of hashes, but it does make those hashes quite suboptimal :) It's interesting how id() and hash() have opposite requirements (id must return a unique number among concurrently-existing objects, hash must return the same number among comparing-equal objects), yet a hash can be built on an id.
Determinism is usually the easiest option. True randomness takes a lot of effort compared to a deterministic PRNG, hence web servers having true entropy but old game consoles relying on PRNGs. Whether determinism is fundamentally good or fundamentally bad depends heavily on context. Question: Is the metaquestion "is determinism good" deterministic (ie can it be answered entirely from predictable facts), or is it itself entropic? I believe the former, but I'm curious if anyone disagrees! ChrisA
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
Whether determinism is fundamentally good or fundamentally bad depends heavily on context.
Agreed 100%. Unfortunately in Python, you cannot choose your hashing function depending on context. Also, once you've decided to violate determinism somewhere, it's gone. There is no way, in the general case, to bring it back. That's why it's important not to violate it willy-nilly in a manner that cannot even be prevented by users who _want_ their programs to exhibit deterministic behavior.
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Thu, 1 Dec 2022 at 06:56, Chris Angelico <rosuav@gmail.com> wrote:
This also demonstrates a significant reason why None is special: it's a singleton that only compares equal to itself. The reason for using id for hash in other cases is to make different instances have different hashes but there is only ever one instance of None. A singleton class can have a hash function that matches identity based equality without using id: any constant hash function will do. -- Oscar
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Sun, Nov 27, 2022 at 11:36 AM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
But the hash of an object is not guaranteed to be stable by the language, so I would argue someone expecting that is expected to convert random-access data structures to ones that are consistent when necessary (e.g. sorted lists).
I don't see why the hashing within a dict needs to be consistent as that's not a guarantee we make with Python.
If I remember correctly, PYTHONHASHSEED was added to help folks migrate when we added randomness to hashing as they had accidentally come to expect a consistent iteration order on dictionary keys. I wouldn't take its existence to suggest that PYTHONHASHSEED is meant to make **all** hashing consistent (e.g. people who implement their own __hash__ don't have to follow that expectation).
All it takes is for your program to compute a set somewhere with affected keys, and iterate on it - and determinism is lost.
That's actually by design. Sets are not meant to be deterministic conceptually as they are essentially a bag of stuff. If you want deterministic ordering you should convert it to a list and sort the list.
I personally agree with the arguments made in the issue, so I'm afraid I don't' support making the change as we worked hard to stop people from relying on consistent hashing/iteration from random-access data structures like dict and set. -Brett
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, 29 Nov 2022 at 09:51, Brett Cannon <brett@python.org> wrote:
... we worked hard to stop people from relying on consistent hashing/iteration from random-access data structures like dict and set.
Say what? Who's been working hard to stop people from relying on consistent iteration order for a dict? ChrisA
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
What does "sets are not meant to be deterministic" even mean? Mathematically speaking sets are not meant to be ordered in any particular way but a computational implementation has to have some order and there is no reason to prefer non-deterministic order in general. Actually determinism in a computational context is usually a very valuable feature. I find it hard to see why non-determinism is "by design". Also it isn't usually possible to sort a list containing None: In [9]: sorted([None, 1, 2]) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-9-344383189210> in <module> ----> 1 sorted([None, 1, 2]) TypeError: '<' not supported between instances of 'int' and 'NoneType' It would be useful to have a straight-forward way to sort a set into a deterministic ordering but no such feature exists after the Py3K changes (sorted used to do this in Python 2.x). -- Oscar
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
To stir up some more fire, I would personally be fine with sets having the same ordering guarantees as dicts, *IF* it can be done without performance degradations. So far nobody has come up with a way to ensure that. "Sets weren't meant to be deterministic" sounds like a remnant of the old philosophy, where we said the same about dicts -- until they became deterministic without slowing down, and then everybody loved it. Regarding hash(None), I think that there is something to be said for making that stable, and the arguments against it feel like rationalizations for FUD. We've survived way larger controversies. I also note that hash(()) is apparently stable. On Mon, Nov 28, 2022 at 3:16 PM Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/834c5/834c5b8c3073c47a0888f391ecdd8a1c8e3859b6" alt=""
On 29/11/22 12:51 pm, Guido van Rossum wrote:
I got the impression that there were some internal language reasons to want stable dicts, e.g. so that the class dict passed to __prepare__ preserves the order in which names are assigned in the class body. Are there any such use cases for stable sets? -- Greg
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
Nah, `__prepare__` very much predates stable dicts and that problem was solved differently. On Mon, Nov 28, 2022 at 4:46 PM Greg Ewing <gcewing@snap.net.nz> wrote:
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 01:34:54PM +1300, Greg Ewing wrote:
Some people wanted order preserving kwargs, I think for web frameworks. There was even a discussion for a while about using OrderedDict for kwargs and leaving dicts unordered. For me, the biggest advantage of preserving input order in dicts is that it is now easier to write doctests using the dict's repr. It would be nice to be able to do the same for sets, but not nice enough to justify making them bigger or slower. -- Steve
data:image/s3,"s3://crabby-images/f3aca/f3aca73bf3f35ba204b73202269569bd49cd2b1e" alt=""
On Mon, Nov 28, 2022 at 6:45 PM Steven D'Aprano <steve@pearwood.info> wrote:
See https://peps.python.org/pep-0468/ (kwargs) and https://peps.python.org/pep-0520/ (class definition body). I re-implemented OrderedDict in C for this purpose. Literally right after I had finished that, Inada-san showed up with his compact dict implementation. Many of us were at the first core sprint at the time and there was a lot of excitement about compact dict. It was merged right away (for 3.6) and there was quick agreement that we could depend on dict insertion ordering internally (for a variety of use cases, IIRC). Thus, suddenly both my PEPs were effectively implemented, so we marked them as approved and moved on. FWIW, making the insertion ordering an official part of the language happened relatively soon afterward, though for 3.7, not 3.6. [1] I'm pretty sure there's a python-dev thread about that. The stdtypes docs were updated [2] soon after, and we finally got around to updating the language [3] a couple years later. -eric [1] https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights [2] https://bugs.python.org/issue33609 [3] https://bugs.python.org/issue39879
data:image/s3,"s3://crabby-images/a5852/a58528fd9ca9a821e9124b95ff002f6460f0f518" alt=""
For what it's worth, as a user of the language I would like sets to behave as much as possible as-if they were basically dicts that map all elements to `()`. That way I'd have to keep one less mental model in my head. I deliberately say 'as-if' because when I'm a user of the language, I don't care how it's implemented. (Just like I don't have to care as a user that we have (at least) two different ways dicts are represented internally.)
data:image/s3,"s3://crabby-images/e1c23/e1c23d411ff6b304943e192120a5724cbc381a3e" alt=""
On 29/11/2022 00:51, Guido van Rossum wrote:
Hi all, hi Guido, Just as a data point, PyPy's sets have been using the same stable ordering like dicts since our introduction of insertion ordered dicts. We have a number of other set optimizations so it's not an apple-to-apple comparison, but still. Cheers, Carl Friedrich
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Mon, Nov 28, 2022 at 11:13:34PM +0000, Oscar Benjamin wrote:
On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
I'm not Brett, so I'm not answering for him, but many people (sometimes including me) misuse "not deterministic" to refer to set order and the old dict order. Of course set order is deterministic, it is determined by the combination of the set implementation, the hashing algorithm, and the history of the set -- all the items that have ever appeared in the set (both those removed and those that remain). Set order is deterministic in the same way that roulette wheels are deterministic. We don't have a good term for this: the items don't appear in random order. But it is not predictable either, except in very special cases. "Arbitrary" is not right either, since that implies we can easily choose whatever set order we want. I think that physicists call this "deterministic chaos" https://www.quora.com/Theoretical-Physics-What-is-deterministic-but-unpredic... (sorry for the quora link) so I guess we might say that set iteration order is "deterministically chaotic" if you want to be precise, but life is too short for that level of pedantry (and that's coming from me, a Grade A Pedant *wink*) so people often describe it as random, arbitrary, or non-deterministic when it's none of those things :-). Maybe we could call set order "pseudo-random". Getting back to the design part, I think that what Brett is trying to get across is not that the chaotic set order is in and of itself a requirement, but that given the other requirements, chaotic set order is currently considered a necessary condition. As I understand it, we could make sets ordered, but only at the cost of space (much more memory) or time (slower) or both. I am sure that Guido is correct that **if** somebody comes up with a fast, efficient ordered set implementation that doesn't perform worse than the current implementation, we will happily swap to giving sets a predictable order, as we did with dicts. (Practicality beats purity -- even if sets are *philosophically* unordered, preserving input order is too useful to give up unless we gain something in return.) But I don't think it is fair or kind to call Brett's argument FUD. At the very least it is uncharitable interpretation of Brett's position.
`sorted()` works fine on homogeneous sets. It is only heterogeneous sets that are a problem, and in practice, that usually means None mixed in with some other type. -- Steve
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Tue, 29 Nov 2022 at 01:33, Steven D'Aprano <steve@pearwood.info> wrote:
Let's split this into two separate questions: 1. Is it *innately* good that set order is non-deterministic? 2. Are there some other reasons why it is good to choose a model that implies *non-deterministic* set order? The answer to 1. is emphatically NO. In fact the question itself is badly posed: why are we even asking about "set order" rather than the benefits of determinism in general? If I want my code to be deterministic then that's just something that I want regardless of whether sets, dicts, floats etc are involved. As for point 2. the fact that sets are currently non-deterministic is actually a relatively new thing in Python. Before hash-randomisation set and dict order *was* deterministic but with an arbitrary order. That was only changed because of a supposed security issue with hash collisions. Prior to that it was well understood that determinism was beneficial (honestly I don't understand why I have to state this point explicitly: determinism is almost always best in our context). Please everyone don't confuse arbitrary order, implementation defined order and non-deterministic order. There is no reason why sets in Python need to have a *non-deterministic* order or at least why there shouldn't be a way to control that. There is no performance penalty in making the order *deterministic*. (If you think that there might be a performance penalty then you haven't understood the suggestion!)
That is of course precisely the context for this thread! -- Oscar
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Tue, 29 Nov 2022 at 13:12, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
To clarify: The hash collision attack is a very real one, but specific to dictionaries of string keys, since there are quite a few ways for an attacker to send a string that gets automatically parsed into such a dictionary (eg web app frameworks where the request parameters are made available as a dictionary). But since that attack surface is *so* specific, randomization of non-string hashes is unimportant. ChrisA
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 02:07:34AM +0000, Oscar Benjamin wrote:
Let's split this into two separate questions:
Let's not. Your first question about non-deterministic set order being "innately good" is a straw man: as we've already discussed, set order is not non-deterministic (except in the informal sense of "hard to predict") and I don't think anyone is arguing in favour of keeping set order unpredictable even if there are faster, more compact, simpler implementations which preserve order. Talking about determinism is just muddying the waters. Sets are deterministic: their order is determinied by the implementation, the set history, and potentially any environmental sources of entropy used in address randomisation. Deterministic does not mean predictable. If we want to get this discussion onto a more useful path, we should start with a security question: The hash of None changes because of address randomisation. Address randomisation is enabled as a security measure. Are there any security consequences of giving None a constant hash? I can't see any, but then I couldn't see the security consequences of predictable string hashes until they were pointed out to me. So it would be really good to have some security experts comment on whether this is safe or not.
why are we even asking about "set order" rather than the benefits of determinism in general?
Because this entire discussion is motivated by the OP who wants consistent set order across multiple runs of his Python application. That's what he needs; having None hash to a constant value is just a means to that end. His sets contain objects whose hashes depend on `Optional[int]`, which means sometimes they include None, and when he runs under an interpreter built with address randomisation, the hash of None can change. Even if we grant None a constant hash, that still does not guarantee consistent set order across runs. At best, we might get such consistent order as an undocumented and changeable implementation detail, until we change the implementation of hashing, or of sets, or of something seemingly unrelated like address randomisation. -- Steve
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
I can't either. I can point out that the complexity attack via hash collisions is not possible on None specifically because it is a singleton, so: 1. There may only be one None in a dict at most, 2. For composite types that depend on None and other things, like say a tuple of string and an optional int, they become as resistant to attack as the same types without None, in this case string and int. 3. Python only explicitly added this security (hash randomization) feature to string and bytes hashes, and if your composite key type depends on those types, it will still be protected regardless of what None does
Not entirely. I do explain in my doc why there is a foundational reason why, once the choice was made to have None represent the `None` of Optional[T], it entered the family of types that you would expect to have deterministic behavior. And as the hashing function is intrinsic to types in Python, it is included in that.
ASLR will not cause any trouble so long as I keep objects with identity based hashing out of my sets. Or at least, any sets I later iterate on and run imperative code with side-effects under the loop. The possibility of disabling ASLR is not a reason to dismiss this change. No one guarantees a user of Python is in a position to make infosec related decisions on the computer system they are working on. Regarding the possibilities of hashes changing for the worse (in this regard) - sure. Anything is possible. Regarding sets - the set implementation is deterministic, and always has been (for decades). Yes, in theory it is possible that a new version or implementation of Python will make sets non-deterministic - shuffle themselves in the background independently from their history of operations, or what have you, and then my change loses all of its value. Like I said, anything is possible. But I think these last two points are essentially FUD. I made my proposal because I believe the FUD scenarios are strongly unlikely. (and even then, at worst we end up with a "practically useless" behavior on None, that can also be freely reverted along with such other changes anyway)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
[Oscar Benjamin]
(If you think that there might be a performance penalty then you haven't understood the suggestion!)
Then I don't understand the question, and I will refrain from participating further in this discussion. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
Looks like it's just miscommunication. There is the original proposal I made, strictly about how None is ought to be hashed, and then there is the separate topic of changing the stability properties of iteration on sets, and whether that can be made with/without a performance regression...
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
It does make your argument invalid though, since it's based on this assumption that I was asking for a requirement on iteration order (e.g. like dict's iteration order = insertion order guarantee), which is not the case. Again, determinism means that given all input data and commands fed to a data structure is the same, it will arrive at the same observable state, any time you start from scratch and replay this workload. In the context of sets, "all input data" includes the hashing function itself, and "observable state" also includes the order in which items will be returned if iterated. Note that there is NO requirement here on what that order might be. Under this definition, sets in Python are deterministic, and _always_ have been. And even outside of Python, there are aren't many cases where people willingly want to use data structures with non deterministic behavior. It usually involves concurrency (in the form of multithreading) and extreme performance requirements. And it's never the "standard" choice even in languages that do offer this. Determinism is generally considered as a valuable property in computation, at least when it is feasible to maintain it.
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
Yoni, I think this answer is disingenious. Over on the Discuss threads, you have made it clear that the primary reason why you want hash(None) to return a constant value is so that set iteration order will be consistent from one run to another. Sure, you may *also* have some philosophical preference for None having a fixed hash, but you have barely mentioned that the same applies to Ellipsis and NotImplemented (and then only in passing), and that's not your driving motivation for this proposal. Let's consider a thought-experiment: suppose we agree to your proposal to make hash(None) return a constant, but at the same time modify the set iteration algorithm so that it starts from a different position each time you iterate, making set iteration order even more unpredictable and deterministically chaotic than it is now. Would that satisfy you? From your posts on Discuss, I don't expect that you will be happy with this outcome. Personally, and I don't expect that this will be a popular opinion, I wish that data structures that don't guarantee their iteration order would actively mix up their iteration order so that people wouldn't be tempted to rely on whatever accidental order they happen to get. Sure, it would cost a little bit of code to implement, but think of the savings in unproductive arguments and discussions like this one :-( It would also break the invariant that `repr(data) == repr(data)` but it is times like this that I feel that it would be worth it. -- Steve
data:image/s3,"s3://crabby-images/a3b9e/a3b9e3c01ce9004917ad5e7689530187eb3ae21c" alt=""
But it wouldn't -- equality of sets doesn't depend on iteration order. And even if you are talking about other types (this *is* a thought experiment), any type in which equality depends on order certainly shouldn't be jiggled in this way. And I think this is relevant -- I understand that it's handy that order be deterministic -- but even if it is made so in the current implementation, it still may not be consistent across different Python versions and implementations. Which is why I would think that relying on determinism of order is a bad idea anyway -- yes, the same computation should yield the same result -- but what does "same" mean? As far as sets go -- "same" is equality, and if you are checking anything else about them (such as iteration order), I think it's a mistake. Which I think is why Guido (or maybe someone else introduced the idea) was talking about order-preserving sets -- THAT would provide a larger benefit, and would change the specification of sets so that you *could* count on it across versions and implementations, like we can for the modern dict. So yes, the OP wasn't asking for that, but I think the point is that what the OP is asking for is not useful enough to do, whereas a larger proposal might be -- but only if it can be done without loss of performance, which, alas, no one knows how to do. Final confusion -- IIUC, even if the hash of None is made constant, you'd still need to turn off string hash munging to get deterministic order for sets with strings in them, yes? Which really seems to make this even less useful. -CHB On Tue, Nov 29, 2022 at 3:48 PM Steven D'Aprano <steve@pearwood.info> wrote:
-- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Wed, 30 Nov 2022 at 10:48, Steven D'Aprano <steve@pearwood.info> wrote:
Let's consider an equally-likely thought experiment: suppose that every call to set.__iter__() causes a short busy-wait, spinning the CPU for a little while for absolutely no purpose. There's nothing in the docs that mandates performance, so it would be perfectly reasonable for a future version of Python to do this, right? But it's not something that I would consider *likely*. And, in fact, modifying the iteration algorithm to be completely unpredictable would have the same cost: an RNG lookup for no value whatsoever. It's notable that set() is about the only data type where this is possible. You can't randomize dict iteration order (couldn't do that even before it was locked in, because iterating over keys and values was always required to return them in the same order), and a lot of arbitrary-order collections are built on top of dictionaries. So your thought experiment would be special-casing sets and making them unnecessarily difficult to work with, just so you can point to that arbitrariness and say "hah! GOTCHA! You silly fools shouldn't have been depending on iteration order!!".
Which data structures are those? Please enumerate.
Sure, it would cost a little bit of code to implement, but think of the savings in unproductive arguments and discussions like this one :-(
So basically, it's a bit of useless code, a bit of useless CPU usage, all just so you can win an argument. Makes sense. ChrisA
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
I stand by what I said, there is absolutely nothing disingenious about it.
Again, for the n'th time, read what I wrote above: Under this definition, sets in Python are deterministic, and _always_ have been. I was just stating a fact, same as it is in all other languages I know. You are welcome to verify it by the way, if you can. That's why I don't request any change from set. It works fine and always has.
I'd also change other sentinels to hash to a constant if it were up to me to decide, but there is no practical significance to that so I don't bother dying on that hill (especially now, given the mob reaction). That doesn't make my argument inconsistent in any way. Sorry.
I address this already. I said it is a strongly unlikely FUD scenario. Sets have been behaving deterministically for decades now, across all languages. There's a pretty good reason for that, even if this reason doesn't happen to be codified into the requirements of Python. So as I said, I will take my chances. Again it's something I already addressed here:
_Please_ stop reiterating the same argument over and over and pretending I didn't already make a case against it. It doesn't help, just turns this discussion into a shouting match.
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
I don't think it is disingenuous. There are just a lot of people talking past each other and not quite understanding what each person means because there is confusion about even the intended meaning of terms like "deterministic". I will expand here with enough detail that we should hopefully be able to avoid misunderstanding each other. There are probably other places where you could find mentions of this in the docs but I just took a quick look in the Python 3.5 docs (before hash randomisation) to find this mention of dictionary iteration order: https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects What it says is """ Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. """ The key point is the use of the term "non-random" which here is intended to mean that although no particular ordering is guaranteed you can expect to rerun the same program and get the same result deterministically. A different version or implementation of Python might give a different order but rerunning the same program twice without changing anything should give the same result even if that result depends in some way on the iteration order of some dictionaries. I can't immediately find a similar statement about sets but in practice the same behaviour applied to sets as well. Note carefully that it is this *narrow* form of determinism that Yoni is interested in. Of course there are some caveats to this and the obvious one is that this statement does not apply if there are some objects that use identity based hashing so this is not deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) Running this gives: $ python3.5 t.py A(2) A(1) $ python3.5 t.py A(1) A(2) On the other hand if all of the hashes themselves are deterministic then the program as a whole will be as well so this is deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data def __hash__(self): return hash(self.data) def __eq__(self): return self.data == other.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) $ python3.5 t.py A(1) A(2) $ python3.5 t.py A(1) A(2) So we have two classes of hashable objects: 1. Those with deterministic hash 2. Those with non-deterministic hash A program that avoids depending on the iteration order of sets or dicts containing objects with non-deterministic hash could be deterministic. It is not the case that the program would depend on the iteration order for its *correctness* but just that the behaviour of the program is *reproducible* which is useful in various ways e.g.: - You could say to someone else "run this code with CPython 3.5 and you should be able to reproduce exactly what I see when I run the program". It is common practice e.g. in scientific programming to record things like random seeds so that someone else can precisely reproduce the results shown in a paper or some other work and this in general requires that it is at least possible to make everything deterministic. - When debugging it is useful to be able to reproduce an error condition precisely. Debugging non-deterministic failures can be extremely difficult. In the same way that you might want to reproduce correctly functioning code it is also very useful to be able to reproduce bugs. I can list more examples but really it shouldn't be necessary to justify from first principles why determinism in programming is usually a good thing. There can be reasons sometimes why determinism is undesired or cannot or should not be guaranteed. It should not be controversial though to say that all things being equal determinism is usually a desirable feature and should be preferred by default. I don't think that the 3.5 docs I quoted above used the words "non-random" casually: it was an intended feature and people were aware that breaking that behaviour would be problematic in many situations. Of course in Python 3.6 this determinism was broken with the introduction of hash randomisation for strings. It was considered that for security purposes it would be better to have some internal non-deterministic behaviour to guard against attackers. Specifically the hashes of three types (str, bytes and datetime) were made non-deterministic between subsequent CPython processes. The effect was not only to change purely internal state though but also the observable iteration order of dicts and sets which became non-deterministic from one run of CPython to another. It was anticipated at the time that this might be problematic in some situations (it certainly was!) and so an environment variable PYTHONHASHSEED was introduced in order to restore determinism for cases that needed it. So now if you want to have reproducible behaviour with Python 3.6+ you also need to fix PYTHONHASHSEED as well as avoiding the use of other types of non-deterministically hashable objects in sets and dicts. This is something that I have personally used mainly for the reproducibility of rare bugs e.g. "to reproduce this you should run the following Python code using commit abc123 under CPython 3.6 and with PYTHONHASHSEED=1234". Subsequently in Python 3.7 dict iteration order was changed to make it always deterministic by having it not depend on the hash values at all, with the order depending on the order of insertions into the dict instead. This introduced a *stronger* guarantee of determinism: now the ordering could be expected to be reproducible even with different versions and implementations of Python. For many this seemed to have resolved the problems of undefined, implementation-defined etc ordering. However this only applied to dicts and not sets and as of Python 3.12 any issues about deterministic ordering still remain wherever sets are used. The introduction of hash randomisation means that since Python 3.6 there are now three classes of hashable objects: 1. Objects with deterministic hash (int etc) 2. Objects with non-deterministic hash that can be controlled with PYTHONHASHSEED (bytes, str and datetime). 3. Objects with non-deterministic hash that cannot be controlled (id-based hashing). The question in this thread and others is which of these three classes None should belong to. Although None is just one particular value it is a very commonly used value and its non-deterministic hash can creep through to affect the hash of larger data structures that use recursive hash calls (e.g. a tuple of tuples of ... that somewhere contains None). Also certain situations such as a type hint like Optional[T] as referred to by the OP necessarily use None: $ python -c 'from typing import Optional; print(hash(Optional[int]))' -7631587549837930667 $ python -c 'from typing import Optional; print(hash(Optional[int]))' -6488475102642892491 Somehow this affects frozen dataclasses but I haven't used those myself so I won't demonstrate how to reproduce the problem with them. Here is a survey of types from builtins: NoneType bool bytearray bytes complex dict ellipsis enumerate filter float frozenset int list map memoryview object range reversed set slice str tuple type zip We can divide these into the three classes described above plus the non-hashable types (I haven't checked this in the code, but just experimenting with calling hash): 1. Deterministic hash: bool, complex, float, frozenset, int, range, tuple 2. Hash depends on hash seed: str, bytes 3. Hash depends on id: NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip 4. Non-hashable: bytearray, dict, list, set, slice The question here is whether None belongs in class 3 or class 1. To me it seems clear that there is no advantage in having None be in class 3 except perhaps to save a few simple lines of code: https://github.com/python/cpython/pull/99541/files There is however a worthwhile advantage in having None be in class 1. If None had a deterministic hash then tuples, frozensets etc consisting of objects with None as well as other objects with deterministic hash could all have deterministic hash. The behaviour of iteration order for sets would then be deterministic in the *narrow* sense that is referred to by the Python 3.5 docs above. Some have argued that the fact that some types have a seed dependent hash implies that None should not have a deterministic hash but this does not follow. It was known at the time that str and bytes were moved from class 1 to class 2 that it would be problematic to do so which is precisely why PYTHONHASHSEED was introduced. However PYTHONHASHSEED does not help here because None is not even in class 2 but rather class 3. If I was going to label any claim made by anyone in these threads as disingenuous then it would be the claim that None is somehow not "special". Firstly many types have deterministic hash so it isn't really that much of a special property. Secondly None is *clearly* a special value that is used everywhere! When I open the CPython interpreter there are already thousands of references to None before I have even done anything:
The motivation for this and other threads is to bring determinism in the *narrow* sense. Others (including me) have made references to other kinds of determinism that have derailed the threads by misunderstanding exactly what Yoni is referring to. The *stronger* sense of determinism would be useful if possible but it is not the intended topic of these threads. -- Oscar
data:image/s3,"s3://crabby-images/e87f3/e87f3c7c6d92519a9dac18ec14406dd41e3da93d" alt=""
On Tue, Nov 29, 2022 at 12:58 PM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
It does make your argument invalid though,
It makes that single sentence invalid, but the rest of my points still hold, e.g. the language makes no guarantee about hash consistency between executions, set order is not guaranteed, etc. are all still valid points. And as I said earlier, I also agree with the points made in the issue you linked to, so I'm still -0 on this. -Brett
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
the language makes no guarantee about hash consistency between executions
because it's futile in the general case, even if objects were to get a serial `id` and hash by it for example, any change in the number of objects created across all of Python (including its builtin modules and various libraries unrelated to the user code) would make these hashes move. So it's not like it's even possible to require this generally for all objects. None of that makes deterministic structural hashing any less useful in practice, though. Besides, do other languages require it? Is it required for the language to behave in a manner that makes sense? Or maybe you think it's by pure accident that such an overwhelming majority of languages and software libraries implement/use deterministic hashing functions for primitive types or aggregates that consist of such types? I can't figure out if you think it's actually a bad property for the language to have, or really just arguing that it's bad for the sake of it.
set order is not guaranteed Maybe not. In practice it has fully deterministic behavior, always has across all versions of Python since its inception. I don't care about what the order is, only that it's deterministic, and it is.
Rejecting my change because someone can technically get away with breaking this, after 30+ years seems highly suspect. Imagine I came in with 0.5% perf improvement, you would reject it citing that Python's requirements do not mandate that sets have good performance, and also that since they don't, someone else might come in with a change that slows down sets by an arbitrary amount, so there's no reason to believe my change will help at all Yes, if we tried really hard, we could always make the language worse. That's a pretty awful reason to reject the change though.
data:image/s3,"s3://crabby-images/0f8ec/0f8eca326d99e0699073a022a66a77b162e23683" alt=""
On Thu, 1 Dec 2022 at 17:26, Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
For the record, Jython DOES use sequential numbers for ids. And it doesn't reuse them even if the objects are disposed of.
IIRC an id is not assigned to an object until one is requested.
So it's not like it's even possible to require this generally for all objects.
Well, I mean, in theory you could require that objects whose hash isn't otherwise defined get given the hash of zero. That doesn't violate any of the actual rules of hashes, but it does make those hashes quite suboptimal :) It's interesting how id() and hash() have opposite requirements (id must return a unique number among concurrently-existing objects, hash must return the same number among comparing-equal objects), yet a hash can be built on an id.
Determinism is usually the easiest option. True randomness takes a lot of effort compared to a deterministic PRNG, hence web servers having true entropy but old game consoles relying on PRNGs. Whether determinism is fundamentally good or fundamentally bad depends heavily on context. Question: Is the metaquestion "is determinism good" deterministic (ie can it be answered entirely from predictable facts), or is it itself entropic? I believe the former, but I'm curious if anyone disagrees! ChrisA
data:image/s3,"s3://crabby-images/fa3da/fa3da0f8140aae6d69fc127fbc1af03b30fd44a4" alt=""
Whether determinism is fundamentally good or fundamentally bad depends heavily on context.
Agreed 100%. Unfortunately in Python, you cannot choose your hashing function depending on context. Also, once you've decided to violate determinism somewhere, it's gone. There is no way, in the general case, to bring it back. That's why it's important not to violate it willy-nilly in a manner that cannot even be prevented by users who _want_ their programs to exhibit deterministic behavior.
data:image/s3,"s3://crabby-images/5f8b2/5f8b2ad1b2b61ef91eb396773cce6ee17c3a4eca" alt=""
On Thu, 1 Dec 2022 at 06:56, Chris Angelico <rosuav@gmail.com> wrote:
This also demonstrates a significant reason why None is special: it's a singleton that only compares equal to itself. The reason for using id for hash in other cases is to make different instances have different hashes but there is only ever one instance of None. A singleton class can have a hash function that matches identity based equality without using id: any constant hash function will do. -- Oscar
participants (11)
-
Brett Cannon
-
Carl Friedrich Bolz-Tereick
-
Chris Angelico
-
Christopher Barker
-
Eric Snow
-
Greg Ewing
-
Guido van Rossum
-
Matthias Görgens
-
Oscar Benjamin
-
Steven D'Aprano
-
Yoni Lavi