
On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
It does make your argument invalid though, since it's based on this assumption that I was asking for a requirement on iteration order (e.g. like dict's iteration order = insertion order guarantee), which is not the case.
Yoni, I think this answer is disingenious.
I don't think it is disingenuous. There are just a lot of people talking past each other and not quite understanding what each person means because there is confusion about even the intended meaning of terms like "deterministic". I will expand here with enough detail that we should hopefully be able to avoid misunderstanding each other. There are probably other places where you could find mentions of this in the docs but I just took a quick look in the Python 3.5 docs (before hash randomisation) to find this mention of dictionary iteration order: https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects What it says is """ Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. """ The key point is the use of the term "non-random" which here is intended to mean that although no particular ordering is guaranteed you can expect to rerun the same program and get the same result deterministically. A different version or implementation of Python might give a different order but rerunning the same program twice without changing anything should give the same result even if that result depends in some way on the iteration order of some dictionaries. I can't immediately find a similar statement about sets but in practice the same behaviour applied to sets as well. Note carefully that it is this *narrow* form of determinism that Yoni is interested in. Of course there are some caveats to this and the obvious one is that this statement does not apply if there are some objects that use identity based hashing so this is not deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) Running this gives: $ python3.5 t.py A(2) A(1) $ python3.5 t.py A(1) A(2) On the other hand if all of the hashes themselves are deterministic then the program as a whole will be as well so this is deterministic: class A: def __init__(self, data): self.data = data def __repr__(self): return 'A(%s)' % self.data def __hash__(self): return hash(self.data) def __eq__(self): return self.data == other.data a1 = A(1) a2 = A(2) for a in {a1, a2}: print(a) $ python3.5 t.py A(1) A(2) $ python3.5 t.py A(1) A(2) So we have two classes of hashable objects: 1. Those with deterministic hash 2. Those with non-deterministic hash A program that avoids depending on the iteration order of sets or dicts containing objects with non-deterministic hash could be deterministic. It is not the case that the program would depend on the iteration order for its *correctness* but just that the behaviour of the program is *reproducible* which is useful in various ways e.g.: - You could say to someone else "run this code with CPython 3.5 and you should be able to reproduce exactly what I see when I run the program". It is common practice e.g. in scientific programming to record things like random seeds so that someone else can precisely reproduce the results shown in a paper or some other work and this in general requires that it is at least possible to make everything deterministic. - When debugging it is useful to be able to reproduce an error condition precisely. Debugging non-deterministic failures can be extremely difficult. In the same way that you might want to reproduce correctly functioning code it is also very useful to be able to reproduce bugs. I can list more examples but really it shouldn't be necessary to justify from first principles why determinism in programming is usually a good thing. There can be reasons sometimes why determinism is undesired or cannot or should not be guaranteed. It should not be controversial though to say that all things being equal determinism is usually a desirable feature and should be preferred by default. I don't think that the 3.5 docs I quoted above used the words "non-random" casually: it was an intended feature and people were aware that breaking that behaviour would be problematic in many situations. Of course in Python 3.6 this determinism was broken with the introduction of hash randomisation for strings. It was considered that for security purposes it would be better to have some internal non-deterministic behaviour to guard against attackers. Specifically the hashes of three types (str, bytes and datetime) were made non-deterministic between subsequent CPython processes. The effect was not only to change purely internal state though but also the observable iteration order of dicts and sets which became non-deterministic from one run of CPython to another. It was anticipated at the time that this might be problematic in some situations (it certainly was!) and so an environment variable PYTHONHASHSEED was introduced in order to restore determinism for cases that needed it. So now if you want to have reproducible behaviour with Python 3.6+ you also need to fix PYTHONHASHSEED as well as avoiding the use of other types of non-deterministically hashable objects in sets and dicts. This is something that I have personally used mainly for the reproducibility of rare bugs e.g. "to reproduce this you should run the following Python code using commit abc123 under CPython 3.6 and with PYTHONHASHSEED=1234". Subsequently in Python 3.7 dict iteration order was changed to make it always deterministic by having it not depend on the hash values at all, with the order depending on the order of insertions into the dict instead. This introduced a *stronger* guarantee of determinism: now the ordering could be expected to be reproducible even with different versions and implementations of Python. For many this seemed to have resolved the problems of undefined, implementation-defined etc ordering. However this only applied to dicts and not sets and as of Python 3.12 any issues about deterministic ordering still remain wherever sets are used. The introduction of hash randomisation means that since Python 3.6 there are now three classes of hashable objects: 1. Objects with deterministic hash (int etc) 2. Objects with non-deterministic hash that can be controlled with PYTHONHASHSEED (bytes, str and datetime). 3. Objects with non-deterministic hash that cannot be controlled (id-based hashing). The question in this thread and others is which of these three classes None should belong to. Although None is just one particular value it is a very commonly used value and its non-deterministic hash can creep through to affect the hash of larger data structures that use recursive hash calls (e.g. a tuple of tuples of ... that somewhere contains None). Also certain situations such as a type hint like Optional[T] as referred to by the OP necessarily use None: $ python -c 'from typing import Optional; print(hash(Optional[int]))' -7631587549837930667 $ python -c 'from typing import Optional; print(hash(Optional[int]))' -6488475102642892491 Somehow this affects frozen dataclasses but I haven't used those myself so I won't demonstrate how to reproduce the problem with them. Here is a survey of types from builtins: NoneType bool bytearray bytes complex dict ellipsis enumerate filter float frozenset int list map memoryview object range reversed set slice str tuple type zip We can divide these into the three classes described above plus the non-hashable types (I haven't checked this in the code, but just experimenting with calling hash): 1. Deterministic hash: bool, complex, float, frozenset, int, range, tuple 2. Hash depends on hash seed: str, bytes 3. Hash depends on id: NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip 4. Non-hashable: bytearray, dict, list, set, slice The question here is whether None belongs in class 3 or class 1. To me it seems clear that there is no advantage in having None be in class 3 except perhaps to save a few simple lines of code: https://github.com/python/cpython/pull/99541/files There is however a worthwhile advantage in having None be in class 1. If None had a deterministic hash then tuples, frozensets etc consisting of objects with None as well as other objects with deterministic hash could all have deterministic hash. The behaviour of iteration order for sets would then be deterministic in the *narrow* sense that is referred to by the Python 3.5 docs above. Some have argued that the fact that some types have a seed dependent hash implies that None should not have a deterministic hash but this does not follow. It was known at the time that str and bytes were moved from class 1 to class 2 that it would be problematic to do so which is precisely why PYTHONHASHSEED was introduced. However PYTHONHASHSEED does not help here because None is not even in class 2 but rather class 3. If I was going to label any claim made by anyone in these threads as disingenuous then it would be the claim that None is somehow not "special". Firstly many types have deterministic hash so it isn't really that much of a special property. Secondly None is *clearly* a special value that is used everywhere! When I open the CPython interpreter there are already thousands of references to None before I have even done anything:
import sys sys.getrefcount(None) 4429
The motivation for this and other threads is to bring determinism in the *narrow* sense. Others (including me) have made references to other kinds of determinism that have derailed the threads by misunderstanding exactly what Yoni is referring to. The *stronger* sense of determinism would be useful if possible but it is not the intended topic of these threads. -- Oscar