[Python-Dev] Re: A proposal to modify `None` so that it hashes to a constant

Nov. 30, 2022

      On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
...
On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
...
It does make your argument invalid though, since it's based on this
assumption that I was asking for a requirement on iteration order
(e.g. like dict's iteration order = insertion order guarantee), which
is not the case.
Yoni, I think this answer is disingenious.
I don't think it is disingenuous. There are just a lot of people
talking past each other and not quite understanding what each person
means because there is confusion about even the intended meaning of
terms like "deterministic". I will expand here with enough detail that
we should hopefully be able to avoid misunderstanding each other.

There are probably other places where you could find mentions of this
in the docs but I just took a quick look in the Python 3.5 docs
(before hash randomisation) to find this mention of dictionary
iteration order:
https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects

What it says is
"""
Keys and values are iterated over in an arbitrary order which is
non-random, varies across Python implementations, and depends on the
dictionary’s history of insertions and deletions.
"""
The key point is the use of the term "non-random" which here is
intended to mean that although no particular ordering is guaranteed
you can expect to rerun the same program and get the same result
deterministically. A different version or implementation of Python
might give a different order but rerunning the same program twice
without changing anything should give the same result even if that
result depends in some way on the iteration order of some
dictionaries. I can't immediately find a similar statement about sets
but in practice the same behaviour applied to sets as well. Note
carefully that it is this *narrow* form of determinism that Yoni is
interested in.

Of course there are some caveats to this and the obvious one is that
this statement does not apply if there are some objects that use
identity based hashing so this is not deterministic:

class A:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return 'A(%s)' % self.data

a1 = A(1)
a2 = A(2)

for a in {a1, a2}:
    print(a)

Running this gives:

$ python3.5 t.py
A(2)
A(1)
$ python3.5 t.py
A(1)
A(2)

On the other hand if all of the hashes themselves are deterministic
then the program as a whole will be as well so this is deterministic:

class A:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return 'A(%s)' % self.data
    def __hash__(self):
        return hash(self.data)
    def __eq__(self):
        return self.data == other.data

a1 = A(1)
a2 = A(2)

for a in {a1, a2}:
    print(a)

$ python3.5 t.py
A(1)
A(2)
$ python3.5 t.py
A(1)
A(2)

So we have two classes of hashable objects:

1. Those with deterministic hash
2. Those with non-deterministic hash

A program that avoids depending on the iteration order of sets or
dicts containing objects with non-deterministic hash could be
deterministic. It is not the case that the program would depend on the
iteration order for its *correctness* but just that the behaviour of
the program is *reproducible* which is useful in various ways e.g.:

- You could say to someone else "run this code with CPython 3.5 and
you should be able to reproduce exactly what I see when I run the
program". It is common practice e.g. in scientific programming to
record things like random seeds so that someone else can precisely
reproduce the results shown in a paper or some other work and this in
general requires that it is at least possible to make everything
deterministic.

- When debugging it is useful to be able to reproduce an error
condition precisely. Debugging non-deterministic failures can be
extremely difficult. In the same way that you might want to reproduce
correctly functioning code it is also very useful to be able to
reproduce bugs.

I can list more examples but really it shouldn't be necessary to
justify from first principles why determinism in programming is
usually a good thing. There can be reasons sometimes why determinism
is undesired or cannot or should not be guaranteed. It should not be
controversial though to say that all things being equal determinism is
usually a desirable feature and should be preferred by default. I
don't think that the 3.5 docs I quoted above used the words
"non-random" casually: it was an intended feature and people were
aware that breaking that behaviour would be problematic in many
situations.

Of course in Python 3.6 this determinism was broken with the
introduction of hash randomisation for strings. It was considered that
for security purposes it would be better to have some internal
non-deterministic behaviour to guard against attackers. Specifically
the hashes of three types (str, bytes and datetime) were made
non-deterministic between subsequent CPython processes. The effect was
not only to change purely internal state though but also the
observable iteration order of dicts and sets which became
non-deterministic from one run of CPython to another. It was
anticipated at the time that this might be problematic in some
situations (it certainly was!) and so an environment variable
PYTHONHASHSEED was introduced in order to restore determinism for
cases that needed it.

So now if you want to have reproducible behaviour with Python 3.6+ you
also need to fix PYTHONHASHSEED as well as avoiding the use of other
types of non-deterministically hashable objects in sets and dicts.
This is something that I have personally used mainly for the
reproducibility of rare bugs e.g. "to reproduce this you should run
the following Python code using commit abc123 under CPython 3.6 and
with PYTHONHASHSEED=1234".

Subsequently in Python 3.7 dict iteration order was changed to make it
always deterministic by having it not depend on the hash values at
all, with the order depending on the order of insertions into the dict
instead. This introduced a *stronger* guarantee of determinism: now
the ordering could be expected to be reproducible even with different
versions and implementations of Python. For many this seemed to have
resolved the problems of undefined, implementation-defined etc
ordering. However this only applied to dicts and not sets and as of
Python 3.12 any issues about deterministic ordering still remain
wherever sets are used.

The introduction of hash randomisation means that since Python 3.6
there are now three classes of hashable objects:

1. Objects with deterministic hash (int etc)
2. Objects with non-deterministic hash that can be controlled with
PYTHONHASHSEED (bytes, str and datetime).
3. Objects with non-deterministic hash that cannot be controlled
(id-based hashing).

The question in this thread and others is which of these three classes
None should belong to. Although None is just one particular value it
is a very commonly used value and its non-deterministic hash can creep
through to affect the hash of larger data structures that use
recursive hash calls (e.g. a tuple of tuples of ... that somewhere
contains None). Also certain situations such as a type hint like
Optional[T] as referred to by the OP necessarily use None:

$ python -c 'from typing import Optional; print(hash(Optional[int]))'
-7631587549837930667
$ python -c 'from typing import Optional; print(hash(Optional[int]))'
-6488475102642892491

Somehow this affects frozen dataclasses but I haven't used those
myself so I won't demonstrate how to reproduce the problem with them.

Here is a survey of types from builtins:

NoneType bool bytearray bytes complex dict ellipsis enumerate filter
float frozenset int list map memoryview object range reversed set
slice str tuple type zip

We can divide these into the three classes described above plus the
non-hashable types (I haven't checked this in the code, but just
experimenting with calling hash):

1. Deterministic hash:
bool, complex, float, frozenset, int, range, tuple

2. Hash depends on hash seed:
str, bytes

3. Hash depends on id:
NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip

4. Non-hashable:
bytearray, dict, list, set, slice

The question here is whether None belongs in class 3 or class 1. To me
it seems clear that there is no advantage in having None be in class 3
except perhaps to save a few simple lines of code:
https://github.com/python/cpython/pull/99541/files

There is however a worthwhile advantage in having None be in class 1.
If None had a deterministic hash then tuples, frozensets etc
consisting of objects with None as well as other objects with
deterministic hash could all have deterministic hash. The behaviour of
iteration order for sets would then be deterministic in the *narrow*
sense that is referred to by the Python 3.5 docs above.

Some have argued that the fact that some types have a seed dependent
hash implies that None should not have a deterministic hash but this
does not follow. It was known at the time that str and bytes were
moved from class 1 to class 2 that it would be problematic to do so
which is precisely why PYTHONHASHSEED was introduced. However
PYTHONHASHSEED does not help here because None is not even in class 2
but rather class 3.

If I was going to label any claim made by anyone in these threads as
disingenuous then it would be the claim that None is somehow not
"special". Firstly many types have deterministic hash so it isn't
really that much of a special property. Secondly None is *clearly* a
special value that is used everywhere! When I open the CPython
interpreter there are already thousands of references to None before I
have even done anything:
...
...
...
import sys
sys.getrefcount(None)
  4429
The motivation for this and other threads is to bring determinism in
the *narrow* sense. Others (including me) have made references to
other kinds of determinism that have derailed the threads by
misunderstanding exactly what Yoni is referring to. The *stronger*
sense of determinism would be useful if possible but it is not the
intended topic of these threads.

--
Oscar

[Python-Dev] Re: A proposal to modify `None` so that it hashes to a constant

Oscar Benjamin