[Python-Dev] Dataclasses and correct hashability

Sat Feb 3 15:05:07 EST 2018

On Fri, Feb 2, 2018 at 10:25 PM Nick Coghlan <ncoghlan at gmail.com> wrote:

>
>
> On 3 Feb. 2018 1:09 am, "Eric V. Smith" <eric at trueblade.com> wrote:
>
>
> The problem with dropping hash=True is: how would you write __hash__
> yourself? It seems like a bug magnet if you're adding fields to the class
> and forget to update __hash__, especially in the presence of per-field
> hash=False and eq=False settings. And you'd need to make sure it matches
> the generated __eq__ (if 2 objects are equal, they need to have the same
> hash value).
>
>
> I think anyone that does this needs to think *very* carefully about how
> they do it, and offering both "hash=True" and "frozen=True" is an
> attractive nuisance that means people will write "hash=True" when what they
> wanted was "frozen=True".
>

> In particular, having to work out how write a maintainable "__hash__" will
> encourage folks to separate out the hashed fields as a separate frozen
> subrecord or base class.
>
> If this proves to be an intolerable burden then the short hand spelling
> could be added back in 3.8, but once we ship it we're going to be stuck
> with explaining the interactions indefinitely.
>

+1 Nick put words to my chief concerns.

It is easy for an author see hash=True in existing code somewhere (cargo
culting) and assume it does what they want, or quickly glance at the the
API and see "hash=True" without actually taking the time to understand the
implications of that to see that the parameter named "frozen" is the one
they are supposed to want that _safely_ makes their dataclass properly
hashable, not the more attractive parameter named "hash" that enables
dangerous behavior.

Forcing people who need a __hash__ method to write it explicitly sounds
like a good thing to me.  I am not at all worried about someone forgetting
to add a new field to an implementation of the __hash__ method when adding
a new field, the fields and __hash__ method are all defined in the same
place in the code.

I expect someone with a common need for always having a __hash__ method
will produce a library on top of dataclasses that implements something like
our current hash=True behavior.  If that kind of thing turns out to be
widely used, we can reintroduce the feature in dataclasses in 3.8 or later,
informed by what we see practical uses actually doing.

In my practical experience, people writing Python code do not need to learn
and understand the intricacies of what it means to have a __hash__ method,
be hashable, or "frozen".  We intentionally warn people against writing
dunder methods other than __init__ in their code as they are often power
features with less obvious semantics than it may seem at first glance
making such code harder to maintain.

Even calling the parameter "hash=" and saying it adds a __hash__ method as
the PEP currently does seems to launder the danger, washing away the
"dunder smell" that adding a special considerations __hash__ method carries.

The PEP (and presumably forthcoming dataclasses module documentation) says
"This is a specialized use case and should be considered carefully" which I
agree with.  But any time we suggest that in an API, how about having the
API name make it clear that this is special and not to be done lightly?  I
guess i'm arguing against using "hash=" as the arg name in favor of
"danger_there_be_vorpal_rabbits_hash_me_maybe=" or something more usefully
similar if we're going to have it.

-gps
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20180203/2d57e832/attachment.html>