Dataclasses and correct hashability
There appears to be a critical omission from the current dataclass implementation: it does not make hash=True fields immutable. Per Python spec: "the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket)" Yet: import dataclasses @dataclasses.dataclass(hash=True) class A: foo: int = dataclasses.field(hash=True, compare=True) a = A(foo=1) s = set() s.add(a) # s == {a} a.foo = 2 print(a in s) print({a} == s} print(s == s) # prints False False True This looks to me like a clearly wrong behavior. Elvis
On 2/1/2018 7:34 PM, Elvis Pranskevichus wrote:
There appears to be a critical omission from the current dataclass implementation: it does not make hash=True fields immutable.
Per Python spec:
"the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket)"
Yet:
import dataclasses
@dataclasses.dataclass(hash=True) class A: foo: int = dataclasses.field(hash=True, compare=True)
a = A(foo=1)
s = set() s.add(a) # s == {a} a.foo = 2
print(a in s) print({a} == s} print(s == s)
# prints False False True
This looks to me like a clearly wrong behavior.
Elvis
Data classes do not protect you from doing the wrong thing. This is the same as writing: class A: def __init__(self, foo): self.foo = foo def __hash__(self): return hash((self.foo,)) You're allowed to override the parameters to dataclasses.dataclass for cases where you know what you're doing. Consenting adults, and all. Eric.
On 2/1/2018 8:17 PM, Eric V. Smith wrote:
On 2/1/2018 7:34 PM, Elvis Pranskevichus wrote:
There appears to be a critical omission from the current dataclass implementation: it does not make hash=True fields immutable.
Per Python spec:
"the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket)"
Yet:
import dataclasses
@dataclasses.dataclass(hash=True) class A: foo: int = dataclasses.field(hash=True, compare=True)
a = A(foo=1)
s = set() s.add(a) # s == {a} a.foo = 2
print(a in s) print({a} == s} print(s == s)
# prints False False True
This looks to me like a clearly wrong behavior.
Elvis
Data classes do not protect you from doing the wrong thing. This is the same as writing:
class A: def __init__(self, foo): self.foo = foo def __hash__(self): return hash((self.foo,))
You're allowed to override the parameters to dataclasses.dataclass for cases where you know what you're doing. Consenting adults, and all.
I should add: This is why you shouldn't override the default (hash=None) unless you know what the consequences are. Can I ask why you want to specify hash=True? Eric
On Thursday, February 1, 2018 8:21:03 PM EST Eric V. Smith wrote:
I should add: This is why you shouldn't override the default (hash=None) unless you know what the consequences are. Can I ask why you want to specify hash=True?
hash=None and compare=True leads to the same result, which, I think is even worse.
You're allowed to override the parameters to dataclasses.dataclass for cases where you know what you're doing. Consenting adults, and all.
I don't agree with this. You are comparing implicit dataclass behavior with an explicit shoot-in-the-foot __hash__() definition. Elvis
On 2/1/2018 8:29 PM, Elvis Pranskevichus wrote:
On Thursday, February 1, 2018 8:21:03 PM EST Eric V. Smith wrote:
I should add: This is why you shouldn't override the default (hash=None) unless you know what the consequences are. Can I ask why you want to specify hash=True?
hash=None and compare=True leads to the same result, which, I think is even worse.
Have you actually tried that?
@dataclass(hash=None) ... class A: ... foo: int = field(hash=True, compare=True) ... hash(A(2)) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unhashable type: 'A'
I believe the default hash=None on the class decorator does right thing. Please provide a counter-example.
You're allowed to override the parameters to dataclasses.dataclass for cases where you know what you're doing. Consenting adults, and all.
I don't agree with this. You are comparing implicit dataclass behavior with an explicit shoot-in-the-foot __hash__() definition.
I don't recommend ever specifying the decorator hash= parameter unless you have an unusual use case, in which case it's on you to get it right. There was recently a long python-dev discussion on this issue. I need to update the PEP to reflect it, but the advice still stands: you almost always want to use the default hash=None. Do you have a use case for specifying a hash function on a class with mutable instances? Maybe you want frozen=True? Eric
On Thursday, February 1, 2018 8:37:41 PM EST Eric V. Smith wrote:
hash=None and compare=True leads to the same result, which, I think is even worse.
Have you actually tried that?
I meant this: @dataclasses.dataclass(hash=True) class A: foo: int = dataclasses.field(compare=True)
I don't recommend ever specifying the decorator hash= parameter unless you have an unusual use case, in which case it's on you to get it right.
In my experience this type of breakage is so subtle that people will happily write code lots of code like this without noticing. My main objection here is that the dataclass does not go far enough to prevent obviously wrong behaviour. Or it goes too far with the whole hash/ frozen distinction. Elvis
On 2 February 2018 at 11:49, Elvis Pranskevichus
In my experience this type of breakage is so subtle that people will happily write code lots of code like this without noticing. My main objection here is that the dataclass does not go far enough to prevent obviously wrong behaviour. Or it goes too far with the whole hash/ frozen distinction.
For 3.7, I think we should seriously considered just straight up disallowing the "hash=True, frozen=False" combination, and instead require folks to provide their own hash function in that case. "Accidentally hashable" (whether by identity or field hash) isn't a thing that data classes should be allowing to happen. If we did that, then the public "hash" parameter could potentially be dropped entirely for the time being - the replacement for "hash=True" would be a "def __hash__: ..." in the body of the class definition, and the replacement for "hash=False" would be "__hash__ = None" in the class body. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Friday, February 2, 2018 12:33:04 AM EST Nick Coghlan wrote:
For 3.7, I think we should seriously considered just straight up disallowing the "hash=True, frozen=False" combination, and instead require folks to provide their own hash function in that case. "Accidentally hashable" (whether by identity or field hash) isn't a thing that data classes should be allowing to happen.
If we did that, then the public "hash" parameter could potentially be dropped entirely for the time being - the replacement for "hash=True" would be a "def __hash__: ..." in the body of the class definition, and the replacement for "hash=False" would be "__hash__ = None" in the class body.
I think "frozen=True" should just imply hashability (by fields). You can always do "__hash__ = None" to opt out. I don't see the default hashability of an immutable object as a problem. tuples and frozensets are hashable after all. Elvis
On 2/2/2018 12:33 AM, Nick Coghlan wrote:
For 3.7, I think we should seriously considered just straight up disallowing the "hash=True, frozen=False" combination, and instead require folks to provide their own hash function in that case. "Accidentally hashable" (whether by identity or field hash) isn't a thing that data classes should be allowing to happen.
If we did that, then the public "hash" parameter could potentially be dropped entirely for the time being - the replacement for "hash=True" would be a "def __hash__: ..." in the body of the class definition, and the replacement for "hash=False" would be "__hash__ = None" in the class body.
attrs has the same behavior (if you ignore how dataclasses handles the cases where __hash__ or __eq__ already exist in the class definition). Here's what attrs says about adding __hash__ via hash=True: "Although not recommended, you can decide for yourself and force attrs to create one (e.g. if the class is immutable even though you didn’t freeze it programmatically) by passing True or not. Both of these cases are rather special and should be used carefully." The problem with dropping hash=True is: how would you write __hash__ yourself? It seems like a bug magnet if you're adding fields to the class and forget to update __hash__, especially in the presence of per-field hash=False and eq=False settings. And you'd need to make sure it matches the generated __eq__ (if 2 objects are equal, they need to have the same hash value). If we're going to start disallowing things, how about the per-field hash=True, eq=False case? However, I don't feel very strongly about this. As I've said, I expect the use cases for hash=True to be very, very rare. And now that we allow overriding __hash__ in the class body without setting hash=False, there aren't a lot of uses for hash=False, either. But we would need to think through how you'd get the behavior of hash=False with multiple inheritance, if that's what you wanted. Again, a very, very rare case. In all, I think we're better off documenting best practices and making them the default, like attrs does, and leave it to the programmer to follow them. I realize we're handing out footguns, the alternatives seem even more complex and are limiting. Eric
On Friday, February 2, 2018 10:08:43 AM EST Eric V. Smith wrote:
However, I don't feel very strongly about this. As I've said, I expect the use cases for hash=True to be very, very rare.
Why do you think that the requirement to make a dataclass hashable is a "very, very rare" requirement? The moment you want to use a dataclass a a dict key, or put it in a set, you need it to be hashable. Just put yourself in the shoes of an average Python developer. You try to put a dataclass in a set, you get a TypeError. Your immediate reaction is to add "hash=True". Things appear to work. Then, you, or someone else, decides to mutate the dataclass object and then you are looking at a very frustrating debug session.
In all, I think we're better off documenting best practices and making them the default, like attrs does, and leave it to the programmer to follow them. I realize we're handing out footguns
I don't think attrs doing the same thing is a valid justification. This is a major footgun that is very easy to trigger, and there's really no precedent in standard data types.
the alternatives seem even more complex and are limiting.
The alternative is simple and follows the design of other standard containers: immutable containers are hashable, mutable containers are not. @dataclass(frozen=False) gives you a SimpleNamespace-like and @dataclass(frozen=True) gives you a namedtuple-like. If you _really_ know what you are doing, then you can always declare an explicit __hash__.
The problem with dropping hash=True is: how would you write __hash__ yourself?
Is "def __hash__(self): return hash((self.field1, self.field2))" that hard? It is explicit, and you made a concious choice, i.e you understand how __hash__ works. IMO, the danger of "@dataclass(hash=True)" far overweighs whatever convenience it might provide. Elvis
On 2/2/2018 10:38 AM, Elvis Pranskevichus wrote:
On Friday, February 2, 2018 10:08:43 AM EST Eric V. Smith wrote:
However, I don't feel very strongly about this. As I've said, I expect the use cases for hash=True to be very, very rare.
Why do you think that the requirement to make a dataclass hashable is a "very, very rare" requirement? The moment you want to use a dataclass a a dict key, or put it in a set, you need it to be hashable.
I was specifically talking about the case of a non-frozen, hashable class. If you want to make a class frozen and hashable, then: @dataclass(frozen=True) will do just that. The case you brought up initially is the non-frozen, hashable class. It's that case that I think is very rare. I'll ask again: what's your use case for wanting a non-frozen, hashable class? I'm genuinely interested. You seem to think that hash=True means "make the class hashable". That's not true. It means "add a __hash__" to this class". There are other, better ways to make the class hashable, specifically frozen=True. You might want to read all of https://bugs.python.org/issue32513 for the background on the current behavior.
Just put yourself in the shoes of an average Python developer. You try to put a dataclass in a set, you get a TypeError. Your immediate reaction is to add "hash=True". Things appear to work. Then, you, or someone else, decides to mutate the dataclass object and then you are looking at a very frustrating debug session.
I will be documented that the correct way to do this is frozen=True.
In all, I think we're better off documenting best practices and making them the default, like attrs does, and leave it to the programmer to follow them. I realize we're handing out footguns
I don't think attrs doing the same thing is a valid justification. This is a major footgun that is very easy to trigger, and there's really no precedent in standard data types.
the alternatives seem even more complex and are limiting.
The alternative is simple and follows the design of other standard containers: immutable containers are hashable, mutable containers are not. @dataclass(frozen=False) gives you a SimpleNamespace-like and @dataclass(frozen=True) gives you a namedtuple-like. If you _really_ know what you are doing, then you can always declare an explicit __hash__.
I'm not sure what you're arguing for here. This is how dataclasses work.
The problem with dropping hash=True is: how would you write __hash__ yourself?
Is "def __hash__(self): return hash((self.field1, self.field2))" that hard? It is explicit, and you made a concious choice, i.e you understand how __hash__ works.
I didn't say it was hard, I said it needed to be kept up to date as you add fields. That is, you'd have to duplicate the field list. dataclasses is trying to prevent you from repeating the field list anywhere.
IMO, the danger of "@dataclass(hash=True)" far overweighs whatever convenience it might provide.
We'll just have to disagree about this. As I said, I don't feel very strongly about it, but I lean toward leaving it in and documenting it for what it is and does. Eric
On Friday, February 2, 2018 10:51:11 AM EST Eric V. Smith wrote:
I was specifically talking about the case of a non-frozen, hashable class. If you want to make a class frozen and hashable, then:
@dataclass(frozen=True)
will do just that.
The case you brought up initially is the non-frozen, hashable class. It's that case that I think is very rare. I'll ask again: what's your use case for wanting a non-frozen, hashable class? I'm genuinely interested.
My point is exactly that there is _no_ valid use case, so (hash=True, frozen=False) should not be a thing! Why are you so insistent on adding a dangerous option which you admit is nearly useless? Elvis
On 2/2/2018 10:56 AM, Elvis Pranskevichus wrote:
On Friday, February 2, 2018 10:51:11 AM EST Eric V. Smith wrote:
I was specifically talking about the case of a non-frozen, hashable class. If you want to make a class frozen and hashable, then:
@dataclass(frozen=True)
will do just that.
The case you brought up initially is the non-frozen, hashable class. It's that case that I think is very rare. I'll ask again: what's your use case for wanting a non-frozen, hashable class? I'm genuinely interested.
My point is exactly that there is _no_ valid use case, so (hash=True, frozen=False) should not be a thing! Why are you so insistent on adding a dangerous option which you admit is nearly useless?
Because it's not the default, it will be documented as being an advanced use case, and it's useful in rare instances. And as I've said a number of times, both here and in other discussions, I'm not arguing strenuously for this. I just think that, given that it's not the default and it's not recommended and is useful in advanced cases, I would prefer to leave it in. I understand that you disagree with me. Eric
Because it's not the default, it will be documented as being an advanced use case, and it's useful in rare instances.
And as I've said a number of times, both here and in other discussions, I'm not arguing strenuously for this. I just think that, given that it's not the default and it's not recommended and is useful in advanced cases, I would prefer to leave it in. I understand that you disagree with me.
Is there a real world example of such an "advanced case"? Eric, have you read https://github.com/python-attrs/attrs/issues/136 ? Specifically this comment from Hynek [1]: "I never really thought about it, but yeah mutable objects shouldn’t have a __hash__ at all." It is clear from that thread that "hash=True" was an early design mistake, which was left in for compatibility reasons. Why are we copying bad design to the standard library? Elvis [1] https://github.com/python-attrs/attrs/issues/ 136#issuecomment-277425421
On 02/02/2018 08:09 AM, Eric V. Smith wrote:
On 2/2/2018 10:56 AM, Elvis Pranskevichus wrote:
My point is exactly that there is _no_ valid use case, so (hash=True, frozen=False) should not be a thing! Why are you so insistent on adding a dangerous option which you admit is nearly useless?
Because it's not the default, it will be documented as being an advanced use case, and it's useful in rare instances.
Personally, I don't think advanced use-cases need to be supported by flags as they can be supported by just writing the __dunder__ methods. -- ~Ethan~
I agree with Ethan, Elvis, and a few others. I think 'hash=True,
frozen=False' should be disabled in 3.7. It's an attractive nuisance.
Maybe not so attractive because its obscurity, but still with no clear
reason to exist.
If many users of of dataclass find themselves defining '__hash__' with
mutable dataclass, it's perfectly possible to allow the switch combination
later. But taking it out after previously allowing it—even if every use in
the wild is actually a bug in waiting—is harder.
On Feb 2, 2018 2:10 PM, "Ethan Furman"
On 02/02/2018 08:09 AM, Eric V. Smith wrote:
On 2/2/2018 10:56 AM, Elvis Pranskevichus wrote:
My point is exactly that there is _no_ valid use case, so (hash=True,
frozen=False) should not be a thing! Why are you so insistent on adding a dangerous option which you admit is nearly useless?
Because it's not the default, it will be documented as being an advanced use case, and it's useful in rare instances.
Personally, I don't think advanced use-cases need to be supported by flags as they can be supported by just writing the __dunder__ methods.
-- ~Ethan~ _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/mertz% 40gnosis.cx
IMO, the danger of "@dataclass(hash=True)" far overweighs whatever convenience it might provide.
Is there any reason specifying has=True could set frozen=True unless the user specifically sets frozen=False? Or is that already the case? I think the folks that are concerned about this issue are quite right — most Python users equate immutable and hashable—so the dataclass API should reflect that. And this would still make it easy and clear to specify the unusual (and arguably dangerous) case of: hash=True, frozen=False -CHB
Looks like this is turning into a major flamewar regardless of what I say. :-( I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do. I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it. Here's my use case. A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external. My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy. I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger. -- --Guido van Rossum (python.org/~guido)
+1 using unsafe_hash as a name addresses my concern. It's a good signal
that there are caveats worth considering.
-gps
On Sun, Feb 4, 2018, 9:50 PM Guido van Rossum
Looks like this is turning into a major flamewar regardless of what I say. :-(
I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do.
I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it.
Here's my use case.
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
-- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/greg%40krypto.org
On Sun, Feb 4, 2018 at 11:57 PM, Gregory P. Smith
+1 using unsafe_hash as a name addresses my concern.
mine too -- anyone surprised by using this deserves what they get :-)
-CHB
On Sun, Feb 4, 2018, 9:50 PM Guido van Rossum
Looks like this is turning into a major flamewar regardless of what I say. :-(
I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do.
I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it.
Here's my use case.
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
-- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ greg%40krypto.org
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ chris.barker%40noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Absolutely I agree. 'unsafe_hash' as a name is clear warning to users.
On Feb 4, 2018 10:43 PM, "Chris Barker"
+1 using unsafe_hash as a name addresses my concern.
mine too -- anyone surprised by using this deserves what they get :-)
-CHB
On Sun, Feb 4, 2018, 9:50 PM Guido van Rossum
Looks like this is turning into a major flamewar regardless of what I say. :-(
I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do.
I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it.
Here's my use case.
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
-- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/greg% 40krypto.org
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/chris. barker%40noaa.gov
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ mertz%40gnosis.cx
On Mon, Feb 05, 2018 at 10:50:21AM -0800, David Mertz wrote:
Absolutely I agree. 'unsafe_hash' as a name is clear warning to users.
(I don't mean to pick on David specifically, I had to reply to some message in this thread and I just picked his.) I'm rather gobsmacked at the attitudes of many people here about hashing data classes. I thought *I* was the cynical pessimist who didn't have a high opinion of the quality of the average programmer, but according to this thread apparently I'm positively Pollyanna-esque for believing that most people will realise that if an API offers separate switches for hashable and frozen, you need to set *both* if you want both. Greg Smith even says that writing dunders apart from __init__ is a code smell, and warns people not to write dunders. Seriously? I get that __hash__ is hard to write correctly, which is why we have a hash=True to do the hard work for us, but I can't help feeling that at the point we're saying "don't write dunders, any dunder, you'll only do it wrong" we have crossed over to the wrong side of the pessimist/optimist line. But here we are: talking about naming a perfectly reasonable argument "unsafe_hash". Why are we trying to frighten people? There is nothing unsafe about a DataClass with hash=True, frozen=True, but this scheme means that even people who know what they're doing will write unsafe_hash=True, frozen=True, as if hashability was some sort of hand grenade waiting to go off. Perhaps we ought to deprecate __hash__ and start calling it __danger_danger_hash__ too? No, I don't think so. In the past, we've (rightly!) rejected proposals to call things like eval "unsafe_eval", and that really is dangerously unsafe when used naively with untrusted, unsanitised data. Hashing mutable objects by accident might be annoyingly difficult and frustrating to debug, but code injection attacks can lead to identity theft and worse, serious consequences for real people. I'm 100% in favour of programmer education, but I think this label is *miseducation*. We're suggesting that hashability is unsafe, regardless of whether the object is frozen or not. I'd far prefer to get a runtime warning: "Are you sure you want hash=True without frozen=True?" (or words to that extent) rather than burden all uses of the hash parameter, good and bad, with the unsafe label. -- Steve
Where do you get the impression that one would have to explicitly request
__hash__ if frozen=True is set? To the contrary, my proposal is for
@dataclass to automatically add a __hash__ method when frozen=True is set.
This is what the code currently released as 3.7.0b1 does if hash=None (the
default).
On Tue, Feb 6, 2018 at 9:26 AM, Steven D'Aprano
On Mon, Feb 05, 2018 at 10:50:21AM -0800, David Mertz wrote:
Absolutely I agree. 'unsafe_hash' as a name is clear warning to users.
(I don't mean to pick on David specifically, I had to reply to some message in this thread and I just picked his.)
I'm rather gobsmacked at the attitudes of many people here about hashing data classes. I thought *I* was the cynical pessimist who didn't have a high opinion of the quality of the average programmer, but according to this thread apparently I'm positively Pollyanna-esque for believing that most people will realise that if an API offers separate switches for hashable and frozen, you need to set *both* if you want both.
Greg Smith even says that writing dunders apart from __init__ is a code smell, and warns people not to write dunders. Seriously? I get that __hash__ is hard to write correctly, which is why we have a hash=True to do the hard work for us, but I can't help feeling that at the point we're saying "don't write dunders, any dunder, you'll only do it wrong" we have crossed over to the wrong side of the pessimist/optimist line.
But here we are: talking about naming a perfectly reasonable argument "unsafe_hash". Why are we trying to frighten people?
There is nothing unsafe about a DataClass with hash=True, frozen=True, but this scheme means that even people who know what they're doing will write unsafe_hash=True, frozen=True, as if hashability was some sort of hand grenade waiting to go off.
Perhaps we ought to deprecate __hash__ and start calling it __danger_danger_hash__ too? No, I don't think so.
In the past, we've (rightly!) rejected proposals to call things like eval "unsafe_eval", and that really is dangerously unsafe when used naively with untrusted, unsanitised data. Hashing mutable objects by accident might be annoyingly difficult and frustrating to debug, but code injection attacks can lead to identity theft and worse, serious consequences for real people.
I'm 100% in favour of programmer education, but I think this label is *miseducation*. We're suggesting that hashability is unsafe, regardless of whether the object is frozen or not.
I'd far prefer to get a runtime warning:
"Are you sure you want hash=True without frozen=True?"
(or words to that extent) rather than burden all uses of the hash parameter, good and bad, with the unsafe label.
-- Steve _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ guido%40python.org
-- --Guido van Rossum (python.org/~guido)
On 02/06/2018 09:38 AM, Guido van Rossum wrote:
Where do you get the impression that one would have to explicitly request __hash__ if frozen=True is set? To the contrary, my proposal is for @dataclass to automatically add a __hash__ method when frozen=True is set. This is what the code currently released as 3.7.0b1 does if hash=None (the default).
Which is my issue with the naming -- although, really, it's more with the parameter/argument: in a hand-written class, __hash__ = None means the object in is not hashable, but with the decorator: @dataclass(..., hash=None, ...) it means something else. My preference for "fixing" the issue: 1) make the default be a custom object (not None), so that `hash=None` means disable hashing 2) change the param name -- maybe to `add_hash` (I agree with D'Aprano that `unsafe_hash` can be misleading) -- ~Ethan~
We may be in violent agreement.
I propose *not* to add a way to *disable* hashing when the rest of the
flags to @dataclass() would indicate that it's safe to add a __hash__
method.
I propose that with @dataclass(unsafe_hash=False) (the default), a __hash__
method is added when the following conditions are true:
- frozen=True (not the default)
- compare=True (the default)
- no __hash__ method is defined in the class
If we instead use @dataclass(unsafe_hash=True), a __hash__ will be added
regardless of the other flags, but if a __hash__ method is present, an
exception is raised.
Other values (e.g. unsafe_hash=None) are illegal for this flag.
Note that the the hash= flag to the field() function is unchanged from
what's currently in PEP 557 or in the implementation in 3.7.0b1. In
particular, the generated __hash__ method will disregard fields declared
using field(hash=False). It will also disregard fields declared using
field(compare=False, hash=False|None).
On Tue, Feb 6, 2018 at 10:11 AM, Ethan Furman
On 02/06/2018 09:38 AM, Guido van Rossum wrote:
Where do you get the impression that one would have to explicitly request
__hash__ if frozen=True is set? To the contrary, my proposal is for @dataclass to automatically add a __hash__ method when frozen=True is set. This is what the code currently released as 3.7.0b1 does if hash=None (the default).
Which is my issue with the naming -- although, really, it's more with the parameter/argument: in a hand-written class,
__hash__ = None
means the object in is not hashable, but with the decorator:
@dataclass(..., hash=None, ...)
it means something else.
My preference for "fixing" the issue:
1) make the default be a custom object (not None), so that `hash=None` means disable hashing
2) change the param name -- maybe to `add_hash` (I agree with D'Aprano that `unsafe_hash` can be misleading)
-- ~Ethan~
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido% 40python.org
-- --Guido van Rossum (python.org/~guido)
On 02/06/2018 11:18 AM, Guido van Rossum wrote:
We may be in violent agreement.
I propose *not* to add a way to *disable* hashing when the rest of the flags to @dataclass() would indicate that it's safe to add a __hash__ method.
Okay.
I propose that with @dataclass(unsafe_hash=False) (the default), a __hash__ method is added when the following conditions are true:
- frozen=True (not the default) - compare=True (the default) - no __hash__ method is defined in the class
If we instead use @dataclass(unsafe_hash=True), a __hash__ will be added regardless of the other flags, but if a __hash__ method is present, an exception is raised.
Other values (e.g. unsafe_hash=None) are illegal for this flag.
Ah! Excellent, that greatly allays my worries.
Note that the the hash= flag to the field() function is unchanged from what's currently in PEP 557 or in the implementation in 3.7.0b1. In particular, the generated __hash__ method will disregard fields declared using field(hash=False). It will also disregard fields declared using field(compare=False, hash=False|None).
It sounds like `unsafe_hash=True` indicates a truly unsafe hash (that is, mutable data is involved in the hash calculation), but there still seems to be one possibility for an "unsafe_hash" to actually be safe -- that is, if only immutable fields are used in __eq__, then dataclass could safely generate a hash for us. Do we have a way to know if the equality fields are hashable? I suppose we could check each one for a for a non-None __hash__. Then we could modify that first condition from - frozen=True to - frozen=True or all(getattr(eq_fld, '__hash__', None) is not None for eq_field in equality_fields) Thoughts? On a different note, should the PEP be updated with the current signature? It still talks about hash=None being the default. -- ~Ethan~
On Tue, Feb 6, 2018 at 11:40 AM, Ethan Furman
It sounds like `unsafe_hash=True` indicates a truly unsafe hash (that is, mutable data is involved in the hash calculation), but there still seems to be one possibility for an "unsafe_hash" to actually be safe -- that is, if only immutable fields are used in __eq__, then dataclass could safely generate a hash for us.
Do we have a way to know if the equality fields are hashable? I suppose we could check each one for a for a non-None __hash__. Then we could modify that first condition from
- frozen=True
to
- frozen=True or all(getattr(eq_fld, '__hash__', None) is not None for eq_field in equality_fields)
There seems to be a misunderstanding underlying these questions. Even if all fields have an immutable type (e.g. all ints, supporting __eq__ and __hash__), if the containing class isn't frozen, they can be assigned to. E.g. @dataclass() class Point: x: int y: int p = Point(1, 1) p.x = 2 # This is legal The only way to make that assignment to p.x illegal is to make the *class* frozen (using @dataclass(frozen=True)) -- nothing we can do about the *field* will change this. Of course if you use @dataclass(frozen=True, unsafe_hash=True) you may still get a safe hash. :-) -- --Guido van Rossum (python.org/~guido)
On 02/06/2018 12:24 PM, Guido van Rossum wrote:
On Tue, Feb 6, 2018 at 11:40 AM, Ethan Furman wrote:
It sounds like `unsafe_hash=True` indicates a truly unsafe hash (that is, mutable data is involved in the hash calculation), but there still seems to be one possibility for an "unsafe_hash" to actually be safe -- that is, if only immutable fields are used in __eq__, then dataclass could safely generate a hash for us.
Do we have a way to know if the equality fields are hashable? I suppose we could check each one for a for a non-None __hash__. Then we could modify that first condition from
- frozen=True
to
- frozen=True or all(getattr(eq_fld, '__hash__', None) is not None for eq_field in equality_fields)
There seems to be a misunderstanding underlying these questions. Even if all fields have an immutable type (e.g. all ints, supporting __eq__ and __hash__), if the containing class isn't frozen, they can be assigned to. E.g.
@dataclass() class Point: x: int y: int
p = Point(1, 1) p.x = 2 # This is legal
The only way to make that assignment to p.x illegal is to make the *class* frozen (using @dataclass(frozen=True)) -- nothing we can do about the *field* will change this.
Oh, right. When I was thinking this I thought a field could be frozen individually, didn't find the option at the field level when I checked the PEP, and then promptly forgot and suggested it anyway. Although, couldn't we add a field-level frozen attribute (using property for the implementation), and check that all equality fields are properties as well as hashable? -- ~Ethan~
On Tue, Feb 6, 2018 at 12:44 PM, Ethan Furman
Although, couldn't we add a field-level frozen attribute (using property for the implementation), and check that all equality fields are properties as well as hashable?
That would be a totally unrelated feature request. Let's wait whether it's actually needed a lot before designing it. -- --Guido van Rossum (python.org/~guido)
Honestly, the name I would most want for the keyword argument is '_hash'.
That carries the semantics I desire.
On Feb 6, 2018 10:13 AM, "Ethan Furman"
On 02/06/2018 09:38 AM, Guido van Rossum wrote:
Where do you get the impression that one would have to explicitly request
__hash__ if frozen=True is set? To the contrary, my proposal is for @dataclass to automatically add a __hash__ method when frozen=True is set. This is what the code currently released as 3.7.0b1 does if hash=None (the default).
Which is my issue with the naming -- although, really, it's more with the parameter/argument: in a hand-written class,
__hash__ = None
means the object in is not hashable, but with the decorator:
@dataclass(..., hash=None, ...)
it means something else.
My preference for "fixing" the issue:
1) make the default be a custom object (not None), so that `hash=None` means disable hashing
2) change the param name -- maybe to `add_hash` (I agree with D'Aprano that `unsafe_hash` can be misleading)
-- ~Ethan~ _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/mertz% 40gnosis.cx
That's much less self-descriptive and harder to search Google or
StackOverflow for. It's also easier to overlook. We really want to send the
signal that this is unsafe and requires serious consideration before it is
turned on.
On Tue, Feb 6, 2018 at 11:57 AM, David Mertz
Honestly, the name I would most want for the keyword argument is '_hash'. That carries the semantics I desire.
On Feb 6, 2018 10:13 AM, "Ethan Furman"
wrote: On 02/06/2018 09:38 AM, Guido van Rossum wrote:
Where do you get the impression that one would have to explicitly request
__hash__ if frozen=True is set? To the contrary, my proposal is for @dataclass to automatically add a __hash__ method when frozen=True is set. This is what the code currently released as 3.7.0b1 does if hash=None (the default).
Which is my issue with the naming -- although, really, it's more with the parameter/argument: in a hand-written class,
__hash__ = None
means the object in is not hashable, but with the decorator:
@dataclass(..., hash=None, ...)
it means something else.
My preference for "fixing" the issue:
1) make the default be a custom object (not None), so that `hash=None` means disable hashing
2) change the param name -- maybe to `add_hash` (I agree with D'Aprano that `unsafe_hash` can be misleading)
-- ~Ethan~ _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/mertz%40g nosis.cx
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ guido%40python.org
-- --Guido van Rossum (python.org/~guido)
On 2/4/2018 9:49 PM, Guido van Rossum wrote:
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
This is an interesting use case. I haven't got the internals knowledge to know just how just different mutable and immutable classes and objects are under the hood. But this use case makes me wonder if, even at the cost of some performance that "normal" immutable classes and objects might obtain, if it would be possible to use the various undisciplined initialization patterns as desired, followed by as declaration "This OBJECT is now immutable" which would calculate its HASH value, and prevent future mutations of the object? Yes, I'm aware that the decision for immutability has historically been done at the class level, not the object level, but in my ignorance of the internals, I wonder if that is necessary, for performance or more importantly, for other reasons. And perhaps the implementation is internally almost like two classes, one mutable, and the other immutable, and the declaration would convert the object from one to the other. But if I say more, I'd just be babbling.
On Sun, Feb 4, 2018 at 11:28 PM, Glenn Linderman
This is an interesting use case. I haven't got the internals knowledge to know just how just different mutable and immutable classes and objects are under the hood. But this use case makes me wonder if, even at the cost of some performance that "normal" immutable classes and objects might obtain, if it would be possible to use the various undisciplined initialization patterns as desired, followed by as declaration "This OBJECT is now immutable" which would calculate its HASH value, and prevent future mutations of the object?
It would be technically possible to support something like @dataclass(freezable=True) class Foo: blah: int foo = Foo() # Initially, object is mutable, and hash(foo) raises an error foo.blah = 1 assertRaises(hash, foo) # This method is automatically generated for classes with freezable=True foo.freeze() # Now object is immutable, and hash(foo) is allowed assertRaises(foo.__setattr__, "blah", 2) hash(foo) I don't know if it's worth the complexity, but I guess it would cover at least some of the use cases Guido raised. -n -- Nathaniel J. Smith -- https://vorpus.org
On 2/5/2018 12:11 AM, Nathaniel Smith wrote:
On Sun, Feb 4, 2018 at 11:28 PM, Glenn Linderman
wrote: This is an interesting use case. I haven't got the internals knowledge to know just how just different mutable and immutable classes and objects are under the hood. But this use case makes me wonder if, even at the cost of some performance that "normal" immutable classes and objects might obtain, if it would be possible to use the various undisciplined initialization patterns as desired, followed by as declaration "This OBJECT is now immutable" which would calculate its HASH value, and prevent future mutations of the object? It would be technically possible to support something like
@dataclass(freezable=True) class Foo: blah: int
foo = Foo() # Initially, object is mutable, and hash(foo) raises an error foo.blah = 1 assertRaises(hash, foo)
# This method is automatically generated for classes with freezable=True foo.freeze()
# Now object is immutable, and hash(foo) is allowed assertRaises(foo.__setattr__, "blah", 2) hash(foo)
I don't know if it's worth the complexity, but I guess it would cover at least some of the use cases Guido raised.
-n
Thanks, Nathaniel, for confirming that what I was suggesting is not impossible, even if it turns out to be undesirable for some reason, or unwanted by anyone else. But I have encountered a subset of the use cases Guido mentioned, and had to make a 2nd class to gather/hold the values of the eventual immutable class, before I could make it, because pieces of the data for the class values were obtained from different sources at different times. Once all collected, then the immutability could be obtained, the rest of the processing performed. Thrashes the allocator pretty well doing it that way, but the job got done.
On 2/5/2018 2:28 AM, Glenn Linderman wrote:
This is an interesting use case. I haven't got the internals knowledge to know just how just different mutable and immutable classes and objects are under the hood.
I believe there is no internal difference. An object is immutable if there is not way to mutate it with Python code that not poke into internals, such as one can do with ctypes or 3rd party extensions. Numbers and strings have no mutation methods, including no .__init__. A tuple is a fixed sequence of objects and has no .__init__. But if any object in a tuple is mutable, then the tuple is. But the tuple does not know its status, and there is no 'is_mutable' function. However, tuple.__hash__ calls the .__hash__ method of each object and if that is missing for one, tuple.__hash raises.
hash((1, 'a', [])) Traceback (most recent call last): File "
", line 1, in <module> hash((1, 'a', [])) TypeError: unhashable type: 'list'
The built-in immutable objects are mutated from their initial blank values in the C code of their .__new__ methods. So they are only 'immutable' once constructed. Guido pointed out that users constructing objects in Python code might reasonably do so other than only with .__new__, but still want to treat the object as immutable once constructed. In Lisp, for instance, lists are actually trees. To be immutable, they can only be singly linked and must be constructed from leaf nodes to the root (or head). Python programmers should be able to link in both directions and start from the root, and still treat the result as frozen and hashable.
But this use case makes me wonder if, even at the cost of some performance that "normal" immutable classes and objects might obtain, if it would be possible to use the various undisciplined initialization patterns as desired, followed by as declaration "This OBJECT is now immutable" which would calculate its HASH value, and prevent future mutations of the object?
Something like this has been proposed, at least for dicts, and rejected. -- Terry Jan Reedy
On 5 February 2018 at 15:49, Guido van Rossum
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
Aye, calling the flag unsafe_hash would convert me from -1 to -0. The remaining -0 is because I think there's a different and more robust way to tackle your example use case: # Mutable initialization phase >>> from dataclasses import dataclass >>> @dataclass ... class Example: ... a: int ... b: int ... >>> c = Example(None, None) >>> c Example(a=None, b=None) >>> c.a = 1 >>> c.b = 2 >>> c Example(a=1, b=2) # Frozen usage phase >>> @dataclass(frozen=True) ... class LockedExample(Example): ... pass ... >>> c.__class__ = LockedExample >>> c.a = 1 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ncoghlan/devel/cpython/Lib/dataclasses.py", line 448, in _frozen_setattr raise FrozenInstanceError(f'cannot assign to field {name!r}') dataclasses.FrozenInstanceError: cannot assign to field 'a' >>> c.b = 2 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ncoghlan/devel/cpython/Lib/dataclasses.py", line 448, in _frozen_setattr raise FrozenInstanceError(f'cannot assign to field {name!r}') dataclasses.FrozenInstanceError: cannot assign to field 'b' >>> hash(c) 3713081631934410656 The gist of that approach is to assume that there will be *somewhere* in the code where it's possible to declare the construction of the instance "complete", and flip the nominal class over to the frozen subclass to make further mutation unlikely, even though the true underlying type is still the mutable version. That said, if we do provide "unsafe_hash", then the documentation for that flag becomes a place where we can explicitly suggest using a frozen subclass instead. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I'm sorry, but a solution that requires __class__ assignment is way too
fragile for my taste.
On Mon, Feb 5, 2018 at 6:28 AM, Nick Coghlan
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse
On 5 February 2018 at 15:49, Guido van Rossum
wrote: tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
Aye, calling the flag unsafe_hash would convert me from -1 to -0.
The remaining -0 is because I think there's a different and more robust way to tackle your example use case:
# Mutable initialization phase >>> from dataclasses import dataclass >>> @dataclass ... class Example: ... a: int ... b: int ... >>> c = Example(None, None) >>> c Example(a=None, b=None) >>> c.a = 1 >>> c.b = 2 >>> c Example(a=1, b=2)
# Frozen usage phase >>> @dataclass(frozen=True) ... class LockedExample(Example): ... pass ... >>> c.__class__ = LockedExample >>> c.a = 1 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ncoghlan/devel/cpython/Lib/dataclasses.py", line 448, in _frozen_setattr raise FrozenInstanceError(f'cannot assign to field {name!r}') dataclasses.FrozenInstanceError: cannot assign to field 'a' >>> c.b = 2 Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/ncoghlan/devel/cpython/Lib/dataclasses.py", line 448, in _frozen_setattr raise FrozenInstanceError(f'cannot assign to field {name!r}') dataclasses.FrozenInstanceError: cannot assign to field 'b' >>> hash(c) 3713081631934410656
The gist of that approach is to assume that there will be *somewhere* in the code where it's possible to declare the construction of the instance "complete", and flip the nominal class over to the frozen subclass to make further mutation unlikely, even though the true underlying type is still the mutable version.
That said, if we do provide "unsafe_hash", then the documentation for that flag becomes a place where we can explicitly suggest using a frozen subclass instead.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
-- --Guido van Rossum (python.org/~guido)
Sorry for the late reply. Still recovering from a computer failure. My only concern with this approach is: what if you don’t want any __hash__ added? Say you want to use your base class’s hashing. I guess you could always “del cls.__hash__” after the class is created, but it’s not elegant. That’s what we got from the tri-state option: never add (False), always add (True), or add if it’s safe (None). -- Eric
On Feb 5, 2018, at 12:49 AM, Guido van Rossum
wrote: Looks like this is turning into a major flamewar regardless of what I say. :-(
I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do.
I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it.
Here's my use case.
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
-- --Guido van Rossum (python.org/~guido)
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/eric%2Ba-python-dev%40tru...
That seems a rare case (though I hadn't thought of it). I had thought of
the use case where you want a frozen type without a hash; that you can
presumably implement using
def __hash__(self): raise TypeError("not hashable")
We can do a similar thing to preserve the superclass __hash__ if it's rare
enough:
def __hash__(self): return super().__hash__()
If at all possible I'd like to kill the tri-state hash= flag -- the amount
of time spent creating and discussing the huge table in the bpo issue are
an indication of how much effort it would take other people to understand
it.
If either of those use cases becomes annoyingly common we'll have to think
of something else.
On Tue, Feb 6, 2018 at 5:38 PM, Eric V. Smith
Sorry for the late reply. Still recovering from a computer failure.
My only concern with this approach is: what if you don’t want any __hash__ added? Say you want to use your base class’s hashing. I guess you could always “del cls.__hash__” after the class is created, but it’s not elegant.
That’s what we got from the tri-state option: never add (False), always add (True), or add if it’s safe (None).
-- Eric
On Feb 5, 2018, at 12:49 AM, Guido van Rossum
wrote: Looks like this is turning into a major flamewar regardless of what I say. :-(
I really don't want to lose the ability to add a hash function to a mutable dataclass by flipping a flag in the decorator. I'll explain below. But I am fine if this flag has a name that clearly signals it's an unsafe thing to do.
I propose to replace the existing (as of 3.7.0b1) hash= keyword for the @dataclass decorator with a simpler flag named unsafe_hash=. This would be a simple bool (not a tri-state flag like the current hash=None|False|True). The default would be False, and the behavior then would be to add a hash function automatically only if it's safe (using the same rules as for hash=None currently). With unsafe_hash=True, a hash function would always be generated that takes all fields into account except those declared using field(hash=False). If there's already a `def __hash__` in the function I don't care what it does, maybe it should raise rather than quietly doing nothing or quietly overwriting it.
Here's my use case.
A frozen class requires a lot of discipline, since you have to compute the values of all fields before calling the constructor. A mutable class allows other initialization patterns, e.g. manually setting some fields after the instance has been constructed, or having a separate non-dunder init() method. There may be good reasons for using these patterns, e.g. the object may be part of a cycle (e.g. parent/child links in a tree). Or you may just use one of these patterns because you're a pretty casual coder. Or you're modeling something external.
My point is that once you have one of those patterns in place, changing your code to avoid them may be difficult. And yet your code may treat the objects as essentially immutable after the initialization phase (e.g. a parse tree). So if you create a dataclass and start coding like that for a while, and much later you need to put one of these into a set or use it as a dict key, switching to frozen=True may not be a quick option. And writing a __hash__ method by hand may feel like a lot of busywork. So this is where [unsafe_]hash=True would come in handy.
I think naming the flag unsafe_hash should take away most objections, since it will be clear that this is not a safe thing to do. People who don't understand the danger are likely to copy a worse solution from StackOverflow anyway. The docs can point to frozen=True and explain the danger.
-- --Guido van Rossum (python.org/~guido)
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ eric%2Ba-python-dev%40trueblade.com
-- --Guido van Rossum (python.org/~guido)
On 02/06/2018 06:48 PM, Guido van Rossum wrote:
That seems a rare case (though I hadn't thought of it). I had thought of the use case where you want a frozen type without a hash; that you can presumably implement using
def __hash__(self): raise TypeError("not hashable")
We can do a similar thing to preserve the superclass __hash__ if it's rare enough:
def __hash__(self): return super().__hash__()
If at all possible I'd like to kill the tri-state hash= flag -- the amount of time spent creating and discussing the huge table in the bpo issue are an indication of how much effort it would take other people to understand it.
I think the biggest reason this has become so complicated is because we are refusing to use an Enum: class Hashable(Enum): IF_SAFE = 1 ADD = 2 DEFER = 3 NONE = 4 IF_SAFE is currently the False value. ADD is currently the True value DEFER means don't add one NONE means set __hash__ to None The only thing missing now is a setting to indicate that dataclass should do nothing if the class already has a __hash__ method -- possibly DEFER, although I think IF_SAFE can include "the class already has one, it's not safe to override it".
If either of those use cases becomes annoyingly common we'll have to think of something else.
Or we could solve it now and not have to deal with backwards-compatibility issues in the future. -- ~Ethan~
I'm not a fan, sorry.
On Tue, Feb 6, 2018 at 7:33 PM, Ethan Furman
On 02/06/2018 06:48 PM, Guido van Rossum wrote:
That seems a rare case (though I hadn't thought of it). I had thought of
the use case where you want a frozen type without a hash; that you can presumably implement using
def __hash__(self): raise TypeError("not hashable")
We can do a similar thing to preserve the superclass __hash__ if it's rare enough:
def __hash__(self): return super().__hash__()
If at all possible I'd like to kill the tri-state hash= flag -- the amount of time spent creating and discussing the huge table in the bpo issue are an indication of how much effort it would take other people to understand it.
I think the biggest reason this has become so complicated is because we are refusing to use an Enum:
class Hashable(Enum): IF_SAFE = 1 ADD = 2 DEFER = 3 NONE = 4
IF_SAFE is currently the False value. ADD is currently the True value DEFER means don't add one NONE means set __hash__ to None
The only thing missing now is a setting to indicate that dataclass should do nothing if the class already has a __hash__ method -- possibly DEFER, although I think IF_SAFE can include "the class already has one, it's not safe to override it".
If either of those use cases becomes annoyingly common we'll have to think
of something else.
Or we could solve it now and not have to deal with backwards-compatibility issues in the future.
-- ~Ethan~ _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido% 40python.org
-- --Guido van Rossum (python.org/~guido)
On 7 February 2018 at 12:48, Guido van Rossum
That seems a rare case (though I hadn't thought of it). I had thought of the use case where you want a frozen type without a hash; that you can presumably implement using
def __hash__(self): raise TypeError("not hashable")
Now that attributes in the class dict pre-empt the generated versions, "__hash__ = None" in the class body will also turn off hash generation without enabling hashing.
We can do a similar thing to preserve the superclass __hash__ if it's rare enough:
def __hash__(self): return super().__hash__()
Similarly for this variant, you can do "__hash__ = BaseClassWithDesiredHashAlgorithm.__hash__". (That may be more appropriate than dynamic lookup in the MRO if you're extending a frozen class with a subclass that adds more fields, but you want to keep using the base class hash definition for some reason)
If at all possible I'd like to kill the tri-state hash= flag -- the amount of time spent creating and discussing the huge table in the bpo issue are an indication of how much effort it would take other people to understand it.
+1 - the nice thing about "unsafe_hash=True" is that you *only* need it if adding the hash would be unsafe, and you haven't set __hash__ explicitly in the class body. While "frozen=True, unsafe_hash=True" is redundant (since the hash is safe when instances are frozen), it isn't a surprising defect waiting to happen the way "frozen=False, hash=True" is in the current interface. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 2 February 2018 at 15:38, Elvis Pranskevichus
On Friday, February 2, 2018 10:08:43 AM EST Eric V. Smith wrote:
However, I don't feel very strongly about this. As I've said, I expect the use cases for hash=True to be very, very rare.
Why do you think that the requirement to make a dataclass hashable is a "very, very rare" requirement? The moment you want to use a dataclass a a dict key, or put it in a set, you need it to be hashable.
Just put yourself in the shoes of an average Python developer. You try to put a dataclass in a set, you get a TypeError. Your immediate reaction is to add "hash=True". Things appear to work. Then, you, or someone else, decides to mutate the dataclass object and then you are looking at a very frustrating debug session.
If I saw someone try to put a dataclass into a set, I'd point out that dataclasses are *mutable*, and if they want immutable values they should use "frozen=True". If it were me in that situation, that's what I'd do as well. Adding hashability to a mutable object would *never* be my immediate reaction. To put it another way, using your words above, "The moment you want to use a dataclass a a dict key, or put it in a set, you need it to be *immutable*" (not hashable, unless you really know what you're doing). Paul
On Fri, Feb 2, 2018 at 10:51 AM, Paul Moore
To put it another way, using your words above, "The moment you want to use a dataclass a a dict key, or put it in a set, you need it to be *immutable*" (not hashable, unless you really know what you're doing).
Can someone clarify what is the actual use case of someone *knowingly* making a mutable collection hashable? Why can't that advanced user write their own __hash__ implementation? It's easy to do so. For what it's worth I think this argument is being blindly used to justify the current questionable design of "dataclass(hash=True)" being the same as "dataclass(hash=True, frozen=False) case. At least a few other core developers are concerned with this, but all I see is "attrs does the same thing". Eric, in my opinion we shouldn't copy attrs. It was designed as an external package with its own backwards-compatibility story. At some point it was realized that "attrs(hash=True, frozen=False)" is an anti-pattern, but it couldn't be removed at that point. Hence the warning in the documentation. We can do better. We are designing a new API that is going to be hugely popular. Why can't we ship it with dangerous options prohibited in 3.7 (it's easy to do that!) and then enable them in 3.8 when there's an actual clear use case? Yury
On 02/02/2018 08:14 AM, Yury Selivanov wrote:
Eric, in my opinion we shouldn't copy attrs. [...]
We are designing a new API that is going to be hugely popular. Why can't we ship it with dangerous options prohibited in 3.7 (it's easy to do that!) and then enable them in 3.8 when there's an actual clear use case?
+1 -- ~Ethan~
On Fri, Feb 2, 2018 at 7:38 AM, Elvis Pranskevichus
On Friday, February 2, 2018 10:08:43 AM EST Eric V. Smith wrote:
However, I don't feel very strongly about this. As I've said, I expect the use cases for hash=True to be very, very rare.
Why do you think that the requirement to make a dataclass hashable is a "very, very rare" requirement?
I think what's rare is wanting hashability without it being frozen.
Just put yourself in the shoes of an average Python developer. You try to put a dataclass in a set, you get a TypeError. Your immediate reaction is to add "hash=True". Things appear to work.
agreed, the easy and obvious way should be to make it frozen -- if it's hard to make it hashable and not frozen, then that's good. But it is nice to have the __hash__ generated more you.... so maybe a flag for "unfrozen_hashable" -- really klunky, but if that is a rare need, then there you go. Or maybe: If either hash or frozen is specified, it become both frozen and hashable. If both are specified, then it does what the user is asking for. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 3 Feb. 2018 1:09 am, "Eric V. Smith"
It appears Eric and I are the only ones in favor of keeping the current
behavior. But I still am not convinced by all the worries about "attractive
nuisances" and all the other bad names this feature has been called. We
don't know that any of the doomsday scenarios will happen. In my
experience, usually once something is rolled out the set of issues that are
*actually* raised is entirely different from the things its designers were
worried about.
Please don't commit a change to roll this back without checking in with me;
I have some misgivings about the problem being raised here that I still
need to think through more carefully. In the meantime, please try to use
dataclasses with 3.7b1!
On Fri, Feb 2, 2018 at 10:25 PM, Nick Coghlan
On 3 Feb. 2018 1:09 am, "Eric V. Smith"
wrote: The problem with dropping hash=True is: how would you write __hash__ yourself? It seems like a bug magnet if you're adding fields to the class and forget to update __hash__, especially in the presence of per-field hash=False and eq=False settings. And you'd need to make sure it matches the generated __eq__ (if 2 objects are equal, they need to have the same hash value).
I think anyone that does this needs to think *very* carefully about how they do it, and offering both "hash=True" and "frozen=True" is an attractive nuisance that means people will write "hash=True" when what they wanted was "frozen=True".
In particular, having to work out how write a maintainable "__hash__" will encourage folks to separate out the hashed fields as a separate frozen subrecord or base class.
If this proves to be an intolerable burden then the short hand spelling could be added back in 3.8, but once we ship it we're going to be stuck with explaining the interactions indefinitely.
Cheers, Nick.
_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ guido%40python.org
-- --Guido van Rossum (python.org/~guido)
On 02/02/2018 10:44 PM, Guido van Rossum wrote:
It appears Eric and I are the only ones in favor of keeping the current behavior. But I still am not convinced by all the worries about "attractive nuisances" and all the other bad names this feature has been called. We don't know that any of the doomsday scenarios will happen. In my experience, usually once something is rolled out the set of issues that are *actually* raised is entirely different from the things its designers were worried about.
This may all be true, but consider how many times we have asked, "How does attrs handle that?" It would be wise to also ask, "What pitfalls have attrs discovered, and what would they do different if they could?" -- ~Ethan~
As a Bear of Relatively Little Brain, I've grown up understanding, and
teaching, that mutable things aren't to be used as dict keys. I'm aware
that immutability isn't strictly the required condition, but it for most
people, that's the primary reason for using frozen sets and tuples, for
example, and immutability serves as a practical and comprehensible first
approximation. So I'm at a loss to understand why I am being offered a
feature that (especially during maintenance by a different developer) might
be prone to bizarre errors caused by a change in hash.
I realise that this won't happen very often, but the difficulty of the
debug task should surely merit at least some warning for us bears - you
know, the ones that take your work and use it to do mundane things with.
On a slightly tangential note, us bears are very glad that such questions
are taken seriously and discussed in such depth. Thank you all.
Steve Holden
On Sat, Feb 3, 2018 at 6:44 AM, Guido van Rossum
It appears Eric and I are the only ones in favor of keeping the current behavior. But I still am not convinced by all the worries about "attractive nuisances" and all the other bad names this feature has been called. We don't know that any of the doomsday scenarios will happen. In my experience, usually once something is rolled out the set of issues that are *actually* raised is entirely different from the things its designers were worried about.
Please don't commit a change to roll this back without checking in with me; I have some misgivings about the problem being raised here that I still need to think through more carefully. In the meantime, please try to use dataclasses with 3.7b1!
On Fri, Feb 2, 2018 at 10:25 PM, Nick Coghlan
wrote: On 3 Feb. 2018 1:09 am, "Eric V. Smith"
wrote: The problem with dropping hash=True is: how would you write __hash__ yourself? It seems like a bug magnet if you're adding fields to the class and forget to update __hash__, especially in the presence of per-field hash=False and eq=False settings. And you'd need to make sure it matches the generated __eq__ (if 2 objects are equal, they need to have the same hash value).
I think anyone that does this needs to think *very* carefully about how they do it, and offering both "hash=True" and "frozen=True" is an attractive nuisance that means people will write "hash=True" when what they wanted was "frozen=True".
In particular, having to work out how write a maintainable "__hash__" will encourage folks to separate out the hashed fields as a separate frozen subrecord or base class.
If this proves to be an intolerable burden then the short hand spelling could be added back in 3.8, but once we ship it we're going to be stuck with explaining the interactions indefinitely.
Cheers, Nick.
On Fri, Feb 2, 2018 at 10:25 PM Nick Coghlan
On 3 Feb. 2018 1:09 am, "Eric V. Smith"
wrote: The problem with dropping hash=True is: how would you write __hash__ yourself? It seems like a bug magnet if you're adding fields to the class and forget to update __hash__, especially in the presence of per-field hash=False and eq=False settings. And you'd need to make sure it matches the generated __eq__ (if 2 objects are equal, they need to have the same hash value).
I think anyone that does this needs to think *very* carefully about how they do it, and offering both "hash=True" and "frozen=True" is an attractive nuisance that means people will write "hash=True" when what they wanted was "frozen=True".
In particular, having to work out how write a maintainable "__hash__" will encourage folks to separate out the hashed fields as a separate frozen subrecord or base class.
If this proves to be an intolerable burden then the short hand spelling could be added back in 3.8, but once we ship it we're going to be stuck with explaining the interactions indefinitely.
+1 Nick put words to my chief concerns. It is easy for an author see hash=True in existing code somewhere (cargo culting) and assume it does what they want, or quickly glance at the the API and see "hash=True" without actually taking the time to understand the implications of that to see that the parameter named "frozen" is the one they are supposed to want that _safely_ makes their dataclass properly hashable, not the more attractive parameter named "hash" that enables dangerous behavior. Forcing people who need a __hash__ method to write it explicitly sounds like a good thing to me. I am not at all worried about someone forgetting to add a new field to an implementation of the __hash__ method when adding a new field, the fields and __hash__ method are all defined in the same place in the code. I expect someone with a common need for always having a __hash__ method will produce a library on top of dataclasses that implements something like our current hash=True behavior. If that kind of thing turns out to be widely used, we can reintroduce the feature in dataclasses in 3.8 or later, informed by what we see practical uses actually doing. In my practical experience, people writing Python code do not need to learn and understand the intricacies of what it means to have a __hash__ method, be hashable, or "frozen". We intentionally warn people against writing dunder methods other than __init__ in their code as they are often power features with less obvious semantics than it may seem at first glance making such code harder to maintain. Even calling the parameter "hash=" and saying it adds a __hash__ method as the PEP currently does seems to launder the danger, washing away the "dunder smell" that adding a special considerations __hash__ method carries. The PEP (and presumably forthcoming dataclasses module documentation) says "This is a specialized use case and should be considered carefully" which I agree with. But any time we suggest that in an API, how about having the API name make it clear that this is special and not to be done lightly? I guess i'm arguing against using "hash=" as the arg name in favor of "danger_there_be_vorpal_rabbits_hash_me_maybe=" or something more usefully similar if we're going to have it. -gps
participants (16)
-
Chris Barker
-
Chris Barker - NOAA Federal
-
David Mertz
-
Elvis Pranskevichus
-
Eric V. Smith
-
Ethan Furman
-
Glenn Linderman
-
Gregory P. Smith
-
Guido van Rossum
-
Nathaniel Smith
-
Nick Coghlan
-
Paul Moore
-
Steve Holden
-
Steven D'Aprano
-
Terry Reedy
-
Yury Selivanov