[Python-ideas] incremental hashing in __hash__
Steven D'Aprano
steve at pearwood.info
Wed Jan 4 19:31:08 EST 2017
On Wed, Jan 04, 2017 at 04:38:05PM -0500, jab at math.brown.edu wrote:
> Instead of the proposals like "hash.from_iterable()", would it make sense
> to allow tuple.__hash__() to accept any iterable, when called as a
> classmethod?
The public API for calculating the hash of something is to call the
hash() builtin function on some object, e.g. to call tuple.__hash__ you
write hash((a, b, c)). The __hash__ dunder method is implementation, not
interface, and normally shouldn't be called directly.
Unless I'm missing something obvious, your proposal would require the
caller to call the dunder methods directly:
class X:
def __hash__(self):
return tuple.__hash__(iter(self))
I consider that a poor interface design.
But even if we decide to make an exception in this case, tuple.__hash__
is currently an ordinary instance method right now. There's probably
code that relies on that fact and expects that:
tuple.__hash__((a, b, c))
is currently the same as
(a, b, c).__hash__()
(Starting with the hash() builtin itself, I expect, although that is
easy enough to fix if needed.) Your proposal will break backwards
compatibility, as it requires a change in semantics:
(1) (a, b, c).__hash__() must keep the current behaviour, which
means behaving like a bound instance method;
(2) But tuple.__hash__ will no longer return an unbound method (actually
a function object, but the difference is unimportant) and instead will
return something that behaves like a bound class method.
Here's an implementation which does this:
http://code.activestate.com/recipes/577030-dualmethod-descriptor/
so such a thing is possible. But it breaks backwards-compatability and
introduces something which I consider to be an unclean API (calling a
dunder method directly). Unless there's a *really* strong advantage to
tuple.__hash__(...)
over
hash.from_iterable(...)
(or equivalent), I would be against this change.
> (And similarly with frozenset.__hash__(), so that the fast C
> implementation of that algorithm could be used, rather than the slow
> collections.Set._hash() implementation. Then the duplicated implementation
> in _collections_abc.py's Set._hash() could be removed completely,
> delegating to frozenset.__hash__() instead.)
This is a good point. Until now, I've been assuming that
hash.from_iterable should consider order. But frozenset shows us that
sometimes the hash should *not* consider order.
This hints that perhaps the hash.from_iterable() should have its own
optional dunder method. Or maybe we need two functions: an ordered
version and an unordered version.
Hmmm... just tossing out a wild idea here... let's get rid of the dunder
method part of your suggestion, and add new public class methods to
tuple and frozenset:
tuple.hash_from_iter(iterable)
frozenset.hash_from_iter(iterable)
That gets rid of all the objections about backwards compatibility, since
these are new methods. They're not dunder names, so there are no
objections to being used as part of the public API.
A possible objection is the question, is this functionality *actually*
important enough to bother?
Another possible objection: are these methods part of the sequence/set
API? If not, do they really belong on the tuple/frozenset? Maybe they
belong elsewhere?
> Would this API more cleanly communicate the algorithm being used and the
> implementation,
No. If you want to communicate the algorithm being used, write some
documentation.
Seriously, the public API doesn't communicate the algorithm used for the
implementation. How can it? We can keep the same interface and change
the implementation, or change the interface and keep the implementation.
The two are (mostly) independent.
> while making a smaller increase in API surface area
> compared to introducing a new function?
It's difficult to quantify "API surface area". On the one hand, we have
the addition of one or two new functions or methods. Contrast with:
* introducing a new kind of method into the built-ins (one which
behaves like a classmethod when called from the class, and like
an instance method when called from an instance);
* changing tuple.__hash__ from an ordinary method to one of the
above special methods;
* and likewise for frozenset.__hash__;
* change __hash__ from "only used as implementation, not as
interface" to "sometimes used as interface".
To me, adding one or two new methods/functions is the smaller, or at
least less disruptive, change.
--
Steve
More information about the Python-ideas
mailing list