[Python-ideas] incremental hashing in __hash__

Steven D'Aprano steve at pearwood.info
Wed Jan 4 19:31:08 EST 2017


On Wed, Jan 04, 2017 at 04:38:05PM -0500, jab at math.brown.edu wrote:
> Instead of the proposals like "hash.from_iterable()", would it make sense
> to allow tuple.__hash__() to accept any iterable, when called as a
> classmethod? 

The public API for calculating the hash of something is to call the 
hash() builtin function on some object, e.g. to call tuple.__hash__ you 
write hash((a, b, c)). The __hash__ dunder method is implementation, not 
interface, and normally shouldn't be called directly.

Unless I'm missing something obvious, your proposal would require the 
caller to call the dunder methods directly:

class X:
    def __hash__(self):
        return tuple.__hash__(iter(self))

I consider that a poor interface design.

But even if we decide to make an exception in this case, tuple.__hash__ 
is currently an ordinary instance method right now. There's probably 
code that relies on that fact and expects that:

    tuple.__hash__((a, b, c))

is currently the same as 

    (a, b, c).__hash__()


(Starting with the hash() builtin itself, I expect, although that is 
easy enough to fix if needed.) Your proposal will break backwards 
compatibility, as it requires a change in semantics:

(1) (a, b, c).__hash__() must keep the current behaviour, which 
means behaving like a bound instance method;

(2) But tuple.__hash__ will no longer return an unbound method (actually 
a function object, but the difference is unimportant) and instead will 
return something that behaves like a bound class method.

Here's an implementation which does this:

http://code.activestate.com/recipes/577030-dualmethod-descriptor/

so such a thing is possible. But it breaks backwards-compatability and 
introduces something which I consider to be an unclean API (calling a 
dunder method directly). Unless there's a *really* strong advantage to

    tuple.__hash__(...)

over 

    hash.from_iterable(...)

(or equivalent), I would be against this change.



> (And similarly with frozenset.__hash__(), so that the fast C
> implementation of that algorithm could be used, rather than the slow
> collections.Set._hash() implementation. Then the duplicated implementation
> in _collections_abc.py's Set._hash() could be removed completely,
> delegating to frozenset.__hash__() instead.)

This is a good point. Until now, I've been assuming that 
hash.from_iterable should consider order. But frozenset shows us that 
sometimes the hash should *not* consider order.

This hints that perhaps the hash.from_iterable() should have its own 
optional dunder method. Or maybe we need two functions: an ordered 
version and an unordered version.

Hmmm... just tossing out a wild idea here... let's get rid of the dunder 
method part of your suggestion, and add new public class methods to 
tuple and frozenset:

    tuple.hash_from_iter(iterable)
    frozenset.hash_from_iter(iterable)


That gets rid of all the objections about backwards compatibility, since 
these are new methods. They're not dunder names, so there are no 
objections to being used as part of the public API.

A possible objection is the question, is this functionality *actually* 
important enough to bother?

Another possible objection: are these methods part of the sequence/set 
API? If not, do they really belong on the tuple/frozenset? Maybe they 
belong elsewhere?



> Would this API more cleanly communicate the algorithm being used and the
> implementation, 

No. If you want to communicate the algorithm being used, write some 
documentation.

Seriously, the public API doesn't communicate the algorithm used for the 
implementation. How can it? We can keep the same interface and change 
the implementation, or change the interface and keep the implementation. 
The two are (mostly) independent.



> while making a smaller increase in API surface area
> compared to introducing a new function?

It's difficult to quantify "API surface area". On the one hand, we have 
the addition of one or two new functions or methods. Contrast with:

    * introducing a new kind of method into the built-ins (one which 
      behaves like a classmethod when called from the class, and like
      an instance method when called from an instance);

    * changing tuple.__hash__ from an ordinary method to one of the
      above special methods;

    * and likewise for frozenset.__hash__;

    * change __hash__ from "only used as implementation, not as 
      interface" to "sometimes used as interface".


To me, adding one or two new methods/functions is the smaller, or at 
least less disruptive, change.



-- 
Steve


More information about the Python-ideas mailing list