[Python-ideas] incremental hashing in __hash__

Matt Gilson matt at getpattern.com
Thu Jan 5 04:00:37 EST 2017


But, I think that the problem with adding `__hash__` to
collections.abc.Iterable is that not all iterables are immutable -- And if
they aren't immutable, then allowing them to be hashed is likely to be a
pretty bad idea...

I'm still having a hard time being convinced that this is very much of an
optimization at all ...

If you start hashing tuples that are large enough that memory is a concern,
then that's going to also take a *really* long time and probably be
prohibitive anyway.  Just for kicks, I decided to throw together a simple
script to time how much penalty you pay for hashing a tuple:

class F(object):
    def __init__(self, arg):
        self.arg = arg

    def __hash__(self):
        return hash(tuple(self.arg))


class T(object):
    def __init__(self, arg):
        self.arg = tuple(arg)

    def __hash__(self):
        return hash(self.arg)


class C(object):
    def __init__(self, arg):
        self.arg = tuple(arg)
        self._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(self.arg))
        return self._hash

import timeit

print(timeit.timeit('hash(f)', 'from __main__ import F; f =
F(list(range(500)))'))
print(timeit.timeit('hash(t)', 'from __main__ import T; t =
T(list(range(500)))'))
print(timeit.timeit('hash(c)', 'from __main__ import C; c =
C(list(range(500)))'))

results = []
for i in range(1, 11):
    n = i * 100
    t1 = timeit.timeit('hash(f)', 'from __main__ import F; f =
F(list(range(%d)))' % i)
    t2 = timeit.timeit('hash(t)', 'from __main__ import T; t =
T(list(range(%d)))' % i)
    results.append(t1/t2)
print(results)


F is going to create a new tuple each time and then hash it.  T already has
a tuple, so we'll only pay the cost of hashing a tuple, not the cost of
constructing a tuple and C caches the hash value and re-uses it once it is
known.  C is the winner by a factor of 10 or more (no surprise there).  But
the real interesting thing is that the the ratio of the timing results from
hashing `F` vs. `T` is relatively constant in the range of my test (up to
1000 elements) and that ratio's value is approximately 1.3.  For most
applications, that seems reasonable.  If you really need a speed-up, then I
suppose you could recode the thing in Cython and see what happens, but I
doubt that will be frequently necessary.  If you _do_ code it up in Cython,
put it up on Pypi and see if people use it...


On Wed, Jan 4, 2017 at 5:04 PM, Neil Girdhar <mistersheik at gmail.com> wrote:

> Couldn't you add __hash__ to collections.abc.Iterable ?  Essentially,
> expose __hash__ there; then all iterables automatically have a default hash
> that hashes their ordered contents.
>
> On Wednesday, January 4, 2017 at 7:37:26 PM UTC-5, Steven D'Aprano wrote:
>>
>> On Wed, Jan 04, 2017 at 04:38:05PM -0500, j... at math.brown.edu wrote:
>> > Instead of the proposals like "hash.from_iterable()", would it make
>> sense
>> > to allow tuple.__hash__() to accept any iterable, when called as a
>> > classmethod?
>>
>> The public API for calculating the hash of something is to call the
>> hash() builtin function on some object, e.g. to call tuple.__hash__ you
>> write hash((a, b, c)). The __hash__ dunder method is implementation, not
>> interface, and normally shouldn't be called directly.
>>
>> Unless I'm missing something obvious, your proposal would require the
>> caller to call the dunder methods directly:
>>
>> class X:
>>     def __hash__(self):
>>         return tuple.__hash__(iter(self))
>>
>> I consider that a poor interface design.
>>
>> But even if we decide to make an exception in this case, tuple.__hash__
>> is currently an ordinary instance method right now. There's probably
>> code that relies on that fact and expects that:
>>
>>     tuple.__hash__((a, b, c))
>>
>> is currently the same as
>>
>>     (a, b, c).__hash__()
>>
>>
>> (Starting with the hash() builtin itself, I expect, although that is
>> easy enough to fix if needed.) Your proposal will break backwards
>> compatibility, as it requires a change in semantics:
>>
>> (1) (a, b, c).__hash__() must keep the current behaviour, which
>> means behaving like a bound instance method;
>>
>> (2) But tuple.__hash__ will no longer return an unbound method (actually
>> a function object, but the difference is unimportant) and instead will
>> return something that behaves like a bound class method.
>>
>> Here's an implementation which does this:
>>
>> http://code.activestate.com/recipes/577030-dualmethod-descriptor/
>>
>> so such a thing is possible. But it breaks backwards-compatability and
>> introduces something which I consider to be an unclean API (calling a
>> dunder method directly). Unless there's a *really* strong advantage to
>>
>>     tuple.__hash__(...)
>>
>> over
>>
>>     hash.from_iterable(...)
>>
>> (or equivalent), I would be against this change.
>>
>>
>>
>> > (And similarly with frozenset.__hash__(), so that the fast C
>> > implementation of that algorithm could be used, rather than the slow
>> > collections.Set._hash() implementation. Then the duplicated
>> implementation
>> > in _collections_abc.py's Set._hash() could be removed completely,
>> > delegating to frozenset.__hash__() instead.)
>>
>> This is a good point. Until now, I've been assuming that
>> hash.from_iterable should consider order. But frozenset shows us that
>> sometimes the hash should *not* consider order.
>>
>> This hints that perhaps the hash.from_iterable() should have its own
>> optional dunder method. Or maybe we need two functions: an ordered
>> version and an unordered version.
>>
>> Hmmm... just tossing out a wild idea here... let's get rid of the dunder
>> method part of your suggestion, and add new public class methods to
>> tuple and frozenset:
>>
>>     tuple.hash_from_iter(iterable)
>>     frozenset.hash_from_iter(iterable)
>>
>>
>> That gets rid of all the objections about backwards compatibility, since
>> these are new methods. They're not dunder names, so there are no
>> objections to being used as part of the public API.
>>
>> A possible objection is the question, is this functionality *actually*
>> important enough to bother?
>>
>> Another possible objection: are these methods part of the sequence/set
>> API? If not, do they really belong on the tuple/frozenset? Maybe they
>> belong elsewhere?
>>
>>
>>
>> > Would this API more cleanly communicate the algorithm being used and
>> the
>> > implementation,
>>
>> No. If you want to communicate the algorithm being used, write some
>> documentation.
>>
>> Seriously, the public API doesn't communicate the algorithm used for the
>> implementation. How can it? We can keep the same interface and change
>> the implementation, or change the interface and keep the implementation.
>> The two are (mostly) independent.
>>
>>
>>
>> > while making a smaller increase in API surface area
>> > compared to introducing a new function?
>>
>> It's difficult to quantify "API surface area". On the one hand, we have
>> the addition of one or two new functions or methods. Contrast with:
>>
>>     * introducing a new kind of method into the built-ins (one which
>>       behaves like a classmethod when called from the class, and like
>>       an instance method when called from an instance);
>>
>>     * changing tuple.__hash__ from an ordinary method to one of the
>>       above special methods;
>>
>>     * and likewise for frozenset.__hash__;
>>
>>     * change __hash__ from "only used as implementation, not as
>>       interface" to "sometimes used as interface".
>>
>>
>> To me, adding one or two new methods/functions is the smaller, or at
>> least less disruptive, change.
>>
>>
>>
>> --
>> Steve
>> _______________________________________________
>> Python-ideas mailing list
>> Python... at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>



-- 

[image: pattern-sig.png]

Matt Gilson // SOFTWARE ENGINEER

E: matt at getpattern.com // P: 603.892.7736

We’re looking for beta testers.  Go here
<https://www.getpattern.com/meetpattern> to sign up!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20170105/62b72661/attachment-0001.html>


More information about the Python-ideas mailing list