[Python-ideas] incremental hashing in __hash__

jab at math.brown.edu
Fri Dec 30 18:36:37 EST 2016


On Fri, Dec 30, 2016 at 3:54 PM, Christian Heimes <christian at python.org>
wrote:

> Hi,
>
> I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
> hashlib module.
>
> Before we come up with a new API or recipe, I would like to understand
> the problem first. Why does the OP consider hash(large_tuple) a
> performance issue? If you have an object with lots of members that
> affect both __hash__ and __eq__, then __hash__ is really least of your
> concern. The hash has to be computed just once and then will stay the
> same over the life time of the object. Once computed the hash can be
> cached.
>
> On the other hand, __eq__ is called at least once for every successful
> hash lookup. In the worst case it is called n-1 times for a dict of
> size n, for a match *and* a hashmap miss. Every __eq__ call has to
> compare between 1 and m member attributes. For a dict of 1,000 elements
> with 1,000 members each, that's just 1,000 hash computations with
> roughly 8 kB of memory allocation, but almost a million comparisons in
> the worst case.
>
> A hasher object adds further overhead, e.g. object allocation and the
> creation of a bound method for each call. It's also less CPU cache friendly
> than the linear data structure of a tuple.



Thanks for joining the discussion, great to have your input.

In the use cases I described, the objects' members are ordered. So in the
unlikely event that two objects hash to the same value but are unequal, the
__eq__ call should be cheap: they probably differ in length or in their
first member, and can skip further comparison. And when looking up an
object that's already in a set or dict, a successful identity check avoids
the equality check entirely. Perhaps I misunderstood?
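
For instance, here is a contrived sketch of the early exit when comparing
unequal ordered collections:

a = tuple(range(10**6))
b = (-1,) + a[1:]  # differs at index 0

# zip() and the generator expression are both lazy, so all() stops at
# the first mismatch; only one member comparison happens here.
assert not all(i == j for (i, j) in zip(a, b))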

Here is some example code for such a class:

from functools import reduce

class FrozenOrderedCollection:
    ...

    def __eq__(self, other):
        if not isinstance(other, FrozenOrderedCollection):
            return NotImplemented
        if len(self) != len(other):
            return False
        # Lazy comparison: all() exits at the first mismatching member.
        return all(i == j for (i, j) in zip(self, other))

    def __hash__(self):
        # The collection is frozen, so compute the hash once and cache it.
        if hasattr(self, '_hashval'):
            return self._hashval
        hash_initial = hash(self.__class__)
        # Fold each member into the running hash, one two-tuple at a time.
        self._hashval = reduce(lambda h, i: hash((h, i)), self, hash_initial)
        return self._hashval
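
And a minimal concrete subclass (hypothetical, just to make the recipe
runnable end to end):

class FrozenPair(FrozenOrderedCollection):
    # Hypothetical subclass for illustration only.
    def __init__(self, first, second):
        self._items = (first, second)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

p, q = FrozenPair(1, 2), FrozenPair(1, 2)
assert p == q and hash(p) == hash(q)
assert hash(p) == p._hashval  # the second call returns the cached value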



Is it the case that, internally, the code is mostly there to compute the
hash of such an object incrementally, and it's just a matter of figuring
out the right API to expose it?
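
In the meantime, here is one way to emulate incremental hashing in pure
Python (a sketch only: the function name is mine, it requires Python 3.6
for hashlib.blake2b, it bypasses the interpreter's SipHash key
randomization, and it adds hashlib overhead per element):

import hashlib

def hash_from_iterable(iterable):
    # Stream each member's own hash() into a fixed-size BLAKE2 digest,
    # then fold the 8-byte digest back into an int suitable for
    # returning from __hash__.
    h = hashlib.blake2b(digest_size=8)
    for item in iterable:
        h.update(hash(item).to_bytes(8, 'little', signed=True))
    return int.from_bytes(h.digest(), 'little', signed=True)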

If creating an arbitrarily large tuple, passing that to hash(), and then
throwing it away really is the best practice, then perhaps that should be
explicitly documented, since I think many would find that counterintuitive?
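
For reference, that tuple-building recipe would look something like this
(assuming the collection is iterable):

def __hash__(self):
    # Materialize all members into one tuple, hash it, and let the
    # tuple be garbage-collected right away.
    return hash((self.__class__,) + tuple(self))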

@Stephen J. Turnbull, thank you for your input -- still digesting, but very
interesting.

Thanks again to everyone for the helpful discussion.