On Fri, Dec 30, 2016 at 3:54 PM, Christian Heimes <christian@python.org> wrote:
Hi,

I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
hashlib module.

Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the original post consider hash(large_tuple)
a performance issue? If you have an object with lots of members that
affect both __hash__ and __eq__, then __hash__ is really the least of
your concerns. The hash has to be computed just once and then stays the
same over the lifetime of the object. Once computed, the hash can be
cached.

On the other hand, __eq__ is called at least once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of
size n, for both a match *and* a hashmap miss. Every __eq__ call has to
compare between 1 and m member attributes. For a dict of 1,000 elements
with 1,000 members each, that's just 1,000 hash computations with
roughly 8 kB of memory allocation, but almost a million comparisons in
the worst case.

A hasher object adds further overhead, e.g. object allocation, creation
of a bound method for each call, etc. It's also less CPU-cache friendly
than the linear data structure of a tuple.


Thanks for joining the discussion; it's great to have your input.

In the use cases I described, the objects' members are ordered. So in the unlikely event that two objects hash to the same value but are unequal, the __eq__ call should be cheap, because they probably differ in length or on their first member, and can skip further comparison. After a successful hash lookup of an object that's already in a set or dict, a successful identity check avoids an expensive equality check. Perhaps I misunderstood?
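
To make the identity shortcut concrete: CPython's dict and set lookups
compare a candidate key by identity before falling back to __eq__, so a
small probe class (invented here just for illustration) never sees an
__eq__ call when looking up the very object that was stored:

class Probe:
    calls = 0

    def __hash__(self):
        return 42

    def __eq__(self, other):
        Probe.calls += 1
        return self is other

p = Probe()
s = {p}
assert p in s       # hash matches, then the identity check succeeds
print(Probe.calls)  # 0 on CPython: __eq__ was never invoked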

Here is some example code:

from functools import reduce

class FrozenOrderedCollection:
    ...

    def __eq__(self, other):
        if not isinstance(other, FrozenOrderedCollection):
            return NotImplemented
        # Cheap checks first: differing lengths, or an early mismatched
        # member, short-circuit the comparison.
        if len(self) != len(other):
            return False
        return all(i == j for (i, j) in zip(self, other))

    def __hash__(self):
        # Compute the hash at most once; it stays valid for the
        # lifetime of the (immutable) object.
        if hasattr(self, '_hashval'):
            return self._hashval
        hash_initial = hash(self.__class__)
        # Fold each member into the running hash via a small two-tuple,
        # rather than building one large tuple of every member.
        self._hashval = reduce(lambda h, i: hash((h, i)), self, hash_initial)
        return self._hashval
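
To show the caching in action, here is a minimal concrete subclass
(FrozenList and its internals are invented for illustration; any sized,
iterable subclass would do):

class FrozenList(FrozenOrderedCollection):
    def __init__(self, items):
        self._items = tuple(items)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

x = FrozenList(range(1000))
hash(x)  # first call folds in all 1,000 members
hash(x)  # second call returns the cached value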



Is it the case that, internally, the code to compute the hash of such an object in incremental fashion is mostly already there, and it's just a matter of figuring out the right API to expose it?
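
If so, I imagine the exposed recipe could look something like the sketch
below. This is purely speculative on my part: hash_stream is a made-up
name, and blake2b merely stands in for whatever internal function would
actually be used.

import hashlib

def hash_stream(iterable):
    # Feed each member's hash into a streaming hash function
    # instead of materializing one large tuple.
    h = hashlib.blake2b(digest_size=8)
    for item in iterable:
        # hash() returns a signed machine-word-sized integer.
        h.update(hash(item).to_bytes(8, 'little', signed=True))
    return int.from_bytes(h.digest(), 'little', signed=True)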

If creating an arbitrarily large tuple, passing it to hash(), and then throwing it away really is the best practice, then perhaps that should be documented explicitly, since I think many would find it counterintuitive?
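
For concreteness, the idiom in question (assuming the collection is
iterable) amounts to:

def __hash__(self):
    # Builds a tuple of all members only to discard it after
    # a single hash() call.
    return hash(tuple(self))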

@Stephen J. Turnbull, thank you for your input -- still digesting, but very interesting.

Thanks again to everyone for the helpful discussion.