[Python-ideas] incremental hashing in __hash__

Christian Heimes christian at python.org
Fri Dec 30 15:54:59 EST 2016


On 2016-12-30 20:59, Matthias Bussonnier wrote:
> On Fri, Dec 30, 2016 at 5:24 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>>
>> I understood the "iterhash" suggestion to be akin to itertools.accumulate:
>>
>>     >>> for value, tally in enumerate(accumulate(range(10))): print(value, ...
> 
> It reminds me of hmac[1]/hashlib[2], with the API: h.update(...)
> before a .digest().
> It is slightly lower level than a `from_iterable`, but would be a bit
> more flexible.
> If the API were kept similar, things would be easier to remember.
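
For reference, hashlib's incremental API (the update()/digest() pattern
referenced above) looks roughly like this; the algorithm and input
chunks below are arbitrary placeholders:

    import hashlib

    h = hashlib.sha256()              # hasher object
    for chunk in (b"spam", b"eggs", b"ham"):
        h.update(chunk)               # feed data incrementally
    print(h.hexdigest())              # finalize; same as hashing the concatenation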

Hi,

I'm the author of PEP 456 (SipHash24) and one of the maintainers of the
hashlib module.

Before we come up with a new API or recipe, I would like to understand
the problem first. Why does the original poster consider
hash(large_tuple) a performance issue? If you have an object with lots
of members that affect both __hash__ and __eq__, then __hash__ is
really the least of your concerns. The hash has to be computed just
once and then stays the same over the lifetime of the object. Once
computed, the hash can be cached.
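
A minimal sketch of that caching pattern (class and attribute names are
hypothetical, not an existing API):

    class Record:
        def __init__(self, *members):
            self.members = members
            self._hash = None                    # filled in on first use

        def __hash__(self):
            if self._hash is None:
                self._hash = hash(self.members)  # one tuple hash, then cached
            return self._hash

        def __eq__(self, other):
            return isinstance(other, Record) and self.members == other.members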

On the other hand, __eq__ is called at least once for every successful
hash lookup. In the worst case it is called n-1 times for a dict of
size n, whether the lookup ends in a match *or* a miss. Every __eq__
call has to compare between 1 and m member attributes. For a dict with
1,000 elements, each with 1,000 members, that's just 1,000 hash
computations with roughly 8 kB of memory allocation, but almost a
million comparisons in the worst case.
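
To illustrate that worst case, here is a small, purely hypothetical
experiment that forces every key into the same hash bucket and counts
the __eq__ calls for a single lookup:

    class Colliding:
        eq_calls = 0

        def __init__(self, value):
            self.value = value

        def __hash__(self):
            return 42                    # all instances collide on purpose

        def __eq__(self, other):
            Colliding.eq_calls += 1
            return self.value == other.value

    keys = [Colliding(i) for i in range(1000)]
    d = dict.fromkeys(keys)
    Colliding.eq_calls = 0
    d[Colliding(999)]                    # probes past all earlier entries
    print(Colliding.eq_calls)            # roughly 1,000 comparisons for one lookup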

A hasher object adds further overhead, e.g. object allocation and the
creation of a bound method for each call. It's also less CPU-cache
friendly than the linear data structure of a tuple.

Christian


