[Python-ideas] incremental hashing in __hash__

Steven D'Aprano steve at pearwood.info
Thu Dec 29 03:20:00 EST 2016


On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> 
> > In other words, objects that do not compare equal can also have the same
> > hash value (although too much of that will reduce the efficiency of
> > Python's containers).
> >
> 
> Yes, I realize that unequal objects can return the same hash value with
> only performance, and not correctness, suffering. It's the performance I'm
> concerned about. That's what I meant by "...to keep from unnecessarily
> causing hash collisions..." in my original message, but sorry this wasn't
> clearer. We should be able to do this in a way that doesn't increase hash
> collisions unnecessarily.

With respect, Josh, I feel that this thread is based on premature 
optimization. It seems to me that you're *assuming* that anything less 
than some theoretically ideal O(1) space O(N) time hash function is 
clearly and obviously unsuitable.

Of course I might be completely wrong. Perhaps you have implemented your 
own __hash__ methods as suggested by the docs, as well as Ned's version, 
and profiled your code and discovered that __hash__ is a significant 
bottleneck. In which case, I'll apologise for doubting you, but in my 
defence I'll say that the language you have used in this thread so far 
gives no hint that you've actually profiled your code and found the 
bottleneck.

As I see it, this thread includes a few questions:

(1) What is a good way to generate a hash one item at a time?

I think Ned's answer is "good enough", subject to evidence to the 
contrary. If somebody cares to spend the time to analyse it, that's 
excellent, but we're volunteers and it's the holiday period and most 
people have probably got better things to do. But we shouldn't let the 
perfect be the enemy of the good.

But for what it's worth, I've done an *extremely* quick and dirty test 
to see whether the incremental hash function gives a good spread of 
values, by comparing it to the standard hash() function.


py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)


Neither the built-in hash() nor the incremental hash gives any 
collisions over this (admittedly small) data set, and both have very 
similar spreads of values as measured by either the standard deviation 
or the statistical range. The means are quite different:

py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)

but I don't think that matters. So that's good enough for me.


(2) Should Ned's incremental hash, or some alternative with better 
properties, go into the standard library? 

I'm not convinced that your use cases are common enough that the 
stdlib should be burdened with supporting them. On the other hand, I 
don't think it would be an especially *large* burden, so perhaps it 
could be justified. 
Count me as sitting on the fence on this one.

Perhaps a reasonable compromise would be to include it as a recipe in 
the docs.
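
If it were written up as a recipe, the body would be essentially the 
incrhash function from my test above, packaged as a helper plus an 
example __hash__. A minimal sketch of what that might look like (the 
helper name and the example class are my own inventions, not anything 
agreed in this thread):

    def hash_from_iterable(iterable):
        # O(1)-space alternative to hash(tuple(iterable)): fold the
        # items in one at a time rather than building a big tuple.
        # Note it does NOT return the same values as hash(tuple(...));
        # it only needs to spread values comparably well.
        h = hash(())
        for item in iterable:
            h = hash((h, item))
        return h

    class Fingerprint:
        # Hypothetical example class; treat instances as immutable
        # once constructed, as hashable objects must be.
        def __init__(self, items):
            self._items = list(items)
        def __eq__(self, other):
            if isinstance(other, Fingerprint):
                return self._items == other._items
            return NotImplemented
        def __hash__(self):
            # Avoids the temporary tuple copy that
            # hash(tuple(self._items)) would create.
            return hash_from_iterable(self._items)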


(3) If it does go in the stdlib, where should it go?

I'm suspicious of functions that change their behaviour depending on how 
they are called, so I'm not keen on your suggestion of adding a second 
API to the hash built-in:

    hash(obj)  # return hash of obj

    hash(iterable=obj)  # return incrementally calculated hash of obj

That feels wrong to me. I'd rather add a generator to the itertools 
module:

    itertools.iterhash(iterable)  # yield incremental hashes

or, copying the API of itertools.chain, add a method to hash:

    hash.from_iterable(iterable)  # return hash calculated incrementally
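
For concreteness, the generator flavour might behave something like 
this; the name iterhash and the exact semantics (yield the running 
hash after each item) are my reading of the idea, not a settled 
design:

    def iterhash(iterable):
        # Hypothetical sketch: yield the running hash after each
        # item, so callers can hash any prefix of the data or keep
        # only the final value.
        h = hash(())
        for x in iterable:
            h = hash((h, x))
            yield h

The from_iterable form would be the same loop returning only the 
final value, i.e. exactly the incrhash function from my test above.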



-- 
Steve

