[Python-ideas] incremental hashing in __hash__
Steven D'Aprano
steve at pearwood.info
Thu Dec 29 03:20:00 EST 2016
On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
>
> > In other words, objects that do not compare equal can also have the same
> > hash value (although too much of that will reduce the efficiency of
> > Python's containers).
> >
>
> Yes, I realize that unequal objects can return the same hash value with
> only performance, and not correctness, suffering. It's the performance I'm
> concerned about. That's what I meant by "...to keep from unnecessarily
> causing hash collisions..." in my original message, but sorry this wasn't
> clearer. We should be able to do this in a way that doesn't increase hash
> collisions unnecessarily.
With respect, Josh, I feel that this thread is based on premature
optimization. It seems to me that you're *assuming* that anything less
than some theoretically ideal O(1) space, O(N) time hash function is
clearly and obviously unsuitable.
Of course I might be completely wrong. Perhaps you have implemented your
own __hash__ methods as suggested by the docs, as well as Ned's version,
and profiled your code and discovered that __hash__ is a significant
bottleneck. In which case, I'll apologise for doubting you, but in my
defence I'll say that the language you have used in this thread so far
gives no hint that you've actually profiled your code and found the
bottleneck.
As I see it, this thread includes a few questions:
(1) What is a good way to generate a hash one item at a time?
I think Ned's answer is "good enough", subject to evidence to the
contrary. If somebody cares to spend the time to analyse it, that's
excellent, but we're volunteers, it's the holiday period, and most
people have probably got better things to do. But we shouldn't let the
perfect be the enemy of the good.
But for what it's worth, I've done an *extremely* quick and dirty test
to see whether the incremental hash function gives a good spread of
values, by comparing it to the standard hash() function.
py> import statistics
py> def incrhash(iterable):
... h = hash(())
... for x in iterable:
... h = hash((h, x))
... return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
... it = range(i, i+100)
... data1.append(hash(tuple(it)))
... data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)
Neither the built-in hash() nor the incremental hash gives any
collisions over this (admittedly small) data set, and both have very
similar spreads of values as measured by either the standard deviation
or the statistical range. The means are quite different:
py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)
but I don't think that matters. So that's good enough for me.
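For anyone who wants to repeat the experiment, here is the same quick-and-dirty test as a standalone script rather than an interactive session. Exact hash values, standard deviations and ranges will differ across Python versions and platforms (the tuple hash algorithm changed in 3.8, and widths differ on 32-bit builds), but the collision check should come out the same:

```python
import statistics

def incrhash(iterable):
    # Ned's incremental hash: fold each item into the running
    # hash by hashing a two-tuple of (hash so far, next item).
    h = hash(())
    for x in iterable:
        h = hash((h, x))
    return h

data1 = []  # built-in hash() of the whole tuple
data2 = []  # incremental hash, one item at a time
for i in range(1000):
    it = range(i, i + 100)
    data1.append(hash(tuple(it)))
    data2.append(incrhash(it))

# Any collisions? Both should give 1000 distinct values.
print(len(set(data1)), len(set(data2)))

# Compare spread of values.
print(statistics.stdev(data1), statistics.stdev(data2))
print(max(data1) - min(data1), max(data2) - min(data2))
```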
(2) Should Ned's incremental hash, or some alternative with better
properties, go into the standard library?
I'm not convinced that your examples are common enough that the stdlib
should be burdened with supporting it. On the other hand, I don't think
it is an especially *large* burden, so perhaps it could be justified.
Count me as sitting on the fence on this one.
Perhaps a reasonable compromise would be to include it as a recipe in
the docs.
(3) If it does go in the stdlib, where should it go?
I'm suspicious of functions that change their behaviour depending on how
they are called, so I'm not keen on your suggestion of adding a second
API to the hash built-in:
hash(obj) # return hash of obj
hash(iterable=obj) # return incrementally calculated hash of obj
That feels wrong to me. I'd rather add a generator to the itertools
module:
itertools.iterhash(iterable) # yield incremental hashes
or, copying the API of itertools.chain, add a method to hash:
hash.from_iterable(iterable) # return hash calculated incrementally
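Neither API exists today, of course. As a sketch of what I have in mind, both could be built on Ned's pairwise-tuple approach; the names `iterhash` and `hash_from_iterable` below are hypothetical stand-ins for the proposed `itertools.iterhash` and `hash.from_iterable`:

```python
def iterhash(iterable):
    """Hypothetical itertools.iterhash: yield the running hash
    after each item, so a caller can observe or stop at any point."""
    h = hash(())  # same starting value as hashing an empty tuple
    for x in iterable:
        h = hash((h, x))  # fold the next item into the running hash
        yield h

def hash_from_iterable(iterable):
    """Hypothetical hash.from_iterable: return one hash for the
    whole iterable in O(1) extra space, without building a tuple."""
    h = hash(())
    for x in iterable:
        h = hash((h, x))
    return h
```

The two agree by construction: the last value yielded by `iterhash(xs)` equals `hash_from_iterable(xs)`, and for an empty iterable the generator form simply yields nothing while the function form returns `hash(())`.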
--
Steve