[Python-ideas] incremental hashing in __hash__
jab at math.brown.edu
Sat Dec 31 00:13:07 EST 2016
On Fri, Dec 30, 2016 at 10:30 PM, Nathaniel Smith <njs at pobox.com> wrote:
> ...
> "Most hash schemes depend on having a "good" hash function, in the sense
> of simulating randomness. Python doesn't ..."
> https://github.com/python/cpython/blob/d0a2f68a/Objects/dictobject.c#L133
> ...
Thanks for that link, fascinating to read the rest of that comment!!
Btw, the part you quoted seemed more like a defense of what followed, i.e.
the choice to make hash(some_int) == some_int. I'm not sure how much the
part you quoted applies generally. For example, the
https://docs.python.org/3/reference/datamodel.html#object.__hash__ docs
don't say, "Don't worry about your __hash__ implementation, dict's
collision resolution strategy is so good it probably doesn't matter
anyway."
But they also don't discuss any of the tradeoffs you mentioned that might
be worth considering. What should you do when there are arbitrarily many
"components of the object that play a part in comparison of objects"? The
"hash((self._name, self._nick, self._color))" code snippet is the only
example they give. This leaves people with use cases like mine wondering
whether it's still advised to scale this up to the arbitrarily many items
that instances of their class can contain.
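For concreteness, here's a minimal sketch of the naive scale-up I mean
(hypothetical class and attribute names, illustration only):

    # The docs' hash((self._name, self._nick, self._color)) recipe,
    # scaled up to arbitrarily many items.
    class FrozenStuff:
        def __init__(self, items):
            self._items = tuple(items)  # immutable, so hashing is safe

        def __eq__(self, other):
            if not isinstance(other, FrozenStuff):
                return NotImplemented
            return self._items == other._items

        def __hash__(self):
            # O(n) work on every hash() call; and if the items lived
            # anywhere other than a ready-made tuple, this would also
            # allocate an O(n) temporary just to hash them.
            return hash(self._items)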
If not, then what is advised? Passing a tuple of fewer items to a single
hash() call, e.g. hash(tuple(islice(self, CUTOFF)))? Ned's recipe of
pairwise-accumulating hash() results over all the items? Or only
pairwise-accumulating up to a certain cutoff? Stephen J. Turnbull's idea to
use fewer accumulation steps and larger-than-2-tuples? Passing all the
items into some other cheap, built-in hash algorithm that actually has an
incremental update API (crc32?)?
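To make those candidates concrete, here are minimal sketches of each (the
CUTOFF value and the byte serialization in the crc32 version are arbitrary
choices of mine, not anything the docs bless):

    from itertools import islice
    from zlib import crc32

    CUTOFF = 8  # arbitrary; just to bound the work per hash() call

    def hash_prefix(items):
        # Hash only a bounded prefix of the items.
        return hash(tuple(islice(items, CUTOFF)))

    def hash_accumulate(items):
        # Pairwise accumulation (Ned's recipe, as I understand it):
        # fold each item into a running hash, two at a time.
        h = hash(())
        for item in items:
            h = hash((h, item))
        return h

    def hash_crc32(items):
        # Incremental update via a cheap built-in checksum. Note that
        # crc32 is only 32 bits and isn't designed for hash tables.
        crc = 0
        for item in items:
            item_hash = hash(item) & 0xFFFFFFFFFFFFFFFF  # to unsigned
            crc = crc32(item_hash.to_bytes(8, 'little'), crc)
        return crc

Each trades away something different: the prefix version does constant
work but collides on anything sharing a prefix, the accumulation version
touches every item but makes O(n) hash() calls, and the crc32 version
streams cheaply but weakens the hash.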
Still hoping someone can give some authoritative advice, and hope it's
still reasonable to be asking. If not, I'll cut it out.
On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav at gmail.com> wrote:
> How likely is it that you'll have this form of collision, rather than some
> other? Remember, collisions *will* happen, so don't try to eliminate them
> all; just try to minimize the chances of *too many* collisions. So if
> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
> and (3,1,2), then sure, you need to care about order; but otherwise, one
> possible cause of a collision is no worse than any other. Keep your
> algorithm simple, and don't sweat the details that you aren't sure matter.
When you're writing a collections library rather than an application, your
code has to work well for whatever diverse workloads your users throw at
it. In that context, it's hard to know what to do with advice like this.
"Heck, just hash the first three items and call it a day!"