<div dir="auto">You could always try to make a Python version of the C tuple hashing function[1] (requires the total # of elements) or PyPy's[2] (seems like it would allow true incremental hashing). API idea:<div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">hasher = IncrementalHasher()</div><div dir="auto">hasher.add(one_item_to_hash) # updates hasher.hash property with result</div><div dir="auto"># repeat</div><div dir="auto">return hasher.hash</div><div dir="auto"><br></div><div dir="auto"><div dir="auto"><br></div><div dir="auto">[1]: <a href="https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331">https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331</a><br>[2]: <a href="https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562">https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562</a><br><br><div data-smartmail="gmail_signature" dir="auto">--<br>Ryan (ライアン)<br>Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else<br><a href="http://kirbyfan64.github.io/">http://kirbyfan64.github.io/</a></div></div></div><br><div class="gmail_extra" dir="auto"><br><div class="gmail_quote">On Dec 27, 2016 9:15 PM, <<a href="mailto:jab@math.brown.edu">jab@math.brown.edu</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Suppose you have implemented an immutable Position type to represent<br>
the state of a game played on an MxN board, where the board size can<br>
grow quite large.<br>
<br>
Or suppose you have implemented an immutable, ordered collection type.<br>
For example, the collections-extended package provides a<br>
frozensetlist[1]. One of my own packages provides a frozen, ordered<br>
bidirectional mapping type.[2]<br>
<br>
These types should be hashable so that they can be inserted into sets<br>
and mappings. The order-sensitivity of the contents prevents them from<br>
using the built-in collections.Set._hash() helper in their __hash__<br>
implementations, to keep from unnecessarily causing hash collisions<br>
for objects that compare unequal due only to having a different<br>
ordering of the same set of contained items.<br>
<br>
According to <a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>
:<br>
<br>
<br>
"""<br>
it is advised to mix together the hash values of the components of the<br>
object that also play a part in comparison of objects by packing them<br>
into a tuple and hashing the tuple. Example:<br>
<br>
def __hash__(self):<br>
return hash((<a href="http://self.name" rel="noreferrer" target="_blank">self.name</a>, self.nick, self.color))<br>
<br>
"""<br>
<br>
<br>
Applying this advice to the use cases above would require creating an<br>
arbitrarily large tuple in memory before passing it to hash(), which<br>
is then just thrown away. It would be preferable if there were a way<br>
to pass multiple values to hash() in a streaming fashion, such that<br>
the overall hash were computed incrementally, without building up a<br>
large object in memory first.<br>
<br>
Should there be better support for this use case? Perhaps hash() could<br>
support an alternative signature, allowing it to accept a stream of<br>
values whose combined hash would be computed incrementally in<br>
*constant* space and linear time, e.g. "hash(items=iter(self))".<br>
<br>
In the meantime, what is the best way to incrementally compute a good<br>
hash value for such objects using built-in Python routines? (As a<br>
library author, it would be preferable to use a routine with explicit<br>
support for computing a hash incrementally, rather than having to<br>
worry about how to correctly combine results from multiple calls to<br>
hash(contained_item) in library code. (Simply XORing such results<br>
together would not be order-sensitive, and so wouldn't work.) Using a<br>
routine with explicit support for incremental hashing would allow<br>
libraries to focus on doing one thing well.[3,4,5])<br>
<br>
I know that hashlib provides algorithms that support incremental<br>
hashing, but those use at least 128 bits. Since hash() throws out<br>
anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in<br>
hashlib seems like overkill. Am I right in thinking that's the wrong<br>
tool for the job?<br>
<br>
On the other hand, would binascii.crc32 be suitable, at least for<br>
32-bit systems? (And is there some 64-bit incremental hash algorithm<br>
available for 64-bit systems? It seems Python has no support for crc64<br>
built in.) For example:<br>
<br>
import binascii, struct<br>
<br>
class FrozenOrderedCollection:<br>
def __hash__(self):<br>
if hasattr(self, '__hashval'): # Computed lazily.<br>
return self.__hashval<br>
hv = crc32(b'<wbr>FrozenOrderedCollection')<br>
for i in self:<br>
hv = binascii.crc32(struct.pack('@<wbr>l', hash(i)), hv)<br>
hv &= 0xffffffff<br>
self.__hashval = hv<br>
return hv<br>
<br>
<br>
Note that this example illustrates two other common requirements of<br>
these use cases:<br>
<br>
(i) lazily computing the hash value on first use, and then caching it<br>
for future use<br>
<br>
(ii) priming the overall hash value with some class-specific initial<br>
value, so that if an instance of a different type of collection, which<br>
comprised the same items but which compared unequal, were to compute<br>
its hash value out of the same constituent items, we make sure our<br>
hash value differs. (On that note, should the documentation in<br>
<a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>
quoted above be updated to add this advice? The current advice to<br>
"return hash((<a href="http://self.name" rel="noreferrer" target="_blank">self.name</a>, self.nick, self.color))" would cause a hash<br>
collision with a tuple of the same values, even though the tuple<br>
should presumably compare unequal with this object.)<br>
<br>
To summarize these questions:<br>
<br>
1. Should hash() add support for incremental hashing?<br>
<br>
2. In the meantime, what is the best way to compute a hash of a<br>
combination of many values incrementally (in constant space and linear<br>
time), using only what's available in the standard library? Ideally<br>
there is some routine available that uses exactly hash_info.hash_bits<br>
number of bits, and that does the combining of incremental results for<br>
you.<br>
<br>
3. Should the <a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>
documentation be updated to include suitable advice for these use<br>
cases, in particular, that the overall hash value should be computed<br>
lazily, incrementally, and should be primed with a class-unique value?<br>
<br>
Thanks in advance for a helpful discussion, and best wishes.<br>
<br>
Josh<br>
<br>
<br>
References:<br>
<br>
[1] <a href="http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist" rel="noreferrer" target="_blank">http://collections-extended.<wbr>lenzm.net/api.html#<wbr>collections_extended.<wbr>frozensetlist</a><br>
<br>
[2] <a href="https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict" rel="noreferrer" target="_blank">https://bidict.readthedocs.io/<wbr>en/dev/api.html#bidict.<wbr>frozenorderedbidict</a><br>
<br>
[3] <a href="http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010" rel="noreferrer" target="_blank">http://stackoverflow.com/<wbr>questions/2909106/python-<wbr>whats-a-correct-and-good-way-<wbr>to-implement-hash#<wbr>comment28193015_19073010</a><br>
<br>
[4] <a href="http://stackoverflow.com/a/2909572/161642" rel="noreferrer" target="_blank">http://stackoverflow.com/a/<wbr>2909572/161642</a><br>
<br>
[5] <a href="http://stackoverflow.com/a/27952689/161642" rel="noreferrer" target="_blank">http://stackoverflow.com/a/<wbr>27952689/161642</a><br>
______________________________<wbr>_________________<br>
Python-ideas mailing list<br>
<a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/python-ideas</a><br>
Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/<wbr>codeofconduct/</a><br>
</blockquote></div><br></div></div>