<div dir="auto">You could always try to make a Python version of the C tuple hashing function[1] (requires the total # of elements) or PyPy's[2] (seems like it would allow true incremental hashing). API idea:<div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto">hasher = IncrementalHasher()</div><div dir="auto">hasher.add(one_item_to_hash)  # updates hasher.hash property with result</div><div dir="auto"># repeat</div><div dir="auto">return hasher.hash</div><div dir="auto"><br></div><div dir="auto"><div dir="auto"><br></div><div dir="auto">[1]: <a href="https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331">https://hg.python.org/cpython/file/dcced3bd22fe/Objects/tupleobject.c#l331</a><br>[2]: <a href="https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562">https://bitbucket.org/pypy/pypy/src/d8febc18447e1f785a384d52413a345d7b3db423/rpython/rlib/objectmodel.py#objectmodel.py-562</a><br><br><div data-smartmail="gmail_signature" dir="auto">--<br>Ryan (ライアン)<br>Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else<br><a href="http://kirbyfan64.github.io/">http://kirbyfan64.github.io/</a></div></div></div><br><div class="gmail_extra" dir="auto"><br><div class="gmail_quote">On Dec 27, 2016 9:15 PM,  <<a href="mailto:jab@math.brown.edu">jab@math.brown.edu</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Suppose you have implemented an immutable Position type to represent<br>

the state of a game played on an MxN board, where the board size can<br>

grow quite large.<br>

<br>

Or suppose you have implemented an immutable, ordered collection type.<br>

For example, the collections-extended package provides a<br>

frozensetlist[1]. One of my own packages provides a frozen, ordered<br>

bidirectional mapping type.[2]<br>

<br>

These types should be hashable so that they can be inserted into sets<br>

and mappings. The order-sensitivity of the contents prevents them from<br>

using the built-in collections.Set._hash() helper in their __hash__<br>

implementations, to keep from unnecessarily causing hash collisions<br>

for objects that compare unequal due only to having a different<br>

ordering of the same set of contained items.<br>

<br>

According to <a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>

:<br>

<br>

<br>

"""<br>

it is advised to mix together the hash values of the components of the<br>

object that also play a part in comparison of objects by packing them<br>

into a tuple and hashing the tuple. Example:<br>

<br>

def __hash__(self):<br>

    return hash((<a href="http://self.name" rel="noreferrer" target="_blank">self.name</a>, self.nick, self.color))<br>

<br>

"""<br>

<br>

<br>

Applying this advice to the use cases above would require creating an<br>

arbitrarily large tuple in memory before passing it to hash(), which<br>

is then just thrown away. It would be preferable if there were a way<br>

to pass multiple values to hash() in a streaming fashion, such that<br>

the overall hash were computed incrementally, without building up a<br>

large object in memory first.<br>

<br>

Should there be better support for this use case? Perhaps hash() could<br>

support an alternative signature, allowing it to accept a stream of<br>

values whose combined hash would be computed incrementally in<br>

*constant* space and linear time, e.g. "hash(items=iter(self))".<br>

<br>

In the meantime, what is the best way to incrementally compute a good<br>

hash value for such objects using built-in Python routines? (As a<br>

library author, it would be preferable to use a routine with explicit<br>

support for computing a hash incrementally, rather than having to<br>

worry about how to correctly combine results from multiple calls to<br>

hash(contained_item) in library code. (Simply XORing such results<br>

together would not be order-sensitive, and so wouldn't work.) Using a<br>

routine with explicit support for incremental hashing would allow<br>

libraries to focus on doing one thing well.[3,4,5])<br>

<br>

I know that hashlib provides algorithms that support incremental<br>

hashing, but those use at least 128 bits. Since hash() throws out<br>

anything beyond sys.hash_info.hash_bits (e.g. 64) bits, anything in<br>

hashlib seems like overkill. Am I right in thinking that's the wrong<br>

tool for the job?<br>

<br>

On the other hand, would binascii.crc32 be suitable, at least for<br>

32-bit systems? (And is there some 64-bit incremental hash algorithm<br>

available for 64-bit systems? It seems Python has no support for crc64<br>

built in.) For example:<br>

<br>

import binascii, struct<br>

<br>

class FrozenOrderedCollection:<br>

    def __hash__(self):<br>

        if hasattr(self, '__hashval'):  # Computed lazily.<br>

            return self.__hashval<br>

        hv = crc32(b'<wbr>FrozenOrderedCollection')<br>

        for i in self:<br>

            hv = binascii.crc32(struct.pack('@<wbr>l', hash(i)), hv)<br>

        hv &= 0xffffffff<br>

        self.__hashval = hv<br>

        return hv<br>

<br>

<br>

Note that this example illustrates two other common requirements of<br>

these use cases:<br>

<br>

(i) lazily computing the hash value on first use, and then caching it<br>

for future use<br>

<br>

(ii) priming the overall hash value with some class-specific initial<br>

value, so that if an instance of a different type of collection, which<br>

comprised the same items but which compared unequal, were to compute<br>

its hash value out of the same constituent items, we make sure our<br>

hash value differs. (On that note, should the documentation in<br>

<a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>

quoted above be updated to add this advice? The current advice to<br>

"return hash((<a href="http://self.name" rel="noreferrer" target="_blank">self.name</a>, self.nick, self.color))" would cause a hash<br>

collision with a tuple of the same values, even though the tuple<br>

should presumably compare unequal with this object.)<br>

<br>

To summarize these questions:<br>

<br>

1. Should hash() add support for incremental hashing?<br>

<br>

2. In the meantime, what is the best way to compute a hash of a<br>

combination of many values incrementally (in constant space and linear<br>

time), using only what's available in the standard library? Ideally<br>

there is some routine available that uses exactly hash_info.hash_bits<br>

number of bits, and that does the combining of incremental results for<br>

you.<br>

<br>

3. Should the <a href="https://docs.python.org/3/reference/datamodel.html#object.__hash__" rel="noreferrer" target="_blank">https://docs.python.org/3/<wbr>reference/datamodel.html#<wbr>object.__hash__</a><br>

documentation be updated to include suitable advice for these use<br>

cases, in particular, that the overall hash value should be computed<br>

lazily, incrementally, and should be primed with a class-unique value?<br>

<br>

Thanks in advance for a helpful discussion, and best wishes.<br>

<br>

Josh<br>

<br>

<br>

References:<br>

<br>

[1] <a href="http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist" rel="noreferrer" target="_blank">http://collections-extended.<wbr>lenzm.net/api.html#<wbr>collections_extended.<wbr>frozensetlist</a><br>

<br>

[2] <a href="https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict" rel="noreferrer" target="_blank">https://bidict.readthedocs.io/<wbr>en/dev/api.html#bidict.<wbr>frozenorderedbidict</a><br>

<br>

[3] <a href="http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010" rel="noreferrer" target="_blank">http://stackoverflow.com/<wbr>questions/2909106/python-<wbr>whats-a-correct-and-good-way-<wbr>to-implement-hash#<wbr>comment28193015_19073010</a><br>

<br>

[4] <a href="http://stackoverflow.com/a/2909572/161642" rel="noreferrer" target="_blank">http://stackoverflow.com/a/<wbr>2909572/161642</a><br>

<br>

[5] <a href="http://stackoverflow.com/a/27952689/161642" rel="noreferrer" target="_blank">http://stackoverflow.com/a/<wbr>27952689/161642</a><br>

______________________________<wbr>_________________<br>

Python-ideas mailing list<br>

<a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/python-ideas</a><br>

Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/<wbr>codeofconduct/</a><br>

</blockquote></div><br></div></div>