[Python-ideas] incremental hashing in __hash__
jab at math.brown.edu
Tue Dec 27 22:13:59 EST 2016
Suppose you have implemented an immutable Position type to represent
the state of a game played on an MxN board, where the board size can
grow quite large.
Or suppose you have implemented an immutable, ordered collection type.
For example, the collections-extended package provides a
frozensetlist[1]. One of my own packages provides a frozen, ordered
bidirectional mapping type.[2]
These types should be hashable so that they can be inserted into sets
and mappings. But because their contents are order-sensitive, they
cannot use the built-in collections.abc.Set._hash() helper in their
__hash__ implementations: that helper is order-insensitive, so it
would needlessly cause hash collisions between objects that compare
unequal only because they contain the same items in a different
order. (A quick demonstration follows.)
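Since Set._hash() only requires len() and iteration, it can be called
on an ordered collection directly, which makes the problem easy to
see (a quick illustration only, not something to rely on):

from collections.abc import Set

# Same items in a different order hash identically via Set._hash(),
# even though an ordered collection would compare the two unequal.
assert Set._hash((1, 2, 3)) == Set._hash((3, 2, 1))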
According to https://docs.python.org/3/reference/datamodel.html#object.__hash__:
"""
it is advised to mix together the hash values of the components of the
object that also play a part in comparison of objects by packing them
into a tuple and hashing the tuple. Example:
    def __hash__(self):
        return hash((self.name, self.nick, self.color))
"""
Applying this advice to the use cases above would require creating an
arbitrarily large tuple in memory before passing it to hash(), which
is then just thrown away. It would be preferable if there were a way
to pass multiple values to hash() in a streaming fashion, such that
the overall hash were computed incrementally, without building up a
large object in memory first.
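Concretely, the straightforward application of that advice to such a
collection looks something like this sketch (assuming the class is
iterable over its items): it materializes a tuple of all n items just
to feed it to hash():

class FrozenOrderedCollection:

    def __hash__(self):
        # Builds an O(n) tuple in memory, hashes it, then discards it.
        return hash(tuple(self))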
Should there be better support for this use case? Perhaps hash() could
support an alternative signature, allowing it to accept a stream of
values whose combined hash would be computed incrementally in
*constant* space and linear time, e.g. "hash(items=iter(self))".
In the meantime, what is the best way to incrementally compute a good
hash value for such objects using built-in Python routines? As a
library author, I would rather use a routine with explicit support for
incremental hashing than worry about how to correctly combine the
results of multiple hash(contained_item) calls in library code.
(Simply XORing those results together is not order-sensitive, and so
wouldn't work; an ad-hoc, order-sensitive combiner is sketched below.)
A routine with explicit support for incremental hashing would let
libraries focus on doing one thing well.[3,4,5]
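For example, one way to do that combining order-sensitively in
constant space might look like the following sketch. (The seed and
multiplier are arbitrary choices here, with none of the analysis
behind CPython's own tuple hash.)

import sys

def combine_hashes(items):
    # Fold hash(item) values together order-sensitively, in O(1) space.
    # Ad-hoc mixing for illustration only; not a vetted hash function.
    mask = (1 << sys.hash_info.hash_bits) - 1
    hv = 0x345678  # arbitrary nonzero seed
    for item in items:
        hv = ((hv * 1000003) ^ hash(item)) & mask
    return hv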
I know that hashlib provides algorithms that support incremental
hashing, but their digests are at least 128 bits. Since anything
beyond sys.hash_info.hash_bits (e.g. 64) bits gets thrown away,
anything in hashlib seems like overkill. Am I right in thinking
that's the wrong tool for the job?
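For comparison, the hashlib route would look roughly like the
following sketch. It is incremental and constant-space, but it
computes a 256-bit digest only to keep hash_bits of it:

import hashlib, struct, sys

def hash_via_hashlib(items):
    # Feed each item's hash() into an incremental hashlib object, then
    # truncate the digest to the number of bits Python actually keeps.
    h = hashlib.sha256()
    for item in items:
        h.update(struct.pack('q', hash(item)))  # 8-byte signed int
    mask = (1 << sys.hash_info.hash_bits) - 1
    return int.from_bytes(h.digest(), 'big') & mask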
On the other hand, would binascii.crc32 be suitable, at least for
32-bit systems? (And is there some 64-bit incremental hash algorithm
available for 64-bit systems? It seems Python has no support for crc64
built in.) For example:
import binascii, struct

class FrozenOrderedCollection:

    def __hash__(self):
        # Computed lazily on first use, then cached.  (A single
        # underscore avoids the name mangling that would make a
        # hasattr(self, '__hashval') check always fail.)
        if hasattr(self, '_hashval'):
            return self._hashval
        # Prime with a class-specific value.
        hv = binascii.crc32(b'FrozenOrderedCollection')
        for i in self:
            # 'q' (8-byte signed) holds any hash() value on both 32- and
            # 64-bit builds, unlike the platform-dependent native '@l'.
            hv = binascii.crc32(struct.pack('q', hash(i)), hv)
        hv &= 0xffffffff
        self._hashval = hv
        return hv
Note that this example illustrates two other common requirements of
these use cases:
(i) lazily computing the hash value on first use, and then caching it
for future use
(ii) priming the overall hash value with a class-specific initial
value, so that if an instance of a different type of collection, one
that contains the same items but compares unequal, computes its hash
out of the same constituent items, the two hash values still differ.
(On that note, should the documentation at
https://docs.python.org/3/reference/datamodel.html#object.__hash__
quoted above be updated to add this advice? The current advice to
"return hash((self.name, self.nick, self.color))" causes a hash
collision with the plain tuple of the same values, even though that
tuple presumably compares unequal with the object; see the snippet
below.)
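To illustrate that last point with the attribute names from the docs'
example (the Person class here is just an assumed stand-in):

class Person:

    def __init__(self, name, nick, color):
        self.name, self.nick, self.color = name, nick, color

    def __hash__(self):
        # hash((self.name, self.nick, self.color)) would equal the hash
        # of the plain tuple of the same values; mixing in the class (or
        # any class-specific constant) keeps the two from colliding.
        return hash((Person, self.name, self.nick, self.color))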
To summarize these questions:
1. Should hash() add support for incremental hashing?
2. In the meantime, what is the best way to compute a hash of a
combination of many values incrementally (in constant space and linear
time), using only what's available in the standard library? Ideally
there is some routine that uses exactly sys.hash_info.hash_bits bits
and that does the combining of incremental results for you.
3. Should the https://docs.python.org/3/reference/datamodel.html#object.__hash__
documentation be updated to include suitable advice for these use
cases, in particular, that the overall hash value should be computed
lazily, incrementally, and should be primed with a class-unique value?
Thanks in advance for a helpful discussion, and best wishes.
Josh
References:
[1] http://collections-extended.lenzm.net/api.html#collections_extended.frozensetlist
[2] https://bidict.readthedocs.io/en/dev/api.html#bidict.frozenorderedbidict
[3] http://stackoverflow.com/questions/2909106/python-whats-a-correct-and-good-way-to-implement-hash#comment28193015_19073010
[4] http://stackoverflow.com/a/2909572/161642
[5] http://stackoverflow.com/a/27952689/161642