[Python-ideas] incremental hashing in __hash__
Ned Batchelder
ned at nedbatchelder.com
Wed Dec 28 16:27:07 EST 2016
On 12/28/16 12:27 PM, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder
> <ned at nedbatchelder.com <mailto:ned at nedbatchelder.com>> wrote:
>
> You can write a simple function to use hash iteratively to hash
> the entire stream in constant space and linear time:
>
> def hash_stream(them):
> val = 0
> for it in them:
> val = hash((val, it))
> return val
>
> Although this creates N 2-tuples, they come and go, so the memory
> use won't grow. Adjust the code as needed to achieve
> canonicalization before iterating.
>
> Or maybe I am misunderstanding the requirements?
>
>
> This is better than solutions like
> http://stackoverflow.com/a/27952689/161642
> <http://stackoverflow.com/a/27952689/161642> in the sense that it's a
> little higher level (no bit shifting or magic numbers).
>
> But it's not clear that it's any better in the sense that you're still
> rolling your own incremental hash algorithm out of a lower-level
> primitive that doesn't document support for this, and therefore taking
> responsibility yourself for how well it distributes values into buckets.
>
> Are you confident this results in good hash performance? Is this
> better than a solution built on top of a hash function with an
> explicit API for calculating a hash incrementally, such as the crc32
> example I included? (And again, this would ideally be
> a sys.hash_info.hash_bits -bit algorithm.)
I don't have the theoretical background to defend this function. But it
seems to me that if we believe that hash((int, thing)) distributes well,
then how could this function not distribute well?
--Ned.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20161228/674483c1/attachment.html>
More information about the Python-ideas
mailing list