[Python-ideas] incremental hashing in __hash__

Ned Batchelder ned at nedbatchelder.com
Wed Dec 28 16:27:07 EST 2016


On 12/28/16 12:27 PM, jab at math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder
> <ned at nedbatchelder.com <mailto:ned at nedbatchelder.com>> wrote:
>
>     You can write a simple function to use hash iteratively to hash
>     the entire stream in constant space and linear time:
>
>         def hash_stream(them):
>             val = 0
>             for it in them:
>                 val = hash((val, it))
>             return val
>
>     Although this creates N 2-tuples, they come and go, so the memory
>     use won't grow.  Adjust the code as needed to achieve
>     canonicalization before iterating.
>
>     Or maybe I am misunderstanding the requirements?
>
>
> This is better than solutions like
> http://stackoverflow.com/a/27952689/161642
> <http://stackoverflow.com/a/27952689/161642> in the sense that it's a
> little higher level (no bit shifting or magic numbers).
>
> But it's not clear that it's any better in the sense that you're still
> rolling your own incremental hash algorithm out of a lower-level
> primitive that doesn't document support for this, and therefore taking
> responsibility yourself for how well it distributes values into buckets.
>
> Are you confident this results in good hash performance? Is this
> better than a solution built on top of a hash function with an
> explicit API for calculating a hash incrementally, such as the crc32
> example I included? (And again, this would ideally be
> a sys.hash_info.hash_bits -bit algorithm.)

I don't have the theoretical background to defend this function.  But it
seems to me that if we believe that hash((int, thing)) distributes well,
then how could this function not distribute well?

--Ned.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20161228/674483c1/attachment.html>


More information about the Python-ideas mailing list