[Python-ideas] incremental hashing in __hash__
Nick Coghlan
ncoghlan at gmail.com
Sat Dec 31 07:17:28 EST 2016
On 31 December 2016 at 15:13, <jab at math.brown.edu> wrote:
> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav at gmail.com> wrote:
>
>> How likely is it that you'll have this form of collision, rather than
>> some other? Remember, collisions *will* happen, so don't try to eliminate
>> them all; just try to minimize the chances of *too many* collisions. So if
>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>> possible cause of a collision is no worse than any other. Keep your
>> algorithm simple, and don't sweat the details that you aren't sure matter.
>
>
> In the context of writing a collections library, and not an application,
> it should work well for a diversity of workloads that your users could
> throw at it. In that context, it's hard to know what to do with advice like
> this. "Heck, just hash the first three items and call it a day!"
>
Yes, this is essentially what we're suggesting you do - start with a "good
enough" hash that may have scalability problems (e.g. due to memory
copying) or mathematical distribution problems (e.g. due to a naive
mathematical combination of values), and then improve it over time based on
real world usage of the library.
Alternatively, you could take the existing tuple hashing algorithm and
reimplement that in pure Python:
https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
Based on microbenchmarks, you could then find the size breakpoint where it
makes sense to switch between "hash(tuple(self))" (with memory copying, but
a more optimised implementation of the algorithm) and a pure Python
"tuple_hash(self)". In either case, caching the result on the instance
would minimise the computational cost over the lifetime of the object.
Cheers,
Nick.
P.S. Having realised that the tuple hash *algorithm* can be re-used without
actually creating a tuple, I'm more amenable to the idea of exposing a
"hash.from_iterable" callable that produces the same result as
"hash(tuple(iterable))" without actually creating the intermediate tuple.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20161231/9020a201/attachment-0001.html>
More information about the Python-ideas
mailing list