[Python-ideas] incremental hashing in __hash__
jab at math.brown.edu
Sat Dec 31 17:39:41 EST 2016
On Sat, Dec 31, 2016 at 7:17 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 31 December 2016 at 15:13, <jab at math.brown.edu> wrote:
>
>> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <rosuav at gmail.com>
>> wrote:
>>
>>> How likely is it that you'll have this form of collision, rather than
>>> some other? Remember, collisions *will* happen, so don't try to eliminate
>>> them all; just try to minimize the chances of *too many* collisions. So if
>>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>>> possible cause of a collision is no worse than any other. Keep your
>>> algorithm simple, and don't sweat the details that you aren't sure matter.
>>
>>
>> In the context of writing a collections library, and not an application,
>> it should work well for a diversity of workloads that your users could
>> throw at it. In that context, it's hard to know what to do with advice like
>> this. "Heck, just hash the first three items and call it a day!"
>>
>
> Yes, this is essentially what we're suggesting you do - start with a "good
> enough" hash that may have scalability problems (e.g. due to memory
> copying) or mathematical distribution problems (e.g. due to a naive
> mathematical combination of values), and then improve it over time based on
> real world usage of the library.
>
> Alternatively, you could take the existing tuple hashing algorithm and
> reimplement that in pure Python: https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
>
> Based on microbenchmarks, you could then find the size breakpoint where it
> makes sense to switch between "hash(tuple(self))" (with memory copying, but
> a more optimised implementation of the algorithm) and a pure Python
> "tuple_hash(self)". In either case, caching the result on the instance
> would minimise the computational cost over the lifetime of the object.
>
> Cheers,
> Nick.
>
> P.S. Having realised that the tuple hash *algorithm* can be re-used
> without actually creating a tuple, I'm more amenable to the idea of
> exposing a "hash.from_iterable" callable that produces the same result as
> "hash(tuple(iterable))" without actually creating the intermediate tuple.
>
Nice!
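For reference, here's roughly what a streaming pure-Python port of that
tuplehash algorithm could look like. This is only a sketch based on the
2016-era tupleobject.c code linked above: the 1000003 multiplier and the
sys.hash_info.width-based mask are assumptions about the build, and exact
parity with hash(tuple(...)) isn't guaranteed everywhere.

import sys

_MULT = 1000003                         # CPython's _PyHASH_MULTIPLIER (assumed)
_MASK = (1 << sys.hash_info.width) - 1  # emulate C unsigned wraparound

def tuple_hash(items):
    """Apply the tuplehash algorithm to a sized iterable, without copying it into a tuple."""
    remaining = len(items)
    x = 0x345678
    mult = _MULT
    for item in items:
        remaining -= 1
        y = hash(item) & _MASK              # reinterpret as unsigned, as the C code does
        x = ((x ^ y) * mult) & _MASK
        mult = (mult + 82520 + remaining + remaining) & _MASK
    x = (x + 97531) & _MASK
    if x > _MASK >> 1:                      # reinterpret as a signed Py_hash_t
        x -= _MASK + 1
    return -2 if x == -1 else x             # -1 is reserved as an error code

From there it would be straightforward to microbenchmark against
hash(tuple(self)) to find the size breakpoint Nick mentions.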
I just realized that, like tupleobject.c's "tuplehash" routine, the
frozenset_hash algorithm (implemented in setobject.c) can't be reused
without actually creating a frozenset either. As mentioned, a set hashing
algorithm is exposed as collections.Set._hash() in _collections_abc.py,
which can be passed an iterable, but that routine is implemented in Python.
So here again it seems like you have to choose between creating an
ephemeral copy of your data so you can use the fast C routine, and
streaming your data through a slower Python implementation. At least in
this case the Python implementation is built in, though.
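For what it's worth, since that pure-Python routine only needs len() and
iteration, it can be called unbound on an ordinary sized iterable without
building a frozenset first (illustrative use, not a documented API):

from collections.abc import Set  # the ABC whose _hash() is mentioned above

pairs = [('one', 1), ('two', 2), ('three', 3)]
# The algorithm is order-independent by construction, so any iteration
# order of the same items gives the same value; the per-item work
# happens in Python rather than C.
h = Set._hash(pairs)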
Given my current shortage of information, for now I'm thinking of handling
this problem in my library by exposing a parameter that users can tune if
needed. See bidict/_frozen.py in https://github.com/jab/bidict/commit/485bf98#diff-215302d205b9f3650d58ee0337f77297, and check out the _HASH_NITEMS_MAX attribute.
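Roughly, the idea looks something like the following. Apart from the
_HASH_NITEMS_MAX name, everything below is a made-up stand-in rather than
the actual bidict code:

from itertools import islice

class FrozenMappingSketch:
    """Hypothetical hashable mapping illustrating the tunable-hash idea."""

    _HASH_NITEMS_MAX = None  # None = hash everything; users can lower this

    def __init__(self, items):
        self._items = dict(items)
        self._hashval = None

    def __hash__(self):
        if self._hashval is None:  # cache the result, per the suggestion above
            items = self._items.items()
            if self._HASH_NITEMS_MAX is not None:
                # Only feed the first N items to the hash. In a real class,
                # __eq__ must stay consistent with whatever items get hashed.
                items = islice(items, self._HASH_NITEMS_MAX)
            # An ephemeral frozenset gets us the fast C hashing routine.
            self._hashval = hash(frozenset(items))
        return self._hashval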
I have to run for now, but thanks again everyone, and happy new year!