<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 31 December 2016 at 15:13, <span dir="ltr"><<a href="mailto:jab@math.brown.edu" target="_blank">jab@math.brown.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><span class="gmail-"><div>On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico <<a href="mailto:rosuav@gmail.com" target="_blank">rosuav@gmail.com</a>> wrote:</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">How likely is it that you'll have this form of collision, rather than some other? Remember, collisions *will* happen, so don't try to eliminate them all; just try to minimize the chances of *too many* collisions. So if you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3) and (3,1,2), then sure, you need to care about order; but otherwise, one possible cause of a collision is no worse than any other. Keep your algorithm simple, and don't sweat the details that you aren't sure matter.</blockquote><div><br></div></span><div>In the context of writing a collections library, and not an application, it should work well for a diversity of workloads that your users could throw at it. In that context, it's hard to know what to do with advice like this. "Heck, just hash the first three items and call it a day!"</div></div></div></div></div></blockquote><div><br></div><div>Yes, this is essentially what we're suggesting you do - start with a "good enough" hash that may have scalability problems (e.g. due to memory copying) or mathematical distribution problems (e.g. due to a naive mathematical combination of values), and then improve it over time based on real world usage of the library. 
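As a concrete illustration of that "good enough to start with" approach, here is a minimal sketch (the class name SeqView and its attributes are hypothetical, not from any real library): it combines element hashes with a simple XOR-and-multiply mix and caches the result on the instance:

```python
# Hypothetical "good enough" starting point for a hashable collection:
# combine element hashes with a simple XOR-and-multiply mix, and cache
# the result on the instance. The name SeqView is illustrative only.

_MASK64 = (1 << 64) - 1  # keep Python's unbounded ints to 64 bits

class SeqView:
    """A minimal hashable, order-sensitive sequence wrapper."""

    def __init__(self, items):
        self._items = tuple(items)  # the memory copy we might later avoid
        self._hash = None           # cache; computed hashes are never None

    def __hash__(self):
        if self._hash is None:
            h = 0x345678  # arbitrary non-zero seed (the one CPython uses)
            for item in self._items:
                h = ((h ^ hash(item)) * 1000003) & _MASK64
            self._hash = h
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, SeqView):
            return NotImplemented
        return self._items == other._items
```

The seed and multiplier happen to be the ones CPython's own tuple hash uses, but any reasonable odd multiplier would do for a first cut; the distribution can be improved later based on real workloads.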
<br></div><br><div>Alternatively, you could take the existing tuple hashing algorithm and reimplement that in pure Python: <a href="https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336">https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336</a><br><br></div><div>Based on microbenchmarks, you could then find the size breakpoint where it makes sense to switch between "hash(tuple(self))" (with memory copying, but a more optimised implementation of the algorithm) and a pure-Python "tuple_hash(self)". In either case, caching the result on the instance would minimise the computational cost over the lifetime of the object.<br></div></div><br></div><div class="gmail_extra">Cheers,<br></div><div class="gmail_extra">Nick.<br><br></div><div class="gmail_extra">P.S. Having realised that the tuple hash *algorithm* can be re-used without actually creating a tuple, I'm more amenable to the idea of exposing a "hash.from_iterable" callable that produces the same result as "hash(tuple(iterable))" without actually creating the intermediate tuple.<br></div><div class="gmail_extra"><br>-- <br><div class="gmail_signature">Nick Coghlan | <a href="mailto:ncoghlan@gmail.com" target="_blank">ncoghlan@gmail.com</a> | Brisbane, Australia</div>
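For anyone wanting to experiment with the pure-Python translation idea above, a rough sketch of the linked tuplehash() C code might look like this (assumes a 64-bit build, so Py_uhash_t arithmetic becomes "mask to 64 bits"; note it returns the unsigned value, where CPython reinterprets the result as a signed Py_hash_t):

```python
# Rough pure-Python translation of CPython's tuplehash() as it stood at
# the time (the FNV-style variant). It consumes any iterable of hashable
# items without building an intermediate tuple.

_MASK = (1 << 64) - 1  # Py_uhash_t arithmetic on a 64-bit build

def tuple_hash(items):
    items = list(items)   # the multiplier update needs the remaining length
    length = len(items)
    x = 0x345678
    mult = 1000003        # _PyHASH_MULTIPLIER
    for item in items:
        length -= 1
        y = hash(item) & _MASK           # view the signed hash as unsigned
        x = ((x ^ y) * mult) & _MASK
        mult = (mult + 82520 + length + length) & _MASK
    x = (x + 97531) & _MASK
    if x == _MASK:        # the C code reserves -1 for error signalling
        x = _MASK - 1
    return x
```

Caching the result on the instance, as suggested above, then means the O(n) walk over the elements happens at most once per object.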
</div></div>