<div dir="ltr">Thanks for the thoughtful discussion, it's been very interesting.<div><br></div><div>Hash algorithms seem particularly sensitive and tricky to get right, with a great deal of research going into choices of constants, etc. and lots of gotchas. So it seemed worth asking about. If I had to bet on whether repeatedly accumulating pairwise hash() results would maintain the same desired properties that hash(tuple(items)) guarantees, I'd want to get confirmation from someone with expertise in this first, hence my starting this thread.</div><div><br></div><div>But as you showed, it's certainly possible to do some exploration in the meantime. Prompted by your helpful comparison, I just put together <a href="https://gist.github.com/jab/fd78b3acd25b3530e0e21f5aaee3c674" target="_blank">https://gist.github.com/jab/<wbr>fd78b3acd25b3530e0e21f5aaee3c6<wbr>74</a> to further compare hash_tuple vs. hash_incremental.</div><div><br></div><div>Based on that, hash_incremental seems to have a comparable distribution to hash_tuple. I'm not sure if the methodology there is sound, as I'm new to analysis like this. So I'd still welcome confirmation from someone with actual expertise in Python's internal hash algorithms. But so far this sure seems good enough for the use cases I described.</div><div><br></div><div>Given sufficiently good distribution, I'd expect there to be unanimous agreement that an immutable collection, which could contain arbitrarily many items, should strongly prefer hash_incremental(self) over hash(tuple(self)), for the same reason we use generator comprehensions instead of list comprehensions when appropriate. Please correct me if I'm wrong.</div><div><br></div><div>+1 for the "hash.from_iterable" API you proposed, if some additional support for this is added to Python. Otherwise +1 for including Ned's recipe in the docs. Again, happy to submit a patch for either of these if it would be helpful.</div><div><br></div><div>And to be clear, I really appreciate the time that contributors have put into this thread, and into Python in general. Thoughtful responses are always appreciated, and never expected. I'm just interested in learning and in helping improve Python when I might have an opportunity. My Python open source work has been done on a voluntary basis too, and I haven't even gotten to use Python for paid/closed source work in several years, alas.</div><div><br></div><div>Thanks again,<br>Josh<br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano <span dir="ltr"><<a href="mailto:steve@pearwood.info" target="_blank">steve@pearwood.info</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_9207376106917616085gmail-HOEnZb"><div class="m_9207376106917616085gmail-h5">On Wed, Dec 28, 2016 at 12:44:55PM -0500, <a href="mailto:jab@math.brown.edu" target="_blank">jab@math.brown.edu</a> wrote:<br>
On Thu, Dec 29, 2016 at 3:20 AM, Steven D'Aprano <steve@pearwood.info> wrote:

On Wed, Dec 28, 2016 at 12:44:55PM -0500, jab@math.brown.edu wrote:
> On Wed, Dec 28, 2016 at 12:10 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
>
> > In other words, objects that do not compare equal can also have the same
> > hash value (although too much of that will reduce the efficiency of
> > Python's containers).
> >
>
> Yes, I realize that unequal objects can return the same hash value with
> only performance, and not correctness, suffering. It's the performance I'm
> concerned about. That's what I meant by "...to keep from unnecessarily
> causing hash collisions..." in my original message, but sorry this wasn't
> clearer. We should be able to do this in a way that doesn't increase hash
> collisions unnecessarily.

With respect Josh, I feel that this thread is based on premature
optimization. It seems to me that you're *assuming* that anything less
than some theoretically ideal O(1) space O(N) time hash function is
clearly and obviously unsuitable.

Of course I might be completely wrong. Perhaps you have implemented your
own __hash__ methods as suggested by the docs, as well as Ned's version,
and profiled your code and discovered that __hash__ is a significant
bottleneck. In which case, I'll apologise for doubting you, but in my
defence I'll say that the language you have used in this thread so far
gives no hint that you've actually profiled your code and found the
bottleneck.

As I see it, this thread includes a few questions:

(1) What is a good way to generate a hash one item at a time?

I think Ned's answer is "good enough", subject to evidence to the
contrary. If somebody cares to spend the time to analyse it, that's
excellent, but we're volunteers and it's the holiday period and most
people have probably got better things to do. But we shouldn't let the
perfect be the enemy of the good.

But for what it's worth, I've done an *extremely* quick and dirty test
to see whether the incremental hash function gives a good spread of
values, by comparing it to the standard hash() function.

py> import statistics
py> def incrhash(iterable):
...     h = hash(())
...     for x in iterable:
...         h = hash((h, x))
...     return h
...
py>
py> data1 = []
py> data2 = []
py> for i in range(1000):
...     it = range(i, i+100)
...     data1.append(hash(tuple(it)))
...     data2.append(incrhash(it))
...
py> # Are there any collisions?
... len(set(data1)), len(set(data2))
(1000, 1000)
py> # compare spread of values
... statistics.stdev(data1), statistics.stdev(data2)
(1231914201.0980585, 1227850884.443638)
py> max(data1)-min(data1), max(data2)-min(data2)
(4287398438, 4287569008)

Neither the built-in hash() nor the incremental hash gives any
collisions over this (admittedly small) data set, and both have very
similar spreads of values as measured by either the standard deviation
or the statistical range. The means are quite different:

py> statistics.mean(data1), statistics.mean(data2)
(-8577110.944, 2854438.568)

but I don't think that matters. So that's good enough for me.

(2) Should Ned's incremental hash, or some alternative with better
properties, go into the standard library?

I'm not convinced that your examples are common enough that the stdlib
should be burdened with supporting it. On the other hand, I don't think
it is an especially *large* burden, so perhaps it could be justified.
Count me as sitting on the fence on this one.

Perhaps a reasonable compromise would be to include it as a recipe in
the docs.

(3) If it does go in the stdlib, where should it go?

I'm suspicious of functions that change their behaviour depending on how
they are called, so I'm not keen on your suggestion of adding a second
API to the hash built-in:

hash(obj) # return hash of obj

hash(iterable=obj) # return incrementally calculated hash of obj

That feels wrong to me. I'd rather add a generator to the itertools
module:

itertools.iterhash(iterable) # yield incremental hashes

or, copying the API of itertools.chain, add a method to hash:

hash.from_iterable(iterable) # return hash calculated incrementally
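
To make the comparison concrete, here is a rough pure-Python sketch of both
spellings. It is really just Ned's recipe in two shapes, with
hash_from_iterable standing in for the proposed hash.from_iterable:

def iterhash(iterable):
    # Yield the running hash after each item is mixed in.
    h = hash(())
    for x in iterable:
        h = hash((h, x))
        yield h

def hash_from_iterable(iterable):
    # Return only the final running hash (hash(()) for an empty iterable).
    h = hash(())
    for x in iterable:
        h = hash((h, x))
    return h

Whichever spelling we prefer, the underlying mixing step is the same.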
<span class="m_9207376106917616085gmail-HOEnZb"><font color="#888888"><br>
<br>
<br>
--<br>
Steve</font></span></blockquote></div></div></div></div>