Tim Peters wrote:
> Sorry! A previous attempt to reply got sent before I typed anything :-(
No worries, I only saw the link in the footer to the PSF CoC, and I was mildly concerned for a moment. My first thought was "Oh no, what did I do?". Thanks for clearing that up (:
> The collision resolution strategy for sets evolved to be fancier than
> for dicts, to reduce cache misses. This is important for sets because
> the _only_ interesting thing about an element wrt a set is whether or
> not the set contains it. Lookup speed is everything for sets.
Huh, that's quite interesting. For some reason, I had assumed in the back of my head (without giving it much thought) that the average collision rate would be the same for set items and dict keys. Thanks for the useful information.
> If you use a contiguous range of "reasonable" integers for keys, the
> integer hash function is perfect: there's never a collision. So any
> such test misses all the work Raymond did to speed set lookups.
> String keys have sufficiently "random" hashes to reliably create
> collisions, though. Cache misses can, of course, have massive effects
> on timing.
Ah, I forgot to consider how the hash function actually works for continuous integers. A better comparison to demonstrate the collision differences would likely use random strings.
Also, I believe that max "reasonable" integer range of no collision is (-2305843009213693951, 2305843009213693951), as defined in Include/pyhash.h (
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pyhash.h#L28).
> In the former case you're primarily measuring the time to look up the
> "add" method of sets (that's more expensive than adding 0 to the set).
> A better comparison would, e.g., move `add = s.add` to a setup line,
> and use plain "add(0)" in the loop.
Good point, there's also what Chris Angelico pointed out as well: dicts have a significant advantage in this case for having a literal for initialization. From my understanding, this ends up having a minimal performance impact in most realistic code (similar to method lookup time), but it makes a very substantial difference in this case since initialization of the containers takes up a significant portion of each iteration.
I suspect my initialization comparison might also be flawed for similar reasons, but I'm glad that at least 3/5 of my comparisons were mostly reasonable. So far, the performance difference between set.update() and dict.update() were the most notable.
I'll likely spend some more time experimenting in the next couple of weeks or so, and report back if I find any interesting results. In the meantime, anyone can certainly feel free to improve upon my existing comparisons.