geremy condra wrote:
On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky wrote:
> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser wrote:
>> .. and "x = iter(s).next()" raises a StopIteration exception.
> And that's why the documented recipe should probably recommend next(iter(s), default) instead, especially since iter(s).next() is not even valid code in Python 3.
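As a minimal sketch of that recommended spelling (the helper name `first` is hypothetical, not from the thread), next() with a default argument swallows the StopIteration and works in both Python 2.6+ and 3.x:

```python
def first(s, default=None):
    # next(iterator, default) returns the first item yielded by the
    # iterator, or `default` if the iterable is empty -- no
    # StopIteration escapes to the caller.
    return next(iter(s), default)

print(first({10, 20, 30}))    # some arbitrary element of the set
print(first(set(), "empty"))  # the default, since the set is empty
```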
This seems reasonably legible to you? It strikes me as coding by incantation. Also, while I've heard people say that the naive approach is slower, I'm not getting that result. Here's my test:
smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
smrt.repeat(10)
[1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
 0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
 0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
 0.56846404075622559]
naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
naive.repeat(10)
[0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
 0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
 0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
 0.53491711616516113]
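As a sanity check on the naive approach (a sketch, not part of the timing above): pop() followed by add() of the same element leaves the set's contents unchanged, which is why it can stand in as a non-destructive "peek":

```python
s = set(range(100))
x = s.pop()   # removes and returns an arbitrary element
s.add(x)      # put it straight back
assert x in s
assert s == set(range(100))  # contents are unchanged by the round trip
```

The caveat is that it briefly mutates the set, so it is not safe if anything else is iterating over s concurrently.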
Perhaps my test is flawed in some way?
Geremy Condra
Well, it also uses a fairly trivial set. set(range(100)) will generally have 0 collisions, and everything will hash to a unique bucket. As such, popping an item out of the hash is a single "val = table[int & mask]; table[int & mask] = _dummy", and then looking it up again requires 2 table lookups (one finds the _dummy, the next finds that the chain is broken and can rewrite the _dummy). However, if a set is more full, or has more collisions, or ... then pop() and add() become relatively more expensive.

Surprisingly to me, "next(iter(s))" was actually slower than .pop() + .add() for a 100-node set in my testing:

$ alias TIMEIT="python -m timeit -s 's = set(range(100))'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.263 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:

$ TIMEIT "for x in s: break"
10000000 loops, best of 3: 0.0943 usec per loop

Now, what about a set with *lots* of collisions? Create 100 keys that all hash to the same bucket:

$ alias TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in range(100)])'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.257 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

$ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
1000 loops, best of 3: 255 usec per loop

Now something seems terribly wrong here. next(iter(s)) suddenly jumps from being < 0.3 usec to being more than 200 usec, or ~1000x slower. I realize the set above has 900k keys, which is big, but next(iter(s)) should only touch 1 and, in fact, should always return the *first* entry. My best guess is simply that the *first* entry in the internal set table is no longer close to offset 0, which means that next(iter(s)) has to scan a bunch of empty table slots before it finds a non-empty entry.
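To see why the x*1024*1024 keys all collide, here is a sketch based on CPython's open-addressing scheme, where a small int hashes to itself and the initial slot is hash & mask for a power-of-two table mask (the specific mask value below is an assumption for illustration, not CPython's exact table size):

```python
# 1024*1024 == 2**20, so every key has its low 20 bits zero.
# Any table mask smaller than 2**20 therefore sends all of them
# to the same initial slot.
keys = [x * 1024 * 1024 for x in range(100)]

mask = 255  # hypothetical mask for a small power-of-two table
slots = {k & mask for k in keys}
print(slots)  # a single slot -> maximal collisions
```

The same argument holds for any mask below 2**20, which is why the probe sequence, not the initial hash, does all the work for this set.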
Anyway, none of the proposals will ever really be faster than:

for x in s: break

It is a bit of an ugly construct, but there is no attribute lookup, etc. As long as you don't do:

for x in s: pass

it stays nice and fast.

John =:->
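For completeness, the for/break idiom can be hidden inside a helper so the ugliness lives in one place (the name `arbitrary_element` is hypothetical, not from the thread):

```python
def arbitrary_element(s):
    # The for loop grabs the first element the iterator yields and
    # returns immediately, avoiding the builtin lookup and the
    # iterator-protocol overhead of next(iter(s)).
    for x in s:
        return x
    raise KeyError("arbitrary_element(): container is empty")

print(arbitrary_element({42}))  # 42, the only element
```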