geremy condra wrote:
On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky wrote:
> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser wrote:
>> .. and "x = iter(s).next()" raises a StopIteration exception.
> And that's why the documented recipe should probably recommend next(iter(s), default) instead, especially since iter(s).next() is not even valid code in Python 3.
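As a minimal sketch of that recommended spelling (the helper name `first` is hypothetical, not from the thread), next() with a default argument swallows the StopIteration and works in both Python 2.6+ and 3.x:

```python
def first(s, default=None):
    # next(iterator, default) returns the first item yielded by the
    # iterator, or `default` if the iterable is empty -- no
    # StopIteration escapes to the caller.
    return next(iter(s), default)

print(first({10, 20, 30}))    # some arbitrary element of the set
print(first(set(), "empty"))  # the default, since the set is empty
```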
This seems reasonably legible to you? It strikes me as coding by incantation. Also, while I've heard people say that the naive approach is slower, I'm not getting that result. Here's my test:
smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
smrt.repeat(10)
[1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
 0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
 0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
 0.56846404075622559]
naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
naive.repeat(10)
[0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
 0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
 0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
 0.53491711616516113]
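As a sanity check on the naive approach (a sketch, not part of the timing above): pop() followed by add() of the same element leaves the set's contents unchanged, which is why it can stand in as a non-destructive "peek":

```python
s = set(range(100))
x = s.pop()   # removes and returns an arbitrary element
s.add(x)      # put it straight back
assert x in s
assert s == set(range(100))  # contents are unchanged by the round trip
```

The caveat is that it briefly mutates the set, so it is not safe if anything else is iterating over s concurrently.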
Perhaps my test is flawed in some way?
Geremy Condra
Well, it also uses a fairly trivial set. set(range(100)) will generally have 0 collisions, and everything will hash to a unique bucket. As such, popping an item out of the hash is a single "val = table[int & mask]; table[int & mask] = _dummy", and then looking it up again requires 2 table lookups (one finds the _dummy, the next finds that the chain is broken and can rewrite the _dummy). However, if a set is more full, or has more collisions, or ... then pop() and add() become relatively more expensive.

Surprisingly to me, "next(iter(s))" was actually slower than .pop() + .add() for a 100-node set in my testing:

$ alias TIMEIT="python -m timeit -s 's = set(range(100))'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.263 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:

$ TIMEIT "for x in s: break"
10000000 loops, best of 3: 0.0943 usec per loop

Now, what about a set with *lots* of collisions? Create 100 keys that all hash to the same bucket:

$ alias TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in range(100)])'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.257 usec per loop
$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

$ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
1000 loops, best of 3: 255 usec per loop

Now something seems terribly wrong here. next(iter(s)) suddenly jumps from being < 0.3 usec to being more than 200 usec, or ~1000x slower. I realize the set above has 900k keys, which is big, but next(iter(s)) should only touch 1 and, in fact, should always return the *first* entry. My best guess is simply that the *first* entry in the internal set table is no longer close to offset 0, which means that next(iter(s)) has to scan a bunch of empty table slots before it finds a non-empty entry.
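To see why the x*1024*1024 keys all collide, here is a sketch based on CPython's open-addressing scheme, where a small int hashes to itself and the initial slot is hash & mask for a power-of-two table mask (the specific mask value below is an assumption for illustration, not CPython's exact table size):

```python
# 1024*1024 == 2**20, so every key has its low 20 bits zero.
# Any table mask smaller than 2**20 therefore sends all of them
# to the same initial slot.
keys = [x * 1024 * 1024 for x in range(100)]

mask = 255  # hypothetical mask for a small power-of-two table
slots = {k & mask for k in keys}
print(slots)  # a single slot -> maximal collisions
```

The same argument holds for any mask below 2**20, which is why the probe sequence, not the initial hash, does all the work for this set.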
Anyway, none of the proposals will ever really be faster than:

for x in s: break

It is a bit of an ugly construct, but there is no attribute lookup, etc. As long as you don't do:

for x in s: pass

it stays nice and fast.

John =:->
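For completeness, the for/break idiom can be hidden inside a helper so the ugliness lives in one place (the name `arbitrary_element` is hypothetical, not from the thread):

```python
def arbitrary_element(s):
    # The for loop grabs the first element the iterator yields and
    # returns immediately, avoiding the builtin lookup and the
    # iterator-protocol overhead of next(iter(s)).
    for x in s:
        return x
    raise KeyError("arbitrary_element(): container is empty")

print(arbitrary_element({42}))  # 42, the only element
```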