[Python-Dev] Retrieve an arbitrary element from a setwithoutremoving it

John Arbash Meinel john.arbash.meinel at gmail.com
Fri Nov 6 05:21:05 CET 2009


geremy condra wrote:
> On Thu, Nov 5, 2009 at 4:09 PM, Alexander Belopolsky
> <alexander.belopolsky at gmail.com> wrote:
>> On Thu, Nov 5, 2009 at 3:43 PM, Chris Bergstresser <chris at subtlety.com> wrote:
>>> .. and "x = iter(s).next()" raises a StopIteration
>>> exception.
>> And that's why the documented recipe should probably recommend
>> next(iter(s), default) instead.  Especially because iter(s).next() is
>> not even valid code in 3.0.
> 
> This seems reasonably legible to you? Strikes me as coding by
> incantation. Also, while I've heard people say that the naive
> approach is slower, I'm not getting that result. Here's my test:
> 
> 
>>>> smrt = timeit.Timer("next(iter(s))", "s=set(range(100))")
>>>> smrt.repeat(10)
> [1.2845709323883057, 0.60247397422790527, 0.59621405601501465,
> 0.59133195877075195, 0.58387589454650879, 0.56839084625244141,
> 0.56839680671691895, 0.56877803802490234, 0.56905913352966309,
> 0.56846404075622559]
> 
>>>> naive = timeit.Timer("x=s.pop();s.add(x)", "s=set(range(100))")
>>>> naive.repeat(10)
> [0.93139314651489258, 0.53566789627075195, 0.53674602508544922,
> 0.53608798980712891, 0.53634309768676758, 0.53557991981506348,
> 0.53578495979309082, 0.53666114807128906, 0.53576493263244629,
> 0.53491711616516113]
> 
> 
> Perhaps my test is flawed in some way?
> 
> Geremy Condra

Well, it also uses a fairly trivial set. 'set(range(100))' is going to
generally have 0 collisions and everything will hash to a unique bucket.
 As such, pop ing an item out of the hash is a single "val = table[int &
mask]; table[int & mask] = _dummy", and then looking it up again
requires 2 table lookups (one finds the _dummy, the next finds that the
chain is broken and can rewrite the _dummy.)

However, if a set is more full, or has more collisions, or ... then
pop() and add() become relatively more expensive.

Surprising to me, is that "next(iter(s))" was actually slower than
.pop() + .add() for 100 node set in my testing:

$ alias TIMEIT="python -m timeit -s 's = set(range(100)'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.263 usec per loop

$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.217 usec per loop

though both are much slower than the fastest we've found:
$ TIMEIT "for x in s: break"
10000000 loops, best of 3: 0.0943 usec per loop


now, what about a set with *lots* of collisions. Create 100 keys that
all hash to the same bucket:
aliase TIMEIT="python -m timeit -s 's = set([x*1024*1024 for x in
range(100)])'"
$ TIMEIT "x = next(iter(s))"
1000000 loops, best of 3: 0.257 usec per loop

$ TIMEIT "x = s.pop(); s.add(x)"
1000000 loops, best of 3: 0.218 usec per loop

I tried a few different ways, and I got the same results, until I did:

$ python -m timeit -s "s = set(range(100000, 1000100))" "next(iter(s))"
1000 loops, best of 3: 255 usec per loop

^^^^ Now something seems terribly wrong here. next(iter(s)) suddenly
jumps up from being < 0.3 us, to being more than 200us. Or ~1000x slower.

I realize the above has 900k keys, which is big. But 'next(iter(s))'
should only touch 1, and, in fact, should always return the *first*
entry. My best guess is just that the *first* entry in the internal set
table is no longer close to offset 0, which means that 'next(iter(s))'
has to evaluate a bunch of table slots before it finds a non-empty entry.


Anyway, none of the proposals will really ever be faster than:
  for x in s: break

It is a bit ugly of a construct, but you don't have an attribute lookup,
etc. As long as you don't do:
  for x in s: pass

Then it stays nice and fast.

John
=:->


More information about the Python-Dev mailing list