[Python-ideas] set.add() return value
Ralf W. Grosse-Kunstleve
rwgk at yahoo.com
Fri Feb 13 04:36:35 CET 2009
> > Sorry, I was thinking the .add() is happening inside a loop with N iterations,
> > with N also being the eventual size of the set. Is O(N log N) correct then?
> >
> > http://en.wikipedia.org/wiki/Big_O_notation says:
> >
> > O(log N) example: finding an item in a sorted array with a binary search.
> >
> > Isn't that what set is doing?
>
> Python's set uses an unsorted hash table internally and is thus O(1), not O(N).
This is at odds with the results of a simple experiment. Please try
the attached script. On my machine I get these results:
lookup repeats: 1000000
timing set lookup
N: 10000
overhead: 1.16
lookup contained: 0.07
lookup missing: 0.12
timing set lookup
N: 10000000
overhead: 1.13
lookup contained: 0.42
lookup missing: 0.34
timing dict lookup
N: 10000
overhead: 1.16
lookup contained: 0.06
lookup missing: 0.11
timing dict lookup
N: 10000000
overhead: 1.12
lookup contained: 0.44
lookup missing: 0.44
N is the size of a set or dict. The number of lookups of random elements
is the same in both cases. That's why the "overhead" is constant.
Of course, the memory cache does have an influence here, but
your claim is also at odds with anything I could find in the
literature. A dictionary or set has to do some kind of search to
find a key or element. Dictionary and set lookups have O(log N)
runtime complexity, where N is the size of the dict or set at
the time of the lookup.
To come back to my original suggestion, Raymond's reply...
> * because of caching the second lookup is very cheap.
makes me think that this...
> if (1 not in s):
> s.add(1)
> do_something_else()
is basically as fast as I was hoping to get from:
> if (s.add(1)):
> do_something_else()
So I guess that kills my runtime argument.
> * the bdfl frowns on mutating methods returning anything at all
Shock! What about dict.setdefault()?
Honestly, I found it very odd from the start that set.add() doesn't
tell me what it did.
import random
import time
import os
import sys
repeats = 1000000
print "lookup repeats:", repeats
for now_with_dict in [False, True]:
random.seed(0)
for N in [10000, 10000000]:
if (not now_with_dict):
print "timing set lookup"
s = set(xrange(N))
else:
print "timing dict lookup"
s = {}
for i in xrange(N):
s[i] = 0
print " N:", len(s)
t0 = os.times()[0]
for i in xrange(repeats):
j = random.randrange(N)
assert j < N
oh = os.times()[0]-t0
print " overhead: %.2f" % oh
t0 = os.times()[0]
for i in xrange(repeats):
j = random.randrange(N)
assert j in s
print " lookup contained: %.2f" % (os.times()[0]-t0-oh)
t0 = os.times()[0]
for i in xrange(repeats):
j = N + random.randrange(N)
assert j not in s
print " lookup missing: %.2f" % (os.times()[0]-t0-oh)
More information about the Python-ideas
mailing list