[Python-ideas] set.add() return value

Steven D'Aprano steve at pearwood.info
Fri Feb 13 05:54:12 CET 2009

Ralf W. Grosse-Kunstleve wrote:

>> Python's set uses an unsorted hash table internally and is thus O(1), not O(N).
> This is at odds with the results of a simple experiment.

Timing experiments are tricky to get right on multitasking machines, 
which is virtually all PCs these days. For small code snippets, you are 
better off using the timeit module rather than re-inventing the wheel. 
That will have the very desirable outcome that others reading your code 
are dealing with a known and trusted component, rather than having to 
work out how you are doing your timing.
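
As a minimal sketch of what that looks like (the set size, the key looked up and the repeat counts are illustrative choices, not taken from the attached script):

```python
import timeit

# Hypothetical snippet: time membership tests against a small set.
setup = "s = set(range(1000))"
stmt = "500 in s"

# repeat() returns one total time in seconds per run; each run executes
# stmt `number` times. The minimum is usually the least noisy figure.
times = timeit.repeat(stmt, setup=setup, repeat=3, number=100000)
best = min(times)
print("best of 3: %.4f s per 100000 lookups" % best)
```

Taking the minimum rather than the mean is the usual advice, since other processes can only make a run slower, never faster.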

> Please try the attached script. On my machine I get these results:
> lookup repeats: 1000000

How do you determine that something fits a log N curve from just two 
data points? There's an infinite number of curves that pass through two 
data points.
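
To make that concrete, here is a small sketch that fits three different growth models exactly through the same two points. The two measurements are made up for illustration and are not the figures from the thread; the helper name fit is likewise hypothetical.

```python
import math

# Two hypothetical measurements (N, seconds); not taken from the thread.
(n1, t1), (n2, t2) = (1000, 1.0), (1000000, 2.0)

def fit(f):
    """Return t(N) = a*f(N) + b passing exactly through both points."""
    a = (t2 - t1) / (f(n2) - f(n1))
    b = t1 - a * f(n1)
    return lambda n: a * f(n) + b

log_model = fit(math.log)      # t grows like log N
sqrt_model = fit(math.sqrt)    # t grows like sqrt N
linear_model = fit(float)      # t grows like N

for model in (log_model, sqrt_model, linear_model):
    # Every model reproduces the two measurements...
    assert abs(model(n1) - t1) < 1e-9 and abs(model(n2) - t2) < 1e-9

# ...but they disagree everywhere else, for example at N = 10**9.
print([round(m(10**9), 2) for m in (log_model, sqrt_model, linear_model)])
```

Only a third or fourth measurement at a different N can start to tell these models apart.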

It's true that the results you found aren't consistent with O(1), but as 
I understand it, Python dicts are O(1) amortized ("on average over the 
long term"). Sometimes dicts resize, which is not a constant-time 
operation, and sometimes a lookup has to probe more than one slot, which 
depends on the proportion of hashes that collide.
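
The occasional resize can be observed indirectly. The sketch below assumes CPython, where sys.getsizeof() on a set reflects the size of its hash table; other implementations may report something different. It uses a set rather than a dict since that is the type under discussion.

```python
import sys

s = set()
sizes = []
for i in range(10000):
    s.add(i)
    sizes.append(sys.getsizeof(s))  # CPython-specific: includes the table

# Count how many insertions changed the reported size, i.e. how many
# times the table was reallocated while the set grew.
resizes = sum(1 for a, b in zip(sizes, sizes[1:]) if a != b)
print("10000 insertions, %d resizes" % resizes)
```

Only a handful of the insertions pay for a resize, which is why the cost averages out to constant time per operation.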

But more importantly, I don't think you're necessarily measuring what 
you think you're measuring. I see that you include a call to 
random.randrange(N) within the timing loop. I don't think there is any 
guarantee that randrange(N) will take the same amount of time for any N. 
I'm not sure if that is actually the cause of your results, but it is a 
potential issue. When timing, you should try to time the barest minimum 
of code.
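
One way to keep randrange() out of the measurement, sketched here with illustrative values for N and the number of keys, is to generate the random keys in the setup string so that only the lookups are timed:

```python
import timeit

# Hypothetical setup: N and the number of keys are illustrative choices.
setup = """
import random
N = %d
s = set(range(N))
keys = [random.randrange(N) for _ in range(1000)]  # built once, untimed
"""
stmt = "for k in keys: k in s"

results = {}
for n in (1000, 1000000):
    # Only stmt is timed; the setup string runs before each repeat.
    results[n] = min(timeit.repeat(stmt, setup=setup % n,
                                   repeat=3, number=1000))
    print("N=%d: best %.4f s" % (n, results[n]))
```

The loop overhead is still included, but it is the same for every N, so it does not distort the comparison.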

This gives a quick demonstration of constant look-up time for sets:

 >>> import timeit
 >>> setup = """s = set(range(%(N)d))
... found = range(%(N)d//4, %(N)d//4+10)
... missing = range(%(N)d*2, %(N)d*2+10)
... """  # assumes N is at least 14
 >>> timeit.Timer('for i in found: i in s',
... setup % {'N':1000}).repeat()
[2.0811450481414795, 2.1155159473419189, 2.0662739276885986]
 >>> timeit.Timer('for i in found: i in s',
... setup % {'N':10000000}).repeat()
[2.0981149673461914, 2.0697150230407715, 2.0843479633331299]
 >>> timeit.Timer('for i in missing: i in s',
... setup % {'N':1000}).repeat()
[1.5208888053894043, 1.5102288722991943, 1.5023901462554932]
 >>> timeit.Timer('for i in missing: i in s',
... setup % {'N':10000000}).repeat()
[1.6430721282958984, 1.6344850063323975, 1.6358041763305664]


