Ralf W. Grosse-Kunstleve wrote:
Python's set uses an unsorted hash table internally and is thus O(1), not O(N).
This is at odds with the results of a simple experiment.
Timing experiments are tricky to get right on multi-processing machines, which is virtually all PCs these days. For small code snippets, you are better off using the timeit module rather than re-inventing the wheel. That will have the very desirable outcome that others reading your code are dealing with a known and trusted component, rather than having to work out how you are doing your timing.
Please try the attached script. On my machine I get these results:
lookup repeats: 1000000 [...]
How do you determine that something fits a log N curve from just two data points? There's an infinite number of curves that pass through two data points. It's true that the results you found aren't consistent with O(1), but as I understand it, Python dicts are O(1) amortized ("on average over the long term"). Sometimes dicts resize, which is not a constant time operation, and sometimes the dict has to walk a short linked list, which depends on the proportion of hashes that lead to a collisions. But more importantly, I don't think you're necessarily measuring what you think you're measuring. I see that you include a call to random.randrange(N) within the timing loop. I don't think there is any guarantee that randrange(N) will take the same amount of time for any N. I'm not sure if that is actually the cause of your results, but it is a potential issue. When timing, you should try to time the barest minimum of code. This gives a quick demonstration of constant look-up time for sets:
import timeit setup = """s = set(range(%(N)d)) ... found = range(%(N)d//4, %(N)d//4+10) ... missing = range(%(N)d*2, %(N)d*2+10) ... """ # assumes N is at least 14
timeit.Timer('for i in found: i in s', ... setup % {'N':1000}).repeat() [2.0811450481414795, 2.1155159473419189, 2.0662739276885986] timeit.Timer('for i in found: i in s', ... setup % {'N':10000000}).repeat() [2.0981149673461914, 2.0697150230407715, 2.0843479633331299]
timeit.Timer('for i in missing: i in s', ... setup % {'N':1000}).repeat() [1.5208888053894043, 1.5102288722991943, 1.5023901462554932] timeit.Timer('for i in missing: i in s', ... setup % {'N':10000000}).repeat() [1.6430721282958984, 1.6344850063323975, 1.6358041763305664]
Regards, -- Steven