[Python-ideas] set.add() return value
Steven D'Aprano
steve at pearwood.info
Fri Feb 13 05:54:12 CET 2009
Ralf W. Grosse-Kunstleve wrote:
>> Python's set uses an unsorted hash table internally and is thus O(1), not O(N).
>
> This is at odds with the results of a simple experiment.
Timing experiments are tricky to get right on multi-processing machines,
which is virtually all PCs these days. For small code snippets, you are
better off using the timeit module rather than re-inventing the wheel.
That will have the very desirable outcome that others reading your code
are dealing with a known and trusted component, rather than having to
work out how you are doing your timing.
> Please try
> the attached script. On my machine I get these results:
>
> lookup repeats: 1000000
[...]
How do you determine that something fits a log N curve from just two
data points? There's an infinite number of curves that pass through two
data points.
It's true that the results you found aren't consistent with O(1), but as
I understand it, Python dicts are O(1) amortized ("on average over the
long term"). Sometimes dicts resize, which is not a constant time
operation, and sometimes the dict has to walk a short linked list, which
depends on the proportion of hashes that lead to a collisions.
But more importantly, I don't think you're necessarily measuring what
you think you're measuring. I see that you include a call to
random.randrange(N) within the timing loop. I don't think there is any
guarantee that randrange(N) will take the same amount of time for any N.
I'm not sure if that is actually the cause of your results, but it is a
potential issue. When timing, you should try to time the barest minimum
of code.
This gives a quick demonstration of constant look-up time for sets:
>>> import timeit
>>> setup = """s = set(range(%(N)d))
... found = range(%(N)d//4, %(N)d//4+10)
... missing = range(%(N)d*2, %(N)d*2+10)
... """ # assumes N is at least 14
>>>
>>> timeit.Timer('for i in found: i in s',
... setup % {'N':1000}).repeat()
[2.0811450481414795, 2.1155159473419189, 2.0662739276885986]
>>> timeit.Timer('for i in found: i in s',
... setup % {'N':10000000}).repeat()
[2.0981149673461914, 2.0697150230407715, 2.0843479633331299]
>>>
>>> timeit.Timer('for i in missing: i in s',
... setup % {'N':1000}).repeat()
[1.5208888053894043, 1.5102288722991943, 1.5023901462554932]
>>> timeit.Timer('for i in missing: i in s',
... setup % {'N':10000000}).repeat()
[1.6430721282958984, 1.6344850063323975, 1.6358041763305664]
Regards,
--
Steven
More information about the Python-ideas
mailing list