how fast is Python code - another detail

Gregor Lingl glingl at
Thu Mar 4 19:43:48 CET 2004

"Quality of design" may - for instance - consist
in deleting 7 characters:

Today I made - by accident - an interesting
observation. The following is a snippet from
a program which computes anagrams from a wordlist:

def find_anagrams(wordlist):
    d = {}
    for word in wordlist:
        sig = anagram_signature(word)
        if sig not in d:
            d[sig] = []     # d.setdefault intentionally not used
        d[sig].append(word)
    return d
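(The post doesn't show anagram_signature; a minimal sketch of the
usual definition, which uses the word's sorted letters as the key,
might look like this:)

```python
def anagram_signature(word):
    # Two words are anagrams exactly when their sorted letters
    # match, so the sorted-letter string serves as a dict key.
    return ''.join(sorted(word.lower()))

# e.g. "listen" and "silent" map to the same signature
print(anagram_signature("listen") == anagram_signature("silent"))
```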

On my machine it computes d
for a wordlist of 10000 items in less than 0.13s
and for a wordlist of 20000 items in less than 0.25s

Today one of my students inadvertently wrote the fifth line as

         if sig not in d.keys():

and now timing of find_anagrams had the following results:

for a wordlist of 10000 items approx. 9.33s, i.e. a factor
of approx. 65 slower

and for a wordlist of 20000 items approx. 60s, i.e. a factor
of approx. 250 slower. (!!!) (Imagine that the full wordlist
in this example has 45000 entries, so an additional factor of
16 applies ...)

I think this is a very astonishing and relevant difference!
I think I understand why it is the case: the running time grows
with the square of len(wordlist), because on every iteration
d.keys() builds a fresh list of all the keys, which then has to
be searched linearly for sig, while `sig not in d` uses the
dictionary's built-in hash lookup, which takes roughly constant
time per test.
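(A rough way to see the difference for yourself. Note this post is
from the Python 2 era, where d.keys() built a list; on Python 3,
d.keys() returns a view with constant-time membership, so list(d)
is used below to reproduce the linear-scan behaviour:)

```python
import timeit

d = dict.fromkeys(range(10000))

# Membership via the dict's hash table: roughly O(1) per lookup.
fast = timeit.timeit("9999 in d", globals={"d": d}, number=1000)

# Membership via a freshly built list of keys: O(n) per lookup,
# which is what `sig not in d.keys()` cost on Python 2.
slow = timeit.timeit("9999 in list(d)", globals={"d": d}, number=1000)

print(slow / fast)  # expect a ratio well above 1
```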

*On the other hand* I thought that those two idioms
are semantically identical. So my question: why does the
Python interpreter not recognize this identity and internally
replace the latter by the former in order to get *much*
better performance?

- would this lead to wrong results in some cases?
- or would it be too difficult to implement?
- or is this considered a problem of minor importance?
- or was it only forgotten to do?
- or what else?
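(One answer to the first question: for plain dicts the two tests do
agree, but on a user-defined mapping type `in` and .keys() can be
overridden independently, so the interpreter cannot safely rewrite
one into the other without knowing the type. A sketch:)

```python
class Weird(dict):
    # Overriding __contains__ changes `x in obj` but leaves
    # obj.keys() untouched, so the two tests can disagree.
    def __contains__(self, key):
        return False

w = Weird(a=1)
print('a' in w)          # uses the overridden __contains__
print('a' in w.keys())   # keys() still reflects the real contents
```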


P.S.: Or was the `key in dict` idiom introduced exactly
in order to solve this problem?
