[Tutor] Testing if a number occurs more than once

Alfred Milgrom fredm@smartypantsco.com
Tue Dec 3 19:09:01 2002


Although I cannot answer Michael's New Scientist problem, here is some more 
on the fastest solution for finding duplicates.

The function given by Danny Yoo appears to have been the fastest test 
proposed so far:

def morethanone2(l):
     l.sort()
     for i in range(1, len(l)):
          if l[i-1] == l[i]:
              return 1
     return 0

 >>> list = range(100000)
 >>> profile.run('morethanone2(list)')
          3 function calls in 0.173 CPU seconds

However, if pure speed is the issue, the following is faster :

def morethanone3(l):
     d={}
     for x in l : d[x] = 1
     return (len(d) != len(l))

 >>> list = range(100000)
 >>> profile.run('morethanone3(list)')
          3 function calls in 0.097 CPU seconds

The result is even more impressive if one uses a shuffled list, since the 
example above using a sorted list means that sort() does next to no work:

 >>> list = range(100000)
 >>> random.shuffle(list)
 >>> profile.run('morethanone2(list)')
          3 function calls in 0.532 CPU seconds


 >>> list = range(100000)
 >>> random.shuffle(list)
 >>> profile.run('morethanone3(list)')
          3 function calls in 0.162 CPU seconds

This new function of course does not tell you where the duplication occurs. 
If this is required, you can use the following function, which is a bit 
slower but still faster than Danny's original function:

def morethanone4(l):
     d={}
     for x in l : d[x] = d.setdefault(x,0)+1
     return (len(d) != len(l))

 >>> list = range(100000)
 >>> random.shuffle(list)
 >>> profile.run('morethanone4(list)')
          3 function calls in 0.293 CPU seconds

Finally, if speed of finding what or where the duplicate is an issue, here 
is a variation on Danny's function using List Comprehension which is about 
the same speed as the original function, but potentially gives more 
information:

def morethanone5(l) :
     l.sort()
     x=[l[i] for i in range(len(l)-1) if l[i]==l[i+1]]
     return len(x)

 >>> list = range(100000)
 >>> random.shuffle(list)
 >>> profile.run('morethanone5(list)')
          3 function calls in 0.594 CPU seconds

This last function gives you access to a list with all the duplicates in it 
(if x is made a global variable).

Best regards,
Fred Milgrom