[Tutor] Testing if a number occurs more than once [profiling a function using profile.run()]

Mon Dec 2 17:10:03 2002

On Mon, 2 Dec 2002, Yann Le Du wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 2 Dec 2002, Michael Williams wrote:
>
> > as not only convoluted, but slow (this is the dominant loop of the
> > program). There must be a better way! Any suggestions?
>
> try :
>
> def morethanone(l):
>         for n in l:
>                 if l.count(n)>1:
>                         return 1
>         return 0

This is an expensive way to do the uniqueness check, since l.count()
itself will be doing a loop to run through every element in l.  There are
really two nested loops that Python is running through.

If we use morethanone() on long lists, in a worst-case scenario, the time
it takes to check for uniqueness will be proportional to the number of
elements, squared.  It's squared because we can roughly estimate the cost
of l.count()ing to be len(l), and we're doing the l.count()ing len(l)
times, so the total cost is roughly len(l)**2.

We can test this performance empirically by using the nice 'profile'
module:

    http://www.python.org/doc/lib/profile-instant.html

For example:

###
>>> import profile
>>> def morethanone(l):
...     for n in l:
...         if l.count(n) > 1:
...             return 1
...     return 0
...
>>> l1, l2, l3 = range(1000), range(10000), range(100000)
>>> profile.run('morethanone(l1)')
         3 function calls in 0.180 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.140    0.140    0.140    0.140 <stdin>:1(morethanone)
        1    0.000    0.000    0.140    0.140 <string>:1(?)
        1    0.040    0.040    0.180    0.180 profile:0(morethanone(l1))
        0    0.000             0.000          profile:0(profiler)
>>> profile.run('morethanone(l2)')
         3 function calls in 15.630 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   15.630   15.630   15.630   15.630 <stdin>:1(morethanone)
        1    0.000    0.000   15.630   15.630 <string>:1(?)
        1    0.000    0.000   15.630   15.630 profile:0(morethanone(l2))
        0    0.000             0.000          profile:0(profiler)

>>> profile.run('morethanone(l3)')
# ... time passes
###

Actually, 'morethanone(l3)' was taking way way too long, so I cancelled
it.  But already we can see the difference between the first two calls of
morethanone(l1) and morethanone(l2) --- the second call takes about 100
times as long to finish as the first, and we'd expect morethanone(l3) to
take about 0.180 seconds * (100)*2 = 1800 seconds = 30 minutes.  Ouch.

A better approach could be to first sort() the list of numbers, and run
through adjacent pairs of elements.  The duplicates, if they exist, will
clump up together during the sort, and will be easy to pick up when we
compare adjacent elements.

###
>>> def morethanone2(l):
...     l.sort()
...     for i in range(1, len(l)):
...          if l[i-1] == len[i]:
...              return 1
...     return 0
...
>>> def morethanone2(l):
...     l.sort()
...     for i in range(1, len(l)):
...          if l[i-1] == l[i]:
...              return 1
...     return 0
...
>>> profile.run('morethanone2(l1)')
         3 function calls in 0.000 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <stdin>:1(morethanone2)
        1    0.000    0.000    0.000    0.000 <string>:1(?)
        1    0.000    0.000    0.000    0.000 profile:0(morethanone2(l1))
        0    0.000             0.000          profile:0(profiler)

>>> profile.run('morethanone2(l2)')
         3 function calls in 0.010 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    0.010    0.010 <stdin>:1(morethanone2)
        1    0.000    0.000    0.010    0.010 <string>:1(?)
        1    0.000    0.000    0.010    0.010 profile:0(morethanone2(l2))
        0    0.000             0.000          profile:0(profiler)

>>> profile.run('morethanone2(l3)')
         3 function calls in 0.160 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.160    0.160    0.160    0.160 <stdin>:1(morethanone2)
        1    0.000    0.000    0.160    0.160 <string>:1(?)
        1    0.000    0.000    0.160    0.160 profile:0(morethanone2(l3))
        0    0.000             0.000          profile:0(profiler)
###

Here, we can see that the time taken as we bump up the size of the input
grows a little faster than linear:

    size of input      time it takes
    ------------       -------------
       1000                0.0
       10000               0.010
       100000              0.160

which makes sense: sort()ing isn't too expensive, and we just need to do a
single scan across the sorted list to find duplicates.

Does anyone want to check how Michael's original morethanone() function
performs?

Good luck to you!