[Tutor] Re: Comparing lines in two files, writing result into a third [measuring an algorithm / using profile]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Wed Apr 23 18:03:01 2003


On Wed, 23 Apr 2003 pan@uchicago.edu wrote:

> IMO, Stuart's question can be solved without worrying about the sort
> order. To me the following steps are more straightforward:
>
> 1. load files into 2 lists (a,b)
> 2. mix them together into list c
> 3. build a new list d from list c. When building, check
>    the count of elements in c. If count > 1, then attach *


Hi Pan,


But step three is what ends up being expensive if we're not careful.  In
the code:

> f1=open('c:/py/digest/f1.txt', 'r')
> f2=open('c:/py/digest/f2.txt', 'r')
> a= [x.strip() for x in f1.readlines()]  # a= ['1','3','4', '6']
> b= [x.strip() for x in f2.readlines()]  # b= ['1','2','3','4','5']
> c = [((a+b).count(x)>1) and (x+'*\n') or (x+'\n') for x in a+b]


the cost of constructing 'c' can be problematic.



To count() an element in a list, Python has to scan through the whole
list.  Mathematically speaking, the cost of a single count() call is
proportional to the length of the list.
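
Just to make that concrete, count() behaves more or less like the
following hand-rolled loop.  (This is only a sketch of the idea; the
real implementation lives down in C, but the amount of work is the
same.)

###
>>> def myCount(L, x):
...     """Roughly what L.count(x) has to do: look at every element."""
...     n = 0
...     for element in L:
...         if element == x:
...             n = n + 1
...     return n
...
>>> myCount(['1', '3', '4', '6'], '3')
1
###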


On a list of length 'n', if we run 'count' for every element in our list,
the total cost of doing this ends up being quadratic.  That is,

       |------------------|   n terms: one count(), costing about nC, per element

cost = nC + nC + .... + nC

     = n^2 * C


where C is a constant of proportionality: C stands for how expensive
doing a count() on something like [1] is.  C is actually really darn
minuscule, but it's still not zero.  *grin*  And once 'n' gets large,
the n^2 factor swamps C.



We can test this out empirically using the 'profile' module.  First,
let's set up a small framework:

###
>>> def makeBigList(n):
...     return [0] * n
...
>>> def doCounting(L):          ## slightly simpler than the original
...     return [L.count(x) > 1 for x in L]
...
>>> doCounting(makeBigList(10))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> import profile
>>> profile.run("doCounting(makeBigList(10))")
         4 function calls in 0.020 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <stdin>:1(doCounting)
        1    0.000    0.000    0.000    0.000 <stdin>:1(makeBigList)
        1    0.000    0.000    0.000    0.000 <string>:1(?)
        1    0.020    0.020    0.020    0.020 profile:0(doCounting(makeBigList(10)))
        0    0.000             0.000          profile:0(profiler)
###




What we plan to do is test doCounting for larger and larger lists, and see
what happens to the CPU seconds.


###
>>> test_lists = [makeBigList(n) for n in (1000, 2000, 3000)]
>>> for l in test_lists:
...     profile.run('doCounting(l)')
...
         3 function calls in 0.150 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.150    0.150    0.150    0.150 <stdin>:1(doCounting)
        1    0.000    0.000    0.150    0.150 <string>:1(?)
        1    0.000    0.000    0.150    0.150 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)


         3 function calls in 0.580 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.580    0.580    0.580    0.580 <stdin>:1(doCounting)
        1    0.000    0.000    0.580    0.580 <string>:1(?)
        1    0.000    0.000    0.580    0.580 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)


         3 function calls in 1.290 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.290    1.290    1.290    1.290 <stdin>:1(doCounting)
        1    0.000    0.000    1.290    1.290 <string>:1(?)
        1    0.000    0.000    1.290    1.290 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)
###



Notice that, as we make 'n' grow at a steady pace, the time it takes to
call doCounting() grows faster than linearly.  I ran the numbers for n
in range(1000, 10001, 1000), and got times roughly like these:

n      time (sec)
-----------
1000   0.140
2000   0.570
3000   1.280
4000   2.310
5000   3.600
6000   5.180
7000   7.000
8000   9.040
9000   11.560
10000  14.260
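

Notice how closely those numbers track the quadratic model: doubling n
from 1000 to 2000 roughly quadruples the time (0.140s to 0.570s), and
multiplying n by ten, from 1000 up to 10000, multiplies the time by
roughly a hundred (0.140s to 14.260s).

If you'd like to reproduce a table like that yourself, a little driver
loop will do the job.  Here's a rough sketch that reuses makeBigList()
and doCounting() from above, and uses time.clock() instead of the
profiler, so the absolute numbers won't match exactly, but the shape
of the curve should be the same:

###
import time

def timeCounting(n):
    """Time a single doCounting() call on a list of length n."""
    L = makeBigList(n)
    start = time.clock()
    doCounting(L)
    return time.clock() - start

for n in range(1000, 10001, 1000):
    print n, timeCounting(n)
###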



This might not seem like a big deal: it takes about a second to handle a
3000-line input.  But if the program is meant to handle large amounts of
data, it's worth learning a little more about what's expensive and what
we can do to make our programs run faster.


Anyway, I hope that clears up why we're recommending dictionaries and
merging over the direct approach.
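
Just to sketch what the dictionary approach might look like (this is
only an illustration, with names of my own choosing rather than
Stuart's actual program): count everything in one pass through a+b,
and then make one more pass to tag the duplicates.

###
def markDuplicates(a, b):
    """Tag with '*' every line that shows up more than once in a+b."""
    counts = {}
    for x in a + b:                 # one linear pass to count
        counts[x] = counts.get(x, 0) + 1
    result = []
    for x in a + b:                 # one linear pass to build the output
        if counts[x] > 1:
            result.append(x + '*\n')
        else:
            result.append(x + '\n')
    return result
###

Each dictionary lookup takes (roughly) constant time, so the whole
thing costs something proportional to n rather than n^2.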