[Tutor] Re: Comparing lines in two files, writing result into a third [measuring an algorithm / using profile]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Wed Apr 23 18:03:01 2003
On Wed, 23 Apr 2003 pan@uchicago.edu wrote:
> IMO, Stuart's question can be solved without worrying about the sort
> order. To me the following steps are more straightforward:
>
> 1. load files into 2 lists (a,b)
> 2. mix them together into list c
> 3. build a new list d from list c. When building, check
> the count of elements in c. If count > 1, then attach *
Hi Pan,
But step three is what ends up being expensive if we're not careful. In
the code:
> f1=open('c:/py/digest/f1.txt', 'r')
> f2=open('c:/py/digest/f2.txt', 'r')
> a= [x.strip() for x in f1.readlines()] # a= ['1','3','4', '6']
> b= [x.strip() for x in f2.readlines()] # b= ['1','2','3','4','5']
> c = [((a+b).count(x)>1) and (x+'*\n') or (x+'\n') for x in a+b]
the cost of constructing 'c' can be problematic.
To count() an element in a list, we have to scan through the whole list.
Mathematically speaking, the cost of a single count() is proportional to
the length of the list.
On a list of length 'n', if we run 'count' for every element in our list,
the total cost of doing this ends up being quadratic. That is,
    cost = nC + nC + ... + nC      ('nC' appears 'n' times, once per count())
         = n^2 * C
where C is a constant of proportionality: it stands for how expensive a
single count() on something like [1] is. C is actually really darn
minuscule, but it's still not zero. *grin* Once 'n' gets large, that n^2
factor swamps C.
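By the way, one way to see those element comparisons directly, rather than
inferring them from timings, is to wrap each element in a tiny class whose
'==' keeps a tally. Here's a rough sketch of that idea (not part of the
original code; the class and the numbers are just for illustration):
###
## tally how many times '==' fires while doCounting() runs
class CountingItem:
    comparisons = 0                      ## shared tally across all instances
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        CountingItem.comparisons += 1
        return self.value == other.value

def doCounting(L):
    return [L.count(x) > 1 for x in L]

for n in (10, 100, 1000):
    CountingItem.comparisons = 0
    doCounting([CountingItem(i) for i in range(n)])
    ## the tally grows roughly like n*n; it can come out a bit under that,
    ## since Python may skip comparing an object against itself
    print("%5d elements -> %8d comparisons" % (n, CountingItem.comparisons))
###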
We can test this out empirically using the 'profile' module. First,
let's set up a small framework:
###
>>> def makeBigList(n):
...     return [0] * n
...
>>> def doCounting(L):       ## slightly simpler than the original
...     return [L.count(x) > 1 for x in L]
...
>>> doCounting(makeBigList(10))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> import profile
>>> profile.run("doCounting(makeBigList(10))")
         4 function calls in 0.020 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <stdin>:1(doCounting)
        1    0.000    0.000    0.000    0.000 <stdin>:1(makeBigList)
        1    0.000    0.000    0.000    0.000 <string>:1(?)
        1    0.020    0.020    0.020    0.020 profile:0(doCounting(makeBigList(10)))
        0    0.000             0.000          profile:0(profiler)
###
What we plan to do is test doCounting for larger and larger lists, and see
what happens to the CPU seconds.
###
>>> test_lists = [makeBigList(n) for n in (1000, 2000, 3000)]
>>> for l in test_lists:
...     profile.run('doCounting(l)')
...
         3 function calls in 0.150 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.150    0.150    0.150    0.150 <stdin>:1(doCounting)
        1    0.000    0.000    0.150    0.150 <string>:1(?)
        1    0.000    0.000    0.150    0.150 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)


         3 function calls in 0.580 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.580    0.580    0.580    0.580 <stdin>:1(doCounting)
        1    0.000    0.000    0.580    0.580 <string>:1(?)
        1    0.000    0.000    0.580    0.580 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)


         3 function calls in 1.290 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.290    1.290    1.290    1.290 <stdin>:1(doCounting)
        1    0.000    0.000    1.290    1.290 <string>:1(?)
        1    0.000    0.000    1.290    1.290 profile:0(doCounting(l))
        0    0.000             0.000          profile:0(profiler)
###
Notice that, as we make 'n' grow at a steady pace, the time it takes to
call doCounting() grows faster than linearly. I ran the numbers for n over
range(1000, 10001, 1000), and got approximately these times:
        n    time (sec)
    ---------------------
     1000        0.140
     2000        0.570
     3000        1.280
     4000        2.310
     5000        3.600
     6000        5.180
     7000        7.000
     8000        9.040
     9000       11.560
    10000       14.260
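As a quick sanity check that these timings really follow an n^2 curve (and
not something milder), we can divide each time by n^2 and watch the ratio
stay roughly constant. This little snippet isn't from the original session;
it just replays the measurements above:
###
## if time really is C * n^2, then time / n^2 should be roughly constant
ns    = range(1000, 10001, 1000)
times = [0.140, 0.570, 1.280, 2.310, 3.600, 5.180,
         7.000, 9.040, 11.560, 14.260]
for n, t in zip(ns, times):
    print("n = %5d    time/n^2 = %.9f" % (n, t / float(n * n)))
###
For these numbers the ratio hovers around 1.4e-07 seconds, which is that
constant C from before.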
This might not seem like a big deal: it takes only about a second to handle
a 3000-line input. But if the program is meant to handle large amounts of
data, it's worth learning a little more about what's expensive and what we
can do to make our programs run faster.
Anyway, I hope that clears up why we're recommending dictionaries and
merging over the direct approach.
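In case a concrete picture helps, here's a rough sketch of what the
dictionary version of Pan's code might look like. (This isn't from the
original thread: the input file names are just the ones in Pan's example,
and 'f3.txt' is a made-up name for the output file.)
###
f1 = open('c:/py/digest/f1.txt', 'r')
f2 = open('c:/py/digest/f2.txt', 'r')
a = [x.strip() for x in f1.readlines()]
b = [x.strip() for x in f2.readlines()]

counts = {}                              ## line -> how many times we've seen it
for line in a + b:
    counts[line] = counts.get(line, 0) + 1

out = open('c:/py/digest/f3.txt', 'w')   ## hypothetical output file
for line in a + b:
    if counts[line] > 1:                 ## one dictionary lookup, not a list scan
        out.write(line + '*\n')
    else:
        out.write(line + '\n')
out.close()
###
Each line now costs one dictionary lookup instead of a scan over the whole
combined list, so the total work grows linearly with the number of lines
instead of quadratically.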