[Tutor] Re: Comparing lines in two files, writing result into a third file

pan@uchicago.edu pan@uchicago.edu
Sat Apr 26 13:49:02 2003


Hi Stuart,

Scott's example is indeed excellent. So far we've got 3 different approaches:

1) Scott's dictionary
2) Danny's list comparison
3) Pan's list comprehension

I believe Scott's approach is the fastest one, but I didn't run any timing
tests on it. I actually tried to see if I could modify my code to 'beat'
Danny's in terms of speed. But I failed, miserably, hahaha ... I guess my
approach is the slowest one and can only be used when the data size is < 3000.

In my opinion, Scott's code can still be modified to make it even more 
efficient (or, more eloquent at least):

1) Use the built-in .setdefault() method of a dict for the following action:

    if d.has_key(num): d[num] += 1   # <- increment value, or
    else: d[num] = 1                 # <- create a new key

    I'll leave this for you to figure out.

2) His dictionary 'd' looks like:

   'a':1,
   'b':1,
   'd':2,
   'f':1,
   ...

   By using "d[num] = 1" and "d[num] += 1" it saves the 'counts' as
   the dictionary values.

   It is actually better to save the 'keys' instead of 'counts'
   (d[num] = num + '\n' or + '*\n'):

   'a':'a\n',
   'b':'b\n',
   'd':'d*\n',
   'f':'f\n',
   ...

   The 'dictionary-saving' steps are exactly the same, but you don't 
   need the final checking (if d[num] > 1:) in the third part of his
   code. Instead you just go get the d.values() and that's it. This 
   would reduce the code size significantly.
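
To make point 2 concrete, here is a minimal sketch of the idea. It is not Scott's actual code: the names merge_report and report_lines are made up for illustration, the input is a plain list of lines instead of a file, and "num in d" stands in for d.has_key(num).

```python
def merge_report(first_lines, second_lines):
    """Map each item to the line that should be written to the report.

    Hypothetical helper: items seen more than once get a trailing '*'.
    """
    d = {}
    for line in first_lines:
        num = line.strip()
        d[num] = num + '\n'           # first sighting: plain line
    for line in second_lines:
        num = line.strip()
        if num in d:                  # same test as d.has_key(num)
            d[num] = num + '*\n'      # seen before: mark it
        else:
            d[num] = num + '\n'       # new key
    return d


def report_lines(d):
    """Return the stored lines, ordered by key so the output is stable."""
    return [d[key] for key in sorted(d)]


first = ['a\n', 'b\n', 'd\n']
second = ['d\n', 'f\n']
print(''.join(report_lines(merge_report(first, second))))
```

With these two small lists the dictionary ends up as in the example above ('d' maps to 'd*\n', the others to plain lines), and writing the report is just a matter of writing out the stored values, with no "if d[num] > 1:" pass at the end.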

The other concern is that when loading the first file:

 f1 = file('file1.txt', 'r')
 for num in f1.readlines():
    num = num.strip()       # get rid of any nasty newlines
    d[num] = 1              # and populate
 f1.close()

there's no "if d.has_key(num):" check. That assumes file1.txt contains no
duplicate items. If it does, each repeat is simply overwritten with a count
of 1, so Scott's code will miss them.
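
A small sketch of the difference, assuming the lines of file1.txt have already been read into a list (the function names load_without_check and load_with_check are hypothetical, and "num in d" is used where the code above would use d.has_key(num)):

```python
def load_without_check(lines):
    """Load items the way the snippet above does: every key gets 1."""
    d = {}
    for line in lines:
        num = line.strip()
        d[num] = 1                    # a repeated item is overwritten
    return d


def load_with_check(lines):
    """Load items, counting repeats inside the first file as well."""
    d = {}
    for line in lines:
        num = line.strip()
        if num in d:
            d[num] += 1               # increment value, or
        else:
            d[num] = 1                # create a new key
    return d


sample = ['a\n', 'b\n', 'a\n']        # 'a' appears twice in the same file
print(load_without_check(sample))
print(load_with_check(sample))
```

Only the second version records that 'a' showed up twice, so only there would a later "if d[num] > 1:" test flag it.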

Anyway enough words for now. Enjoy your py diving.

pan



> Message: 6
> Subject: RE: [Tutor] Comparing lines in two files, writing result into a
> third file
> To: Scott Widney <SWidney@ci.las-vegas.nv.us>
> Cc: tutor@python.org
> From: stuart_clemons@us.ibm.com
> Date: Sat, 26 Apr 2003 10:00:30 -0400

> 
> Hi Scott:
> 
> I just wanted to say thanks again.   I was able to spend time breaking down
> the code you provided. (Start with a few lines of code snippet,  add print
> out variables, run code, see exactly what was going on, add more code
> snippet, print out variables, etc.).
> 
> Wow.  Clear, concise and dead-on !  (I'm not worthy !!!).  Extremely
> eloquent in its simplicity. This really clears up the problem I had in the
> past when I tried to read a file into a dictionary.  This structure worked
> perfectly for my immediate problem and I can see that it will work
> perfectly for variations of this merge report that I want to provide.
> 
> This weekend I hope to look at Danny and Pan's approaches as a learning
> exercise.  Danny got me thinking about code efficiency.  I hope to look at
> some Python code I wrote about a year ago (that's remarkably still being
> used) when I last worked with Python.  I'm still a newbie, but I was
> really a newbie then.  I know that that code could be done much more
> efficiently.
> 
> Anyway, enough rambling.  I really feel like I learned a lot just by asking
> one question.  Getting this information (and seeing some success in using
> it) has really got me psyched about Python.  Thanks again. This is a great
> forum.
> 
> - Stuart