[Tutor] first python program to find citeulike duplicates

Thu Nov 20 14:44:04 CET 2008

Suresh Krishna wrote:
> it works (i think), but since this is my very first python program, i 
> would really appreciate feedback on how the program could be improved..

First of all, welcome, new Python programmer,

My main trouble with the program is lack of structure, some lines deal with 
individual lines of the input file, while other deal with complete 
citation-records. I think the program would improve if this is more clearly 
expressed in the code.

One of the reasons that you got the current structure may be due to the use of 
the 'for line in bibfile' loop, which gives you a new line at only one point 
in the program.
Such a for-loop is good for simple line processing (for each line do ...), but 
your program has grown beyond that imho. For this reason I'd like to propose 
to throw out the for-loop, and use a while-loop with three parts instead.

For simplicity, I left out the handling of the records (which we can discuss 
later perhaps)

inbibfile = open(InFileName,'r')

finished = False
while not finished:
     # Read & skip blank lines
     while True:
         line = inbibfile.readline()
         if len(line) == 0:
             finished = True
             break  # end of file detected
         if len(line.rstrip()) > 0:  # Line contains non-white-space
             break
         # else it was a blank line, skip, read next line

     if finished:
         break

     # line contains the first non-whitespace line of a new record
     # read the entire record until the next empty line or EOF
     current_entry = [line]
     while True:
         line = inbibfile.readline()
         if len(line) == 0:
             finished = True
             break  # Found EOF
         if len(line.rstrip()) > 0:
             current_entry.append(line)
         else: # Found an empty line, finished with current record
             break

     # finished reading (empty line or EOF encountered)
     ## Do something with current_entry

     # if not finished, do the next record

inbibfile.close()

You can go further by introducing a few functions:

inbibfile = open(InFileName,'r')

finished = False
while not finished:
     finished, line = read_blank_lines(inbibfile)
     if finished:
         break

     # line contains the first non-whitespace line of a new record
     current_entry = make_new_record(line)
     finished, current_entry = read_record_lines(current_entry, inbibfile)

     # finished reading (empty line or EOF encountered)
     ## Do something with current_entry

inbibfile.close()

and the functions then contain the details of each step.

A few comments about your current program:
> 
> 
> InFileName= "original_library.ris";

In Python, you don't end a line with a ';'.
Also, 'InFileName' is a constant, which is normally given an all-uppercase name.

> OUTDUPBIBFILE=open(OutDupFileName,'w')
> current_entry=[]
> numduplicates=0

One of the most difficult things to do in programming is achieving 
consistency. Above you have 3 global variables, and you use several different 
way of writing their names. You should try to write them the same.
The big advantage of consistent names is that just by looking at a name you 
know what kind of thing it is, which reduces the chance of making errors in 
your program.

(Don't feel bad about it, this seems a very simple and obvious problem but in 
fact it is very hard to achieve, especially for larger programs and projects 
with several people).

 >        if keyvalue not in current_keys: #is a unique entry
>             current_keys.append(keyvalue) #append current key to list of 
> keys
>             current_entry.append(line) #add the blank line to current entry
>             OUTBIBFILE.writelines(current_entry) #write out to new bib 
> file without duplicates
>             current_entry=[] #clear current entry for next one
>             current_keyval=[] #clear current key

Look at the code of each line above, then look at the comment at the same 
line. What does the comment tell you that the code didn't?

 >        continue  #dont write out successive blanks or initial blanks
 >    elif current_entry and line.isspace(): #reached a blank that demarcates 
end of current entry
 >        keyvalue=''.join(current_keyval) #generated a key based on certain 
fields
 >     elif len(line)>2: #not a blank, so more stuff in currrent entry

Now do the same here.

You see the difference?

In the first lines you repeat your code in words. In the second lines, you 
tell what you aim to achieve at that line, ie WHY do you do what you are doing 
rather than just WHAT does the line do (since that is already defined in the 
code).
In general, there is no need to comment each line. You may also assume that 
the reader knows Python (otherwise there is little value for him reading the 
code).

Try to make 'blocks' of code with one or more empty lines in between (see my 
example code), and write a comment what that block as a whole aims to do.

Hope I didn't critize you too much. For a first Python program it is quite nice.

Sincerely,
Albert