[Tutor] List processing

Thu Jun 2 04:47:22 CEST 2005

cgw501 at york.ac.uk wrote:
> Hi,
> 
> I have a load of files I need to process. Each line of a file looks 
> something like this:
> 
> eYAL001C1	Spar	81	3419	4518	4519	2	1	
> 
> So basically it's a table, separated with tabs. What I need to do is make a 
> new file where all the entries in the table are those where the values in 
> columns 1 and 5 were present as a pair more than once in the original file.
> 
> I really have very little idea how to achieve this. So far I read in the 
> file to a list, where each item in the list is a list of the entries on a 
> line.

I would do this with two passes over the data. The first pass would accumulate lines and count pairs 
of (col1, col5); the second pass would output the lines whose count is > 1. Something like this 
(untested):

lines = []
counts = {}

# Build a list of split lines and count the (col1, col5) pairs
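with open('input.txt') as infile:
    for line in infile:
        # split() breaks on any whitespace, tabs included; empty fields are dropped,
        # so use line.rstrip('\n').split('\t') if a column can be empty
        line = line.split()
        key = (line[1], line[5])  # or (line[0], line[4]) depending on what you mean by col 1
        counts[key] = counts.get(key, 0) + 1  # count the key pair
        lines.append(line)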

# Output the lines whose pairs appear more than once
with open('output.txt', 'w') as f:
    for line in lines:
        if counts[(line[1], line[5])] > 1:
            f.write('\t'.join(line) + '\n')
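
The same two-pass idea can also be sketched as a function that takes any iterable of lines, which makes it easy to try without touching files. This is only a sketch under some assumptions: the name filter_repeated_pairs is made up for illustration, the default indices 0 and 4 assume "columns 1 and 5" are counted from 1, and the second and third sample rows are invented test data (only the first comes from the original post).

```python
from collections import Counter

def filter_repeated_pairs(rows, col_a=0, col_b=4):
    """Return the split rows whose (col_a, col_b) pair occurs more than once.

    col_a and col_b are 0-based indices; the defaults are an assumption
    about what "columns 1 and 5" means.
    """
    # Pass 1: split each line on tabs and count the key pairs
    split_rows = [row.rstrip('\n').split('\t') for row in rows]
    counts = Counter((r[col_a], r[col_b]) for r in split_rows)
    # Pass 2: keep only the rows whose pair was seen more than once
    return [r for r in split_rows if counts[(r[col_a], r[col_b])] > 1]

# First row is from the original post; the other two are invented
sample = [
    "eYAL001C1\tSpar\t81\t3419\t4518\t4519\t2\t1\n",
    "eYAL001C1\tSmik\t79\t3400\t4518\t4520\t2\t1\n",
    "eYAL002W\tSpar\t90\t100\t200\t201\t1\t1\n",
]
kept = filter_repeated_pairs(sample)
for r in kept:
    print('\t'.join(r))
```

To use it on a real file, pass the open file object in place of sample, since iterating over a file yields its lines.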

Kent