[Tutor] List processing

Thu Jun 2 00:33:47 CEST 2005

On 1 Jun 2005 cgw501 at york.ac.uk wrote:

> eYAL001C1	Spar	81	3419	4518	4519	2	1	
> 
> So basically it's a table, separated with tabs. What I need to do is make
> a new file where all the entries in the table are those where the values
> in columns 1 and 5 were present as a pair more than once in the original
> file.

This is half-baked, but I toss it out in case anyone can build on it.

Create a dictionary, keyed on column 1.  Read a line and split it into 
the columns.  For each line, create a dictionary entry that is a 
dictionary keyed by column 5, whose entry is a list of lists, the inner 
list of which contains columns 2, 3, 4 and 6.  When a dupe is found, add 
an additional inner list.

So, upon processing the sample line above, you have a dictionary d:

{'eYAL001C1': {'4518': [['Spar', '81', '3419', '4519']]}}

As you process each new line, one of three things is true:

 1) Col 1 is used as a key, but col 5 is not used as an inner key;
 2) Col 1 is used as a key, and col 5 is used as an inner key;
 3) Col 1 is not used as a key.

So, for each new line:

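 if col1 in d:
     if col5 in d[col1]:
         # pair seen before: add another inner list
         d[col1][col5].append([col2, col3, col4, col6])
     else:
         # col1 seen before, but not with this col5
         d[col1][col5] = [[col2, col3, col4, col6]]
 else:
     # first time col1 is seen
     d[col1] = {col5: [[col2, col3, col4, col6]]}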
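
For what it's worth, here is a minimal sketch of that loop wrapped in a function. It assumes the fields are tab-separated and that any line with fewer than six fields can be skipped. The name build_index and the use of setdefault are my own choices for illustration; setdefault just folds the three cases into one statement.

```python
def build_index(lines):
    """Group rows by column 1, then by column 5 (hypothetical helper)."""
    d = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # assumption: blank or short lines are ignored
        col1, col2, col3, col4, col5, col6 = cols[:6]
        # setdefault creates the missing outer or inner entry on first use,
        # so all three cases end in the same append
        d.setdefault(col1, {}).setdefault(col5, []).append(
            [col2, col3, col4, col6])
    return d


# the sample line from the question, with its trailing tab
sample = ["eYAL001C1\tSpar\t81\t3419\t4518\t4519\t2\t1\t\n"]
d = build_index(sample)
```

With a real file you could pass the open file object itself as lines, since iterating over it yields one line at a time.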

The end result is that you'll have all your data from the file in the form 
of a dictionary indexed by column 1.  Each entry in the top-level 
dictionary is a second-level dictionary indexed by column 5.  Each entry 
in that second-level dictionary is a list of lists, and each list in that 
list of lists is columns 2, 3, 4 and 6.

If the list of lists has a length of 1, then the col1/col5 combo appears 
only once in the input file.  But if it has a length > 1, the combo 
occurred more than once, and satisfies your condition of "columns 1 and 5 
were present as a pair more than once".

So to get at these:

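 for key1 in d:
     for key2 in d[key1]:
         # more than one inner list means the pair was repeated
         if len(d[key1][key2]) > 1:
             for row in d[key1][key2]:
                 # columns in their original order: 1, 2, 3, 4, 5, 6
                 print(key1, row[0], row[1], row[2], key2, row[3])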
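
Since the goal was a new file rather than printed output, here is a rough sketch of the same loop collecting tab-joined lines instead. The function name duplicated_rows and the sample data are made up for illustration. Note that only columns 1 through 6 come back out, because that is all the inner lists store.

```python
def duplicated_rows(d):
    """Return tab-joined rows whose col1/col5 pair occurs more than once."""
    out = []
    for key1 in d:
        for key2, rows in d[key1].items():
            if len(rows) > 1:
                for row in rows:
                    # restore the original column order: 1, 2, 3, 4, 5, 6
                    out.append("\t".join(
                        [key1, row[0], row[1], row[2], key2, row[3]]))
    return out


# made-up data, for illustration only
d = {
    "geneA": {"10": [["Spar", "1", "2", "3"], ["Smik", "4", "5", "6"]],
              "20": [["Spar", "7", "8", "9"]]},
    "geneB": {"10": [["Spar", "1", "2", "3"]]},
}
result = duplicated_rows(d)
```

Here result holds only the two geneA/10 rows. Writing it out is then a matter of opening a file for writing and writing "\n".join(result) followed by a final newline.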

I haven't tested this approach (or syntax) but I think the approach is 
basically sound.