[Tutor] Please help matching elements from two lists and printing them

Jeff Shannon jeff at ccvcorp.com
Thu Dec 9 20:46:24 CET 2004


kumar s wrote:

> On top of this this process is VERY SLOW on high end
> server too. 

That's because, for each line in spot_cor, you're examining every item 
in spot_int, and if there's a match, you examine every element in 
spot_int again!  With n lines in spot_cor and m lines in spot_int, 
that's O(n*m) comparisons, plus up to another m for every match -- it 
still simplifies to O(n*m), which is quadratic time.  As the length of 
your lists grows, your performance will degrade *very* rapidly.

Here's a simplified version of your code.  I've changed the for loops 
so that they iterate over the lists directly, instead of iterating 
over a range and then using that number to index into the list -- the 
result is the same, but this way does less work and is easier to read.

     out = open('sa_int_2.txt','w')
     for cor_ele in spot_cor:
         for int_ele in spot_int:
             cols = int_ele.split('\t')
             y = cols[0] + '\t' + cols[1]
             if cor_ele == y:
                 for int_ele2 in spot_int:
                     if y in int_ele2:
                         out.write(int_ele2 + '\n')


Remember, every time you use 'in' on a list, Python has to traverse 
the list (in the worst case, all of it) to answer the question.  
Nesting statements that use 'in' multiplies that work, so the total 
grows polynomially with the size of your lists.
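
To see the difference concretely, here's a quick sketch (big_list and 
big_dict are made-up names, just for illustration) comparing a 
membership test on a list with one on a dictionary:

     # 'in' on a list scans element by element -- O(n) in the worst case
     big_list = list(range(1000000))
     # 'in' on a dictionary is a single hash lookup -- O(1) on average
     big_dict = dict.fromkeys(big_list)

     print(999999 in big_list)    # slow: walks nearly the whole list
     print(999999 in big_dict)    # fast: one hash probe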

Now, here's how I'd rewrite your code.

     out = open('sa_int_2.txt','w')
     element_dict = {}
     # build a lookup table: first two columns -> full line
     for line in spot_int:
         cols = line.split('\t')
         key = '\t'.join(cols[:2])
         element_dict[key] = line
     for cor in spot_cor:
         try:
             value = element_dict[cor]
             out.write(value + '\n')
         except KeyError:
             pass    # no matching line in spot_int; skip it
     out.close()

Note that I iterate through each list only once, so my timing should 
be O(n + m).  First, I preprocess spot_int by putting it into a 
dictionary.  I create keys for the dictionary that are identical to 
what will be in spot_cor.  Then I iterate through spot_cor, and match 
each element with what's in the dictionary I've made.  If I find 
something that matches, I write it out; if I don't, then I simply move 
on to the next item in spot_cor.
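
For example, given a made-up spot_int line '5\t0\t123', the key built 
above is exactly the tab-joined string you'd expect to find in spot_cor:

     line = '5\t0\t123'
     cols = line.split('\t')
     key = '\t'.join(cols[:2])
     print(repr(key))    # -> '5\t0'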

This does make several assumptions.  First, it assumes that every line 
in spot_int will be unique in the first two columns -- that is, I'm 
assuming that nowhere in the file would you find lines like the following:

      5     0     123
      5     0     456

If these cases *do* happen, then all but the last such line would be 
lost, and only the last one would be left in the dictionary (and thus 
found and written out).  In your version, all lines whose first two 
columns match will be written out.  (This may be the source of your 
"extra" results.)

If you *do* need to handle lines like this, then it wouldn't be hard 
to modify my code for it.  Each value in the dictionary would become a 
list, and matching lines would be appended to that list.  The line 
'element_dict[key] = line' becomes 'element_dict.setdefault(key, 
[]).append(line)', and writing to the file becomes a for-loop instead 
of a single write.
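
Here's a sketch of that variant, under the same assumptions as above:

     out = open('sa_int_2.txt','w')
     element_dict = {}
     for line in spot_int:
         cols = line.split('\t')
         key = '\t'.join(cols[:2])
         # collect *all* lines that share the same first two columns
         element_dict.setdefault(key, []).append(line)
     for cor in spot_cor:
         # write every matching line, not just the last one seen
         for value in element_dict.get(cor, []):
             out.write(value + '\n')
     out.close()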

The other assumption that I make is that these files are all of 
reasonable size to fit into memory.  Your code makes this same 
assumption, too, of course.  This assumption should be valid for up to 
a few tens of thousands (maybe even hundreds of thousands, depending 
on your hardware) of items.  If you have more than this, then your 
best bet would be to start using a real database.  (I've heard good 
things about mysql and sqlite support in Python...)
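
If you do reach that point, a minimal sketch of the sqlite approach 
might look like this (the sqlite3 module and the filenames here are my 
own assumptions, not part of your setup):

     import sqlite3

     conn = sqlite3.connect('spots.db')   # data lives on disk, not in memory
     conn.execute('CREATE TABLE spot_int (key TEXT, line TEXT)')
     conn.execute('CREATE INDEX idx_key ON spot_int (key)')
     for line in open('spot_int.txt'):    # made-up filename; read line by line
         line = line.rstrip('\n')
         cols = line.split('\t')
         conn.execute('INSERT INTO spot_int VALUES (?, ?)',
                      ('\t'.join(cols[:2]), line))
     conn.commit()

     out = open('sa_int_2.txt','w')
     for cor in open('spot_cor.txt'):     # made-up filename; read line by line
         cor = cor.rstrip('\n')
         # the index turns each lookup into a quick seek, not a full scan
         for (value,) in conn.execute(
                 'SELECT line FROM spot_int WHERE key = ?', (cor,)):
             out.write(value + '\n')
     out.close()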

Jeff Shannon
Technician/Programmer
Credit International




