# [Tutor] Please help matching elements from two lists and printing them

Jeff Shannon jeff at ccvcorp.com
Thu Dec 9 20:46:24 CET 2004

kumar s wrote:

> On top of this, this process is VERY SLOW on high end
> server too.

That's because, for each line in spot_cor, you're examining every item
in spot_int, and whenever there's a match, you scan every element of
spot_int again!  If I'm remembering my big-O notation correctly, that
makes this O(n*m + m) -- and even though it simplifies to O(n*m), that's
still quadratic time: as the length of your lists grows, your
performance will degrade *very* rapidly.

Here's a simplified version of your code.  I've changed the for loops
so that they iterate over the lists directly, instead of iterating
over a range and then using that number to index into the list -- the
result is the same, but this way does less work and is easier to read.

    out = open('sa_int_2.txt', 'w')
    for cor_ele in spot_cor:
        for int_ele in spot_int:
            cols = int_ele.split('\t')
            y = cols[0] + '\t' + cols[1]
            if cor_ele == y:
                for int_ele2 in spot_int:
                    if y in int_ele2:
                        out.write(int_ele2 + '\n')

Remember, every time you use 'in' on a list, it means traversing the
entire list.  Nesting statements that use 'in' multiplies that work,
so the total grows polynomially.
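As a rough illustration (the data here is invented), compare a list
membership test, which scans item by item, with a dictionary lookup,
which hashes the key:

```python
# Invented data: lines of "col0<TAB>col1<TAB>value".
spot_int = ['%d\t%d\t%d' % (i, 0, i * 3) for i in range(10000)]

# List membership: Python walks the list element by element -- O(n).
print('5\t0\t15' in spot_int)    # True, but only after a linear scan

# Dictionary membership: one hash computation, no scan -- O(1) on average.
lookup = dict.fromkeys(spot_int)
print('5\t0\t15' in lookup)      # True, found in constant time
```

One scan is harmless; nested inside other loops, that linear cost is
what multiplies into quadratic behavior.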

Now, here's how I'd rewrite your code.

    out = open('sa_int_2.txt', 'w')
    element_dict = {}
    for line in spot_int:
        cols = line.split('\t')
        key = '\t'.join(cols[:2])
        element_dict[key] = line
    for cor in spot_cor:
        try:
            value = element_dict[cor]
            out.write(value + '\n')
        except KeyError:
            pass
    out.close()

Note that I iterate through each list only once, so my timing should
be O(n + m).  First, I preprocess spot_int by putting it into a
dictionary.  I create keys for the dictionary that are identical to
what will be in spot_cor.  Then I iterate through spot_cor, and match
each element with what's in the dictionary I've made.  If I find
something that matches, I write it out; if I don't, then I simply move
on to the next item in spot_cor.
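As a quick sanity check, here's the same two-pass idea run on a few
invented lines, collecting matches into a list instead of a file so the
result is easy to inspect:

```python
# Invented sample data: spot_int lines are "col0<TAB>col1<TAB>value",
# and spot_cor entries are "col0<TAB>col1" keys to look up.
spot_int = ['5\t0\t123', '7\t2\t456', '9\t9\t789']
spot_cor = ['7\t2', '1\t1', '5\t0']

# Pass 1: index spot_int by its first two columns.
element_dict = {}
for line in spot_int:
    cols = line.split('\t')
    element_dict['\t'.join(cols[:2])] = line

# Pass 2: one dictionary lookup per spot_cor entry; misses are skipped.
matches = [element_dict[cor] for cor in spot_cor if cor in element_dict]
print(matches)    # ['7\t2\t456', '5\t0\t123']
```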

This does make several assumptions.  First, it assumes that every line
in spot_int will be unique in the first two columns -- that is, I'm
assuming that nowhere in the file would you find lines like the following:

    5     0     123
    5     0     456

If these cases *do* happen, then all but the last such line would be
lost, and only the last one would be left in the dictionary (and thus
found and written out).  In your version, all lines whose first two
columns match will be written out.  (This may be the source of your
"extra" results.)

If you *do* need to handle lines like this, then it wouldn't be hard
to modify my code for it.  Each value in the dictionary would become a
list, and matching lines are appended to that list.  The line
'element_dict[key] = line' becomes 'element_dict.setdefault(key,
[]).append(line)', and writing to the file becomes a for-loop instead
of a single write.
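Sketched out on invented data, that duplicate-friendly variant might
look like this:

```python
# Invented data with a duplicate key in the first two columns.
spot_int = ['5\t0\t123', '5\t0\t456', '7\t2\t789']
spot_cor = ['5\t0']

element_dict = {}
for line in spot_int:
    cols = line.split('\t')
    key = '\t'.join(cols[:2])
    # Each key now maps to a list holding every line that shares it.
    element_dict.setdefault(key, []).append(line)

out_lines = []
for cor in spot_cor:
    # Writing becomes a for-loop over the list of matches.
    for value in element_dict.get(cor, []):
        out_lines.append(value)
print(out_lines)    # ['5\t0\t123', '5\t0\t456']
```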

The other assumption that I make is that these files are all of
reasonable size to fit into memory.  Your code makes this same
assumption, too, of course.  This assumption should be valid for up to
a few tens of thousands (maybe even hundreds of thousands, depending
on your hardware) of items.  If you have more than this, then your
best bet would be to start using a real database.  (I've heard good
things about MySQL and SQLite support in Python...)
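For what it's worth, modern Python ships a sqlite3 module in the
standard library; a minimal sketch of this same matching done as a SQL
join (the table and column names are invented) might look like:

```python
import sqlite3

# Sketch only: an in-memory database stands in for a real on-disk one,
# and the rows are invented sample data.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE spot_int (c0 TEXT, c1 TEXT, value TEXT)')
conn.execute('CREATE TABLE spot_cor (c0 TEXT, c1 TEXT)')
conn.executemany('INSERT INTO spot_int VALUES (?, ?, ?)',
                 [('5', '0', '123'), ('5', '0', '456'), ('7', '2', '789')])
conn.execute('INSERT INTO spot_cor VALUES (?, ?)', ('5', '0'))

# An index on the join columns keeps the lookup fast at scale.
conn.execute('CREATE INDEX idx_int ON spot_int (c0, c1)')

# The database does the matching for us.
rows = conn.execute(
    'SELECT i.c0, i.c1, i.value FROM spot_cor AS c '
    'JOIN spot_int AS i ON i.c0 = c.c0 AND i.c1 = c.c1').fetchall()
print(sorted(rows))    # [('5', '0', '123'), ('5', '0', '456')]
```

Note that the join naturally returns every line sharing a key, which
sidesteps the duplicate-key question entirely.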

Jeff Shannon
Technician/Programmer
Credit International
