# [Tutor] Code optmisation

Kent Johnson kent37 at tds.net
Mon Apr 7 15:40:39 CEST 2008

yogi wrote:
Some small optimizations inline, more important comments after the code.

>  #/bin/python
> import sys, os, csv, re
> x =  0                                  #Define Zero for now
> var = 1000000                           #Taking the variation
> # This programme finds the SNPs from the range passed
> # csv splits columns and this file is tab spaced
> fis = csv.reader(open("divs.map", "rb"), delimiter='\t', quoting=csv.QUOTE_NONE)
> for row in fis:
> # csv splits columns and this file is ","  spaced
>         gvalues = csv.reader(open("genvalues", "rb"), delimiter=',', quoting=csv.QUOTE_NONE)

If gvalues contains all integers you could say
gvalues = map(int, gvalues)
to convert the whole list to ints at once, then you could remove the
individual int(gvalue[x]) calls below.

>         for gvalue in gvalues:
> # To see  Columns (chr) Match
>                 if row[0] == gvalue[0]:
> # If  Column 3  (range) is Zero  print row
>                         if int(gvalue[3]) ==  x:
>                                 a = int(gvalue[1]) - var
>                                 b = int(gvalue[2]) + var + 1
>                                 if int(a <= int(row[3]) <= b):

No need for int() here and below, the result of the comparison is a
boolean which is what the if statement wants.

>                                         print   row
> # If  Column 3  (range) is not zero find matches and print row
>                         else:
>                                 a = int(gvalue[1]) - var
>                                 b = int(gvalue[2]) + var + 1
>                                 if int(a <= int(row[3]) <= b):
>                                         print row
>                                         c = int(gvalue[3]) - var
>                                         d = int(gvalue[4]) + var + 1
>                                         if int(c <= int(row[3]) <= d):
>                                                 print   row
>

There will probably be other micro-optimizations you can do but first
you need to fix your algorithm.
> -----------------------------------------------------
>
> Question1 : Is there a better way ?

Yes. You don't say how long the above takes to run but if by 12M you
mean 12,000,000 then I think the running time will be measured in months.

The biggest problem (other than reading genvalues from disk 12M times,
as Alan pointed out) is that the body of the inner loop will be executed
12M*1M times. You have to look for ways to get away from that product term.

Two possibilities:
- If the two files are sorted, you could walk through them in parallel
finding and printing matches as you go. If the files are not sorted, you
could sort them yourself, either reading the entire file into memory and
sorting with Python, or using an external disk sort such as Linux
sort(1). With this approach you would make a single pass through each file.

- Build a dict mapping gvalue[0] to a list of ranges for the gvalue.
Precompute the ranges (int conversion and var offsets) so when you
access the values you only have to compare, no arithmetic. Then for each
row in fis, look up row[0] in the dict, iterate through the ranges
associated with that value and print the matches.

How well the second approach works will depend on the number of
different gvalue[0] values. If there are none, it won't cut down on the
size of the inner loop at all. If there are many, it will help considerably.

> Question 2: If I have convert this code into a function.
> Should I ?