[Tutor] Code optmisation

Mon Apr 7 15:40:39 CEST 2008

yogi wrote:
Some small optimizations inline, more important comments after the code.

>  #/bin/python
> import sys, os, csv, re
> x =  0                                  #Define Zero for now
> var = 1000000                           #Taking the variation 
> # This programme finds the SNPs from the range passed 
> # csv splits columns and this file is tab spaced
> fis = csv.reader(open("divs.map", "rb"), delimiter='\t', quoting=csv.QUOTE_NONE)
> for row in fis:
> # csv splits columns and this file is ","  spaced
>         gvalues = csv.reader(open("genvalues", "rb"), delimiter=',', quoting=csv.QUOTE_NONE)

If gvalues contains all integers you could say
   gvalues = map(int, gvalues)
to convert the whole list to ints at once, then you could remove the 
individual int(gvalue[x]) calls below.

>         for gvalue in gvalues:
> # To see  Columns (chr) Match 
>                 if row[0] == gvalue[0]:
> # If  Column 3  (range) is Zero  print row
>                         if int(gvalue[3]) ==  x:
>                                 a = int(gvalue[1]) - var
>                                 b = int(gvalue[2]) + var + 1
>                                 if int(a <= int(row[3]) <= b):

No need for int() here and below, the result of the comparison is a 
boolean which is what the if statement wants.

>                                         print   row
> # If  Column 3  (range) is not zero find matches and print row
>                         else:
>                                 a = int(gvalue[1]) - var
>                                 b = int(gvalue[2]) + var + 1
>                                 if int(a <= int(row[3]) <= b):
>                                         print row
>                                         c = int(gvalue[3]) - var
>                                         d = int(gvalue[4]) + var + 1
>                                         if int(c <= int(row[3]) <= d):
>                                                 print   row
> 

There will probably be other micro-optimizations you can do but first 
you need to fix your algorithm.
> -----------------------------------------------------
> 
> Question1 : Is there a better way ?

Yes. You don't say how long the above takes to run but if by 12M you 
mean 12,000,000 then I think the running time will be measured in months.

The biggest problem (other than reading genvalues from disk 12M times, 
as Alan pointed out) is that the body of the inner loop will be executed 
12M*1M times. You have to look for ways to get away from that product term.

Two possibilities:
- If the two files are sorted, you could walk through them in parallel 
finding and printing matches as you go. If the files are not sorted, you 
could sort them yourself, either reading the entire file into memory and 
sorting with Python, or using an external disk sort such as Linux 
sort(1). With this approach you would make a single pass through each file.

- Build a dict mapping gvalue[0] to a list of ranges for the gvalue. 
Precompute the ranges (int conversion and var offsets) so when you 
access the values you only have to compare, no arithmetic. Then for each 
row in fis, look up row[0] in the dict, iterate through the ranges 
associated with that value and print the matches.

How well the second approach works will depend on the number of 
different gvalue[0] values. If there are none, it won't cut down on the 
size of the inner loop at all. If there are many, it will help considerably.

> Question 2: If I have convert this code into a function.
> Should I ?

Yes. Access to local variables in a function is faster than access to 
global variables.

Kent