[Tutor] (no subject)

Fri Jan 8 14:40:07 CET 2010

On Thu, Jan 7, 2010 at 5:08 PM, kumar s <ps_python at yahoo.com> wrote:

> I want to take coordinates x and y from each row in file a, and check if they are in range of zx and zy. If they are in range then I want to be able to write both matched rows in a single tab-delimited row.
>
>
> my code:
>
> f1 = open('fileA','r')
> f2 = open('fileB','r')
> da = f1.read().split('\n')
> dat = da[:-1]
> ba = f2.read().split('\n')
> bat = ba[:-1]
>
>
> for m in dat:
>        col = m.split('\t')
>        for j in bat:
>                cols = j.split('\t')
>                if col[1] == cols[1]:
>                        xc = int(cols[2])
>                        yc = int(cols[3])
>                        if int(col[2]) in xrange(xc,yc):
>                                if int(col[3]) in xrange(xc,yc):
>                                        print m+'\t'+j
>
> output:
> a       4       40811596        40811620    z1 4 +  40810323     40812000
>
>
>
> This code is too slow. Could you experts help me make the script a lot faster?
> In each file I have over 50K rows and the script runs very slowly.

As others have pointed out, you are doing way too much work in your
inner loop. You should at least preprocess bat so you aren't doing the
split and int conversion on every line each time through the loop.
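
Here is a rough sketch of that preprocessing, not a drop-in replacement.
The names parse_rows and find_matches are made up for illustration, and
the column positions (key in column 1, coordinates in columns 2 and 3)
are taken from your code. The chained comparison gives the same answer
as "in xrange(xc, yc)" for integers without scanning the range:

```python
def parse_rows(lines):
    """Split each tab-delimited line once and convert the coordinates once."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. the one after a trailing newline
        cols = line.split('\t')
        # Keep the original line so it can be written out unchanged.
        rows.append((cols[1], int(cols[2]), int(cols[3]), line))
    return rows


def find_matches(dat_lines, bat_lines):
    """Same nested-loop logic as the original, minus the repeated parsing."""
    dat_rows = parse_rows(dat_lines)
    bat_rows = parse_rows(bat_lines)  # parsed once, not once per row of dat
    matches = []
    for key, x, y, line in dat_rows:
        for bkey, xc, yc, bline in bat_rows:
            # xc <= x < yc is equivalent to `x in xrange(xc, yc)` for ints
            if key == bkey and xc <= x < yc and xc <= y < yc:
                matches.append(line + '\t' + bline)
    return matches
```

This is still a nested loop, so it only trims the cost of each pass.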

But the bigger problem is the nested loops themselves. With 50K rows in
each file, the inner loop will run 50,000 * 50,000 = 2,500,000,000 times,
which is likely to take a while.

To fix this you need to find a faster way to search bat. Interval
trees are one way:
http://en.wikipedia.org/wiki/Interval_tree
http://hackmap.blogspot.com/2008/11/python-interval-tree.html
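
To give an idea of the approach, here is a minimal sketch of a centered
interval tree. It is my own illustration, not the code from either link.
IntervalTree, stab and find_matches_tree are hypothetical names, the
intervals are half-open like xrange(xc, yc), and the column layout is
assumed to match your code. It builds one tree per value of column 1:

```python
class IntervalTree(object):
    """Centered interval tree over half-open intervals (start, end, payload)."""

    def __init__(self, intervals):
        # Empty intervals contain no point, so they are dropped up front.
        intervals = [iv for iv in intervals if iv[0] < iv[1]]
        self.center = None
        self.left = self.right = None
        self.by_start = self.by_end = ()
        if not intervals:
            return
        starts = sorted(iv[0] for iv in intervals)
        self.center = starts[len(starts) // 2]  # median start point
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] <= self.center:
                left.append(iv)    # lies entirely below the center
            elif iv[0] > self.center:
                right.append(iv)   # lies entirely above the center
            else:
                here.append(iv)    # start <= center < end
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        if left:
            self.left = IntervalTree(left)
        if right:
            self.right = IntervalTree(right)

    def stab(self, point):
        """Return every stored interval with start <= point < end."""
        found = []
        node = self
        while node is not None and node.center is not None:
            if point < node.center:
                for iv in node.by_start:   # ascending start
                    if iv[0] > point:
                        break
                    found.append(iv)
                node = node.left
            else:
                for iv in node.by_end:     # descending end
                    if iv[1] <= point:
                        break
                    found.append(iv)
                node = node.right
        return found


def find_matches_tree(dat_lines, bat_lines):
    """Match rows of dat against bat using one tree per key in column 1."""
    groups = {}
    for line in bat_lines:
        if line.strip():
            cols = line.split('\t')
            groups.setdefault(cols[1], []).append(
                (int(cols[2]), int(cols[3]), line))
    trees = dict((key, IntervalTree(ivs)) for key, ivs in groups.items())
    matches = []
    for line in dat_lines:
        if not line.strip():
            continue
        cols = line.split('\t')
        tree = trees.get(cols[1])
        if tree is None:
            continue
        x, y = int(cols[2]), int(cols[3])
        for xc, yc, bline in tree.stab(x):  # intervals containing x
            if xc <= y < yc:                # keep those that also contain y
                matches.append(line + '\t' + bline)
    return matches
```

Each lookup visits roughly log(n) nodes plus the intervals that actually
contain x, instead of every row of bat. Note that matches for a given
row may come back in a different order than with the nested loops.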

Kent