[Tutor] (no subject)

Alan Gauld alan.gauld at btinternet.com
Thu Jan 7 23:47:16 CET 2010


"kumar s" <ps_python at yahoo.com> wrote

> f1 = open('fileA','r')
> f2 = open('fileB','r')
> da = f1.read().split('\n')
> dat = da[:-1]
> ba = f2.read().split('\n')
> bat = ba[:-1]

You could replace all that with

dat = open('fileA').readlines()
bat = open('fileB').readlines()

> for m in dat:
>        col = m.split('\t')
>        for j in bat:

This means that for every line in dat you are looping over every line in bat.
For two 50K-line files that is 50K x 50K = 2,500 million splits and comparisons.
That's why it's slow!

If you have the data sorted you can process it much more effectively,
because you can keep a marker in bat to see where to start searching.
That's one option.
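
Another way to avoid rescanning bat for every line of dat (not the marker
approach, but the same idea of cutting down the inner loop) is to group the
bat rows by their key column up front, so each dat line only looks at the
rows that could possibly match. A rough sketch, assuming tab-separated
files with the key in column index 1 and integer coordinates in columns
2 and 3, as in your code:

from collections import defaultdict

groups = defaultdict(list)
for line in open('fileB'):
    cols = line.rstrip('\n').split('\t')
    if len(cols) > 3:
        groups[cols[1]].append((int(cols[2]), int(cols[3])))

for line in open('fileA'):
    col = line.rstrip('\n').split('\t')
    if len(col) < 4:
        continue
    x, y = int(col[2]), int(col[3])
    for xc, yc in groups.get(col[1], []):   # only rows sharing the key
        if xc <= x < yc and xc <= y < yc:
            pass    # ...do whatever your script does with the match

That turns the 50K x 50K scan into one pass over each file plus a short
scan of the matching group.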

>                cols = j.split('\t')
>                if col[1] == cols[1]:
>                        xc = int(cols[2])
>                        yc = int(cols[3])
>                        if int(col[2]) in xrange(xc,yc):

Testing membership with xrange steps through every value in the range
to see whether your value is there, and again that's slow. You could do a
direct comparison against the lower and upper boundaries instead (note
that in xrange(xc, yc) includes xc but excludes yc):

if xc <= int(col[2]) < yc:

That will be much faster.

>                                if int(col[3]) in xrange(xc,yc):

Same here.
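
So the two range tests collapse into a single chained comparison, something
like:

if xc <= int(col[2]) < yc and xc <= int(col[3]) < yc:
    # ...same body as before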

> This code is too slow. Could you experts help me speed the script a lot 
> faster.
> In each file I have over 50K rows and the script runs very slow.


If you have the two files sorted you only need to step through the values
in bat until you reach invalid values, so that should be faster still. You
can never be 100% sure with these kinds of tasks, but I'd expect sorting
both data sets and then comparing to be faster than what you are doing now.
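
To make the "step through bat until you reach invalid values" idea concrete:
if each group of bat rows (as in the earlier sketch) is sorted by its start
coordinate, the inner scan can break as soon as an interval starts beyond
the point being tested. A rough sketch that carries on from the grouped
index above:

for intervals in groups.values():
    intervals.sort()    # sort each key's intervals by their start coordinate

for line in open('fileA'):
    col = line.rstrip('\n').split('\t')
    if len(col) < 4:
        continue
    x, y = int(col[2]), int(col[3])
    for xc, yc in groups.get(col[1], []):
        if xc > x:
            break       # later intervals start even further to the right
        if x < yc and xc <= y < yc:    # xc <= x already holds here
            pass        # ...process the match as before

That is essentially the marker idea applied within each key group.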

HTH,


-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/ 



