[Fwd: Re: [Tutor] searching for data in one file from another]
Kent Johnson
kent_johnson at skillsoft.com
Fri Nov 5 15:39:00 CET 2004
Rich,
When you read f2 with readlines(), each line keeps its trailing newline, so
the exon parsed from f1 (which has no newline) will never match an entry in
the list. Also, since you are apparently doing many membership tests against
the list, a set would probably be faster. I suggest you try something like
this to create 'list':
from sets import Set
list = Set()
for line in open(exons_to_delete):
    list.add(line.strip())
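To see the newline problem concretely, here is a small sketch. It assumes Python 3, where the built-in set takes the place of sets.Set, and it uses an in-memory file as a stand-in for Exonlist.txt; the two IDs are copied from the FASTA excerpt in the quoted message, and the variable names are only illustrative.

```python
import io

# Stand-in for Exonlist.txt: one exon ID per line (illustrative contents).
exon_file = io.StringIO("ENSE00001383339.1\nENSE00001387275.1\n")

raw_lines = exon_file.readlines()   # every entry still ends with '\n'
exon = "ENSE00001383339.1"          # the kind of value line[1:].split('|')[0] gives

# The ID has no newline, the list entries do, so the test fails.
print(exon in raw_lines)

# Stripping each line before storing it makes the same test succeed.
stripped = set(line.strip() for line in raw_lines)
print(exon in stripped)
```

The first test prints False and the second prints True, which is why the original program never skipped anything.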
The rest of the program stays the same, including the test 'if exon in list'.
You might want to use a different name for 'list', though, since that name
hides the built-in list type.
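Putting the pieces together, the whole filter might look roughly like the sketch below. This is written for Python 3 and is not a drop-in replacement for the original script: the names load_exons, delete_exons and exons_to_skip are made up for illustration, and the tiny in-memory FASTA data is invented so the example can run without the real files.

```python
import io

def load_exons(lines):
    """Collect exon IDs (one per line) into a set, dropping newlines and blanks."""
    return set(line.strip() for line in lines if line.strip())

def delete_exons(fasta_lines, exons_to_skip):
    """Yield every line except those in records whose ID is in exons_to_skip."""
    exon = None
    for line in fasta_lines:
        if line.startswith('>'):
            # Header line: the exon ID is the text before the first '|'.
            exon = line[1:].split('|')[0]
        if exon in exons_to_skip:
            # Skips the header and every sequence line that follows it.
            continue
        yield line

# Invented sample data standing in for altsplice1.fasta and Exonlist.txt.
fasta = io.StringIO(
    ">ENSE1.1|ENSG1.1|ENST1.1 demo\n"
    "ACGT\n"
    ">ENSE2.1|ENSG1.1|ENST1.1 demo\n"
    "GGCC\n"
)
exons_to_skip = load_exons(io.StringIO("ENSE1.1\n"))

kept = list(delete_exons(fasta, exons_to_skip))
print("".join(kept), end="")
```

With real files you would pass open(...) objects instead of the StringIO stand-ins and write the kept lines to the output file.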
Kent
At 09:16 AM 11/5/2004 -0500, Rich Krauter wrote:
>import sys,string
>WFILE=open(sys.argv[1], 'w')
>def deleteExons(fname2='Z:/datasets/altsplice1.fasta',exons_to_delete='Z:/datasets/Exonlist.txt'):
>    f = open(fname2)
>    f2 = open(exons_to_delete)
>    list = f2.readlines()
>    exon = None
>    for line in f:
>        if line.startswith('>'):
>            exon = line[1:].split('|')[0]
>        if exon in list:
>            continue
>        yield line
>
>
>if __name__ == '__main__':
>    for line in deleteExons():
>        print >> WFILE, line,
>
>exonlist is made from the last program you helped me with and consists
>of single lines of exons
>
>altsplice1.fasta is 85583 kb
>When I run the program it does not shrink the file at all; in fact,
>although the first and last 40 lines appear to be the same, the
>output file is larger than the original.
>
>It is a normal FASTA file:
>
>>ENSE00001383339.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 57203 to 57283|exons plus upstream and downstream regions for exon
>ACCCAGCAAAATGGGGATCTCCACAGTCATCCTTGAAATGTGTCTTTTATGGGGACAAGTTCTATCTACAGGTATTACGT
>T
>
>>ENSE00001387275.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 72877 to 72981|exons plus upstream and downstream regions for exon
>GAGATGGCAGGTGTCAGGGCCGAGTGGAGATCCTATACCGAGGCTCCTGGGGCACCGTGTGTGATGACAGCTGGGACACC
>AATGATGCCAACGTGGTCTGTAGGC
>
>>ENSE00001378578.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 82505 to 82835|exons plus upstream and downstream regions for exon
>CTGAATCCAGTTTGGCCCTGAGGCTGGTGAATGGAGGTGACAGGTGTCAGGGCCGAGTGGAGGTCCTATACCGAGGCTCC
>TGGGGCACCGTGTGTGATGACAGCTGGGACACCAATGATGCCAATGTGGTCTGCAGGCAGCTGGGCTGTGGCTGGGCCAT
>GTTGGCCCCAGGAAATGCCCGGTTTGGTCAGGGCTCAGGACCCATTGTCCTGGATGACGTGCGCTGCTCAGGGAATGAGT
>CCTACTTGTGGAGCTGCCCCCACAATGGCTGGCTCTCCCATAACTGTGGCCATAGTGAAGACGCTGGTGTCATCTGCTCA
>GGTGGGCCTCC
>
>>ENSE00001379544.1|ENSG00000187908.1|ENST00000339871.1 assembly=NCBI34|chr=10_NT_078087|strand=forward|bases 88623 to 89087|exons plus upstream and downstream regions for exon
>
>Any thoughts?
>
>Scott