Need help with a program

Thu Jan 28 12:49:02 EST 2010

evilweasel wrote:
> I will make my question a little clearer. I have close to 60,000
> lines of data similar to the ones I posted. There are various
> numbers next to the sequence (this is basically the number of times
> the sequence has been found in a particular sample). So, I would need
> to ignore the ones containing '0' and write all other sequences
> (excluding the number, since it is trivial) in a new text file, in the
> following format:
>
> >seq59902
> TTTTTTTATAAAATATATAGT
>
> >seq59903
> TTTTTTTATTTCTTGGCGTTGT
>
> >seq59904
> TTTTTTTGGTTGCCCTGCGTGG
>
> >seq59905
> TTTTTTTGTTTATTTTTGGG
>
> The number next to 'seq' is the line number of the sequence. When I
> run the above program, what I expect is an output file that is similar
> to the above output but with the ones containing '0' ignored. But, I
> am getting all the sequences printed in the file.
>
> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> improve in the next few months. Thanks to all those who replied. I
> really appreciate it. :)
>   
Using a regular expression may improve readability (if you are familiar
with them). What about:

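import re
import sys

# Assumes each line is a sequence, whitespace, then an integer count.
# [1-9] only matches counts that do not start with 0, so '0' is skipped.
pattern = re.compile(r'(?P<sequence>[GATC]+)\s+[1-9]')

with open(sys.argv[1], 'r') as source, open("sequences1.txt", 'w') as output:
    # enumerate counts from 0; use enumerate(source, 1) for 1-based numbers
    for index, line in enumerate(source):
        match = pattern.match(line)
        if match:
            output.write('seq%s\n%s\n' % (index, match.group('sequence')))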
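
To check which lines a pattern like this keeps without touching any files, here is a minimal sketch that runs the same idea on a few in-memory lines. It assumes each line is a sequence followed by whitespace and a single integer count; the helper name filter_sequences and the counts in the sample are made up for illustration and are not taken from the real data. It compares the count as an integer, so a count such as 10 is kept even though it contains the character '0'.

```python
import re

# Assumption: each line looks like "<sequence> <count>" with one integer count.
PATTERN = re.compile(r'(?P<sequence>[GATC]+)\s+(?P<count>\d+)')


def filter_sequences(lines):
    """Yield (line_index, sequence) for lines whose count is non-zero.

    filter_sequences is a hypothetical helper name, not from the original post.
    """
    for index, line in enumerate(lines):
        match = PATTERN.match(line)
        if match and int(match.group('count')) != 0:
            yield index, match.group('sequence')


# Sequences borrowed from the quoted message; the counts are invented.
sample = [
    "TTTTTTTATAAAATATATAGT 3",
    "TTTTTTTATTTCTTGGCGTTGT 0",
    "TTTTTTTGGTTGCCCTGCGTGG 10",
]

for index, sequence in filter_sequences(sample):
    print('seq%s\n%s' % (index, sequence))
```

With this sample, only the first and third lines are printed; the line with a count of 0 is dropped.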


Jean-Michel