Need help with a program

Thu Jan 28 14:59:44 EST 2010

Steven Howe wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>> evilweasel wrote:
>>> I will make my question a little clearer. I have close to 60,000
>>> lines of the data similar to the one I posted. There are various
>>> numbers next to the sequence (this is basically the number of times
>>> the sequence has been found in a particular sample). So, I would need
>>> to ignore the ones containing '0' and write all other sequences
>>> (excluding the number, since it is trivial) in a new text file, in the
>>> following format:
>>>
>>>> seq59902
>>> TTTTTTTATAAAATATATAGT
>>>
>>>> seq59903
>>> TTTTTTTATTTCTTGGCGTTGT
>>>
>>>> seq59904
>>> TTTTTTTGGTTGCCCTGCGTGG
>>>
>>>> seq59905
>>> TTTTTTTGTTTATTTTTGGG
>>>
>>> The number next to 'seq' is the line number of the sequence. When I
>>> run the above program, what I expect is an output file that is similar
>>> to the above output but with the ones containing '0' ignored. But, I
>>> am getting all the sequences printed in the file.
>>>
>>> Kindly excuse the 'newbieness' of the program. :) I am hoping to
>>> improve in the next few months. Thanks to all those who replied. I
>>> really appreciate it. :)
>> Using regexp may increase readability (if you are familiar with it). 
>> What about
>>
>> import re
>> import sys
>>
>> output = open("sequences1.txt", 'w')
>>
>> for index, line in enumerate(open(sys.argv[1], 'r')):
>>    # match a sequence followed by whitespace and a count starting with 1
>>    match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
>>    if match:
>>        output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>>
>>
>> Jean-Michel
> 
> Finally!
> 
> After reading 8 or 9 messages about finding a line ending with '1', someone 
> suggests Regex.
> It was my first thought.
> 
I'm a great fan of regexes, but I never thought of using them for this
because it doesn't look like a regex type of problem to me.
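
For comparison, here is a minimal sketch of what a non-regex version might
look like. It assumes each input line holds a sequence followed by a single
whitespace-separated count; the input format is not shown in this message,
so that layout is a guess. The names filter_sequences and write_fasta are
made up for illustration, and numbering lines from 1 is also an assumption.

```python
def filter_sequences(lines):
    """Yield (line_number, sequence) for lines whose count is not '0'."""
    for index, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # assumption: first field is the sequence, last field is the count
        sequence, count = fields[0], fields[-1]
        if count != '0':
            yield index, sequence


def write_fasta(lines, output):
    """Write the kept sequences as '>seqN' headers followed by the sequence."""
    for index, sequence in filter_sequences(lines):
        output.write('>seq%d\n%s\n\n' % (index, sequence))
```

Something like write_fasta(open(sys.argv[1]), open("sequences1.txt", "w"))
would then do the whole job, with str.split doing the parsing.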