[Tutor] Simple string processing problem

cgw501@york.ac.uk cgw501 at york.ac.uk
Fri May 13 22:59:58 CEST 2005


Thanks! 

Your help has made me realise the problem is more complex than I first 
thought...I've included a small sample of an actual file I need to 
process. The structure is the same as in the full versions though; some 
lowercase, some uppercase, then some more lowercase. One complication is 
that I need to remove the lines of asterisks. I think I can do this with 
.isalpha(). Here's what I've written:

theAlignment = open('alignment.txt', 'r')

strippedList = []
for line in theAlignment:
    if line.isalpha():
        strippedList.append(line.strip('atgc'))

strippedFile = open ('stripped.txt', 'w')

for i in strippedList:
    strippedFile.write(i)

strippedFile.close()
theAlignment.close()
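
A note on the isalpha() test above: str.isalpha() is True only when every 
character in the string is a letter. Each line of the file contains spaces 
and a trailing newline, so the test is False for the sequence lines as well 
as the asterisk lines. One possible alternative, as a minimal sketch, is to 
recognise the asterisk lines directly. The helper names 
is_conservation_line and sequence_lines are made up for illustration, and 
the sketch assumes those lines contain nothing but '*' and whitespace.

```python
def is_conservation_line(line):
    """Return True for a line made up only of asterisks and whitespace."""
    stripped = line.strip()
    # A blank line is not an asterisk line, so require some content first.
    return stripped != "" and set(stripped) <= set("* ")


def sequence_lines(lines):
    """Yield only the lines that carry a sequence.

    Blank lines and asterisk-only lines are skipped.
    """
    for line in lines:
        if line.strip() and not is_conservation_line(line):
            yield line
```

Since an open file is itself an iterable of lines, sequence_lines(theAlignment) 
could stand in for theAlignment in the loop above.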


The other complication is that I need to retain the lowercase stuff at the 
start of each sequence (the sequences are aligned, so 'Scer' in the second 
block follows on from 'Scer' in the first etc.). Maybe the best thing to do 
would be to concatenate all the Scer, Spar, Smik and Sbay sequences before 
processing them? Also I need to get rid of '-' characters within the 
trailing lowercase, but keep the rest of them. So basically everything 
after the last capital letter only needs to go.
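
The concatenate-then-trim idea could be sketched roughly as follows. This 
is only an illustration under stated assumptions: every sequence line has 
exactly two whitespace-separated fields (a name and a sequence), the 
asterisk lines contain only '*' and spaces, and a sequence with no capital 
letter at all is left unchanged. The function names join_blocks and 
trim_after_last_capital are hypothetical.

```python
def join_blocks(lines):
    """Concatenate the aligned blocks belonging to each sequence name.

    Returns a dict mapping name -> full sequence, in order of first appearance.
    """
    sequences = {}
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and lines made up only of asterisks and spaces.
        if not stripped or set(stripped) <= set("* "):
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: a sequence line is "name sequence"
        name, chunk = parts
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


def trim_after_last_capital(seq):
    """Drop everything after the last uppercase letter in seq.

    Leading lowercase and any '-' before the last capital are kept.
    """
    for i in range(len(seq) - 1, -1, -1):
        if seq[i].isupper():
            return seq[:i + 1]
    return seq  # assumption: no capitals means nothing to trim
```

Running join_blocks over the open file and then trim_after_last_capital over 
each joined sequence keeps the leading lowercase and the internal '-' 
characters, and removes the trailing lowercase together with any '-' 
inside it.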

I'd really appreciate any thoughts, but don't worry if you've got better 
things to do.

Chris


The file:

Scer            ACTAACAAGCAAAATGTTTTGTTTCTCCTTTT-AAAATAGTACTGCTGTTTCTCAAGCTG
Spar            actaacaagcaaaatgttttgtttctcctttt-aaaatagtactgctgtttctcaagctg
Smik            actaacaagcaaaatgtttcttttctcttttttgaaatagtactgctgcttctcaagctg
Sbay            actaacaagcaaaaactttttgttttatt----gaaatagtactgctgtctctcaagctg
                ****  * ************** **   ********  ***   ***** *******  *

Scer            GGGGTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Spar            GGGGTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Smik            GGGGTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Sbay            GGGGTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAATTGCAAATTAACTGTG
                * ********** *********  **** *********   *  ** ***** ** ****

Scer            ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAaaggctttttt-ataa
Spar            ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAaaagctttttttataa
Smik            ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAgaagctttttctataa
Sbay            ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAgaagctttttctataa
                ******************************************** * *******  ****

Scer            actttttataatt----aacattaa-------agcaaaaacaacattgtaaagattaaca
Spar            actttttataata----aacatcaa-------agcaaaaacaacattgtaaagattaaca
Smik            actttttataatt----aacatcgacaaaaacgacaacaacaacattgtaaagattaaca
Sbay            actttttataacttagcaacaacaacaacaacaacatcaacaacattgtaaagattaaca
                ***********      ****   *         **  **********************


On May 13 2005, Max Noel wrote:

> 
> On May 13, 2005, at 20:36, cgw501 at york.ac.uk wrote:
> 
> > Hi,
> >
> > I am a Biology student taking some early steps with programming. I'm
> > currently trying to write a Python script to do some simple processing
> > of a gene sequence file.
> 
>      Welcome aboard!
> 
> > A line in the file looks like:
> > SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat
> >
> > There are many lines like this. What I want to do is read the file and
> > remove the trailing lowercase letters and create a new file containing
> > the remaining information. I have some ideas of how to do this (using
> > the islower() method of strings). I was hoping someone could help me
> > with the file handling. I was thinking I'd use .readlines() to get a
> > list of the lines, but I'm not sure how to delete the right letters or
> > write to a new file. Sorry if this is trivially easy.
> 
>      First of all, you shouldn't use readlines() unless you really  
> need to have access to several lines at the same time. Loading the  
> entire file in memory eats up a lot of memory and scales up poorly.  
> Whenever possible, you should iterate over the file, like this:
> 
> 
> foo = open("foo.txt")
> for line in foo:
>      # do stuff with line...
> foo.close()
> 
> 
>      As for the rest of your problem, the strip() method of string  
> objects is what you're looking for:
> 
> 
>  >>> "SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat".strip("atgc")
> 'SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATC'
> 
> 
>      Combining those 2 pieces of advice should solve your problem.
> 
> 

