[Tutor] Simple string processing problem
cgw501@york.ac.uk
cgw501 at york.ac.uk
Fri May 13 22:59:58 CEST 2005
Thanks!
Your help has made me realise the problem is more complex than I first
thought, though... I've included a small sample of an actual file I need to
process. The structure is the same as in the full versions: some
lowercase, some uppercase, then some more lowercase. One complication is
that I need to remove the lines of asterisks. I think I can do this with
.isalpha(). Here's what I've written:
theAlignment = open('alignment.txt', 'r')
strippedList = []
for line in theAlignment:
    if line.isalpha():
        strippedList.append(line.strip('atgc'))
strippedFile = open('stripped.txt', 'w')
for i in strippedList:
    strippedFile.write(i)
strippedFile.close()
theAlignment.close()
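
A caveat on the isalpha() test: str.isalpha() is False for any string that
contains a space, a hyphen, or a trailing newline, so it would reject the
sequence lines along with the asterisk lines. Below is a minimal sketch of an
alternative check, assuming the lines to drop contain only asterisks and
whitespace. The helper name is_conservation_line and the sample lines are
made up for illustration.

```python
def is_conservation_line(line):
    """Return True for non-blank lines made up only of asterisks and spaces."""
    stripped = line.strip()
    return stripped != "" and set(stripped) <= set("* ")


# Hypothetical sample lines in the same layout as the alignment file.
lines = [
    "Scer ACTAACaag-ctg\n",
    "**** * ******\n",
    "Spar actaacAAGCTG\n",
]

# Keep everything except the asterisk lines.
kept = [ln for ln in lines if not is_conservation_line(ln)]
```

Testing for the asterisk lines directly avoids having to describe every
character that may legitimately appear in a sequence line.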
The other complication is that I need to retain the lowercase stuff at the
start of each sequence (the sequences are aligned, so 'Scer' in the second
block follows on from 'Scer' in the first etc.). Maybe the best thing to do
would be to concatenate all the Scer, Spar, Smik and Sbay sequences before
processing them? Also, I need to get rid of '-' characters within the
trailing lowercase, but keep the rest of them. So basically, only what
comes after the last capital letter needs to go.
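
The concatenate-then-trim idea can be sketched as follows. This is only one
possible approach, under two assumptions: every sequence line is a name
followed by a single chunk with no internal spaces, and everything after the
last capital letter is lowercase letters or '-'. The function name
trim_alignment is made up for illustration.

```python
import string


def trim_alignment(lines):
    """Join each name's blocks, then drop everything after the last capital."""
    sequences = {}  # name -> concatenated sequence
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and lines holding only asterisks and spaces.
        if not stripped or set(stripped) <= set("* "):
            continue
        # Assumes exactly two fields: the name and one sequence chunk.
        name, chunk = stripped.split()
        sequences[name] = sequences.get(name, "") + chunk
    # rstrip removes trailing lowercase letters and '-' only; leading
    # lowercase and '-' inside the kept region are left alone.
    trailing = string.ascii_lowercase + "-"
    return {name: seq.rstrip(trailing) for name, seq in sequences.items()}


# Tiny made-up example in the same layout as the alignment file.
sample = [
    "Scer ACgt\n",
    "Spar acGT\n",
    "** * *\n",
    "Scer AC-Tag-c\n",
    "Spar ACt--a\n",
]
result = trim_alignment(sample)
```

Because the trimming happens after concatenation, lowercase at the start or
in the middle of a sequence survives; only the tail after the final capital
is removed.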
I'd really appreciate any thoughts, but don't worry if you've got better
things to do.
Chris
The file:
Scer ACTAACAAGCAAAATGTTTTGTTTCTCCTTTT-AAAATAGTACTGCTGTTTCTCAAGCTG
Spar actaacaagcaaaatgttttgtttctcctttt-aaaatagtactgctgtttctcaagctg
Smik actaacaagcaaaatgtttcttttctcttttttgaaatagtactgctgcttctcaagctg
Sbay actaacaagcaaaaactttttgttttatt----gaaatagtactgctgtctctcaagctg
**** * ************** ** ******** *** ***** ******* *
Scer GGGGTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Spar GGGGTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Smik GGGGTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
Sbay GGGGTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAATTGCAAATTAACTGTG
* ********** ********* **** ********* * ** ***** ** ****
Scer ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAaaggctttttt-ataa
Spar ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAaaagctttttttataa
Smik ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAgaagctttttctataa
Sbay ACCACGTCCAATCTACCGATATTGCTGCTATGCAAAAATTATAAgaagctttttctataa
******************************************** * ******* ****
Scer actttttataatt----aacattaa-------agcaaaaacaacattgtaaagattaaca
Spar actttttataata----aacatcaa-------agcaaaaacaacattgtaaagattaaca
Smik actttttataatt----aacatcgacaaaaacgacaacaacaacattgtaaagattaaca
Sbay actttttataacttagcaacaacaacaacaacaacatcaacaacattgtaaagattaaca
*********** **** * ** **********************
On May 13 2005, Max Noel wrote:
>
> On May 13, 2005, at 20:36, cgw501 at york.ac.uk wrote:
>
> > Hi,
> >
> > I am a biology student taking some early steps with programming. I'm
> > currently trying to write a Python script to do some simple processing
> > of a gene sequence file.
>
> Welcome aboard!
>
> > A line in the file looks like:
> > SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat
> >
> > There are many lines like this. What I want to do is read the file,
> > remove the trailing lowercase letters, and create a new file containing
> > the remaining information. I have some ideas of how to do this (using
> > the islower() method of string objects). I was hoping someone could help
> > me with the file handling. I was thinking I'd use .readlines() to get a
> > list of the lines, but I'm not sure how to delete the right letters or
> > write to a new file. Sorry if this is trivially easy.
>
> First of all, you shouldn't use readlines() unless you really
> need to have access to several lines at the same time. It loads the
> entire file into memory at once, which scales poorly for large files.
> Whenever possible, you should iterate over the file, like this:
>
>
> foo = open("foo.txt")
> for line in foo:
>     # do stuff with line...
> foo.close()
>
>
> As for the rest of your problem, the strip() method of string
> objects is what you're looking for:
>
>
> >>> "SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat".strip("atgc")
> 'SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATC'
>
>
> Combining those 2 pieces of advice should solve your problem.
>
>
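
A note on the quoted strip() suggestion: str.strip() trims both ends, and a
line read from a file still ends in a newline, so strip("atgc") leaves the
trailing lowercase untouched unless the newline is removed first. A small
sketch, assuming only a, t, g and c need trimming from the right (the sample
line is shortened from the one quoted above):

```python
line = "SCER ATCGATCagctag\n"

# The trailing "\n" is not in "atgc", so nothing is stripped from the right.
unchanged = line.strip("atgc")

# Drop the newline first, then trim lowercase bases from the right end only.
trimmed = line.rstrip("\n").rstrip("atgc")
```

Using rstrip() rather than strip() also leaves any lowercase at the start of
a line in place.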