[Tutor] Simple string processing problem

Mon May 16 14:03:25 CEST 2005

cgw501 at york.ac.uk wrote:
> Thanks! 
> 
> Your help has made me realise the problem is more complex than I first 
> thought, though... I've included a small sample of an actual file I need to 
> process. The structure is the same as in the full versions though; some 
> lowercase, some uppercase, then some more lowercase. One is that I need to 
> remove the lines of asterisks. I think I can do this with .isalpha(). 
> Here's what I've written:
> 
> theAlignment = open('alignment.txt', 'r')
> 
> strippedList = []
> for line in theAlignment:
>     if line.isalpha():
>         strippedList.append(line.strip('atgc'))
> 
> strippedFile = open ('stripped.txt', 'w')
> 
> for i in strippedList:
>     strippedFile.write(i)
> 
> strippedFile.close()
> theAlignment.close()

You can read and write in the same loop and avoid creating the intermediate list. Also I think you 
will need to strip the trailing newline (otherwise it blocks stripping the lower case chars) and 
then add it back:

theAlignment = open('alignment.txt', 'r')
strippedFile = open ('stripped.txt', 'w')

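for line in theAlignment:
    # isalpha() is False for any line containing '\n', spaces or '-',
    # so look for the asterisks directly instead
    if '*' not in line:
        strippedFile.write(line.strip('atgc\n'))
        strippedFile.write('\n')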

strippedFile.close()
theAlignment.close()
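
Here is a rough sketch of the same idea as a small function, so it can be tried out without any files. 
The name strip_alignment_lines and the sample lines are made up for illustration, and it assumes the 
asterisk lines contain no letters at all:

```python
def strip_alignment_lines(lines):
    """Yield each sequence line with leading/trailing a, t, g, c removed.

    Assumption: lines of asterisks (and blank lines) contain no letters,
    so they can be recognised by the absence of any alphabetic character.
    """
    for line in lines:
        text = line.rstrip('\n')
        # str.isalpha() is True only when *every* character is a letter,
        # so test for "at least one letter" instead.
        if not any(ch.isalpha() for ch in text):
            continue  # skip asterisk lines and blank lines
        yield text.strip('atgc') + '\n'


# Made-up sample lines, not taken from the real alignment file
sample = [
    'Scer   acttACGTtt--aa\n',
    '       ****  ** \n',
    '\n',
    'Spar   ggGGCCaa\n',
]
result = list(strip_alignment_lines(sample))
print(result)
```

Since a file object can be iterated line by line, the open file could be passed in place of the list 
and each yielded line written to the output file.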

>
> The other complication is that I need to retain the lowercase stuff at the 
> start of each sequence (the sequences are aligned, so 'Scer' in the second 
> block follows on from 'Scer' in the first etc.). 

You can use line.rstrip('atgc') to strip from the right side only. In the data you have shown, 
though, no line starts with a lower case letter (each begins with a name such as 'Scer'), so 
strip() leaves the start of the line alone anyway. For example,
  >>> 'Scer            actttttataatt----aacattaa-------agcaaaaacaacattgtaaagattaaca'.strip('atgc')
'Scer            actttttataatt----aacattaa-------'
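
The difference only shows up when the string itself starts with one of the characters being 
stripped. A small illustration, using a made-up sequence with no name column in front:

```python
seq = 'acttACGTACGTtt--ggaa'   # made-up sequence, for illustration only

both = seq.strip('atgc')     # removes a/t/g/c from both ends
right = seq.rstrip('atgc')   # removes a/t/g/c from the right end only

print(both)
print(right)
```

Here strip() also eats the leading 'actt', while rstrip() keeps it. Both stop at the first 
character that is not in the set, which is why the '--' survives in each result.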

You might like to browse this page to see what else you can do with strings:
http://docs.python.org/lib/string-methods.html

> Maybe the best thing to do
> would be to concatenate all the Scer, Spar, Smik and Sbay sequences before 
> processing them? Also I need to get rid of '-' characters within the 
> trailing lowercase, but keep the rest of them. So basically everything 
> after the last capital letter only needs to go.
> 
> I'd really appreciate any thoughts, but don't worry if you've got better 
> things to do.
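
One possible way to do the concatenation is to collect the pieces for each name in a dictionary and 
then trim the tail once at the end. This is only a sketch: join_blocks and trim_tail are names I 
made up, the sample lines are invented, and it assumes every sequence line is exactly a name followed 
by whitespace and the sequence. Adding '-' to the characters passed to rstrip() removes the gaps 
inside the trailing lowercase run as well, because stripping only stops at the first character 
(from the right) that is not in the set:

```python
def join_blocks(lines):
    """Concatenate the sequence pieces for each name, block after block."""
    sequences = {}
    for line in lines:
        parts = line.split()
        # Assumption: a sequence line is exactly "name sequence"; blank
        # lines and lines containing asterisks are ignored.
        if len(parts) != 2 or '*' in line:
            continue
        name, piece = parts
        sequences[name] = sequences.get(name, '') + piece
    return sequences


def trim_tail(seq):
    """Remove everything after the last character that is not a, t, g, c or '-'."""
    return seq.rstrip('atgc-')


# Made-up two-block sample, not taken from the real alignment file
sample = [
    'Scer   actt--ACGT\n',
    'Spar   ggaaTTCCGG\n',
    '         **  ** \n',
    '\n',
    'Scer   AC-Gtt--aa\n',
    'Spar   ACCGtta-ca\n',
]
trimmed = {name: trim_tail(seq) for name, seq in join_blocks(sample).items()}
print(trimmed)
```

The lowercase at the start of each sequence and the '-' characters before the last capital are left 
alone. If the tail can contain other lowercase letters (for example 'n'), they would need to be 
added to the set given to rstrip().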

Don't worry about asking beginner questions, that's what this list is for :-)

Kent