[Tutor] Analysing genetic code (DNA) using python

Mon Mar 6 22:28:39 CET 2006

Hi Anna,

Let's see one thing at a time.

wrote:
> ** If the three base pairs were UUU the value assigned to it (from the 
> codon value table) would be 0.296 

This can be done in Python, and one appropriate tool may be a dictionary 
such as:
 >>> dna_table = {
 ...     "UUU": 0.296,
 ...     "GGG": 0.3,
 ... }
 >>> print(dna_table["UUU"])
0.296

You'd use the table to look up the corresponding value.

Of course, you'd have to do some programming to get your ASCII 
translation tables converted to a dictionary.
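
As a rough sketch of that conversion: I don't know the layout of your table 
file, so this assumes one "CODON value" pair per line, separated by 
whitespace, with no header line. The function name parse_codon_table and the 
sample text are made up for illustration.

```python
def parse_codon_table(text):
    """Build a {codon: value} dictionary from whitespace-separated text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that do not have exactly two fields.
        if len(fields) != 2:
            continue
        codon, value = fields
        table[codon.upper()] = float(value)
    return table

# Hypothetical sample: UUU is the value from your message,
# GGG is only a placeholder value.
sample = """
UUU 0.296
GGG 0.3
"""
dna_table = parse_codon_table(sample)
```

If your real file has extra columns or a header, the length check and the 
unpacking would need adjusting.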

> The program has to read all the sequence three pairs at a time, then I 
> want to get all the values for each codon, multiply them together and 
> put them to the power of 1 / the length of the sequence in codons 
> (which is the length of the whole sequence divided by three). 

Fractional powers are supported by the pow() function in the math module 
(the ** operator works too), see:

 >>> import math
 >>> math.pow(10,0.5)
3.1622776601683791
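
What you describe (multiply the values, then raise to 1 / number of codons) 
is a geometric mean. A possible sketch follows; note that it sums logarithms 
instead of multiplying directly, which is mathematically the same but avoids 
the product underflowing to zero on long sequences. It assumes every value is 
greater than zero, and the name geometric_mean is my own.

```python
import math

def geometric_mean(values):
    """Return (v1 * v2 * ... * vn) ** (1.0 / n) for positive values."""
    # Summing logarithms avoids underflow from multiplying many small numbers.
    log_sum = sum(math.log(v) for v in values)
    return math.exp(log_sum / len(values))
```

You would call it with the list of values you looked up for each codon, for 
example geometric_mean([0.296, 0.3]).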

> 
> 
> However, to make things even more complicated, the notebook sequences 
> are in lowercase and the codon value table is in uppercase, so the 
> sequences need to be converted into uppercase. Also, the Ts in the DNA 
> sequences need to be changed to Us (again to match the codon value 
> table). And finally, before the DNA sequences are read and analysed I 
> need to remove the first 50 codons (i.e. the first 150 letters) and the 
> last 20 codons (the last 60 letters) from the DNA sequence.

These problems are very straightforward in Python with string methods, 
take a look at the docs at:

http://docs.python.org/lib/string-methods.html
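
For instance, the three steps could be combined roughly like this. The 
function name prepare_sequence and its parameter names are just for 
illustration; the defaults of 150 and 60 letters come from your description.

```python
def prepare_sequence(raw, skip_start=150, skip_end=60):
    """Clean a raw DNA string and trim both ends."""
    # Drop whitespace (including line endings), convert to uppercase,
    # and change T to U so the letters match the codon value table.
    seq = "".join(raw.split()).upper().replace("T", "U")
    # Trim the first skip_start letters and the last skip_end letters.
    return seq[skip_start:len(seq) - skip_end]
```

A sequence shorter than 210 letters simply comes back as an empty string, so 
you may want to check for that case.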

> I've also
> been having problems ensuring the program reads ALL the sequence 3 
> letters at a time. 
> 

Line endings may be causing you problems if you try to read line by line.

Something like this should do:

# fileo is assumed to be an already opened file object;
# read 100 characters from it
buff = fileo.read(100)

# eliminate line endings
buff = buff.replace("\r", "").replace("\n", "")

# take the next three characters (one codon)
word = buff[0:3]

# consume that part of the buffer
buff = buff[3:]

This is slow and can be polished. One approach could be reading 
characters into a list and then consuming that list (strings are a bit 
harder since you cannot change them in place). Also note that slicing 
does not raise an exception when the buffer is too short: it silently 
returns fewer than three characters, so you'll have to check the length 
and read another chunk from the file when the buffer runs low.
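
Putting the pieces together, one way to sketch the whole loop is a generator 
that refills its buffer as needed. The name read_codons and the chunk_size 
parameter are my own, and io.StringIO only stands in for your real file here.

```python
import io

def read_codons(fileobj, chunk_size=100):
    """Yield successive three-letter codons from a text file object."""
    buff = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # Remove line endings and any other whitespace before buffering.
        buff += "".join(chunk.split())
        # Hand out every complete codon currently in the buffer.
        while len(buff) >= 3:
            yield buff[:3]
            buff = buff[3:]
    # Anything left in buff here is an incomplete codon and is dropped.

# Stand-in for a real file, read in deliberately tiny chunks.
codons = list(read_codons(io.StringIO(u"uuug\ngguu\nu"), chunk_size=4))
```

With a real file you would pass the object returned by open() instead of the 
StringIO stand-in.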

Hope this all helps. Please let us know how far you have got and where 
you get stuck so we can help.

Hugo