[Tutor] Analysing genetic code (DNA) using python

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Mar 6 20:25:39 CET 2006

> I have many notepad documents that all contain long chunks of genetic
> code. They look something like this:
> atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag

[data example cut]

> Basically, I want to design a program using python that can open and
> read these documents.

Hello sjw,

Ok, let's stop at this point.

It sounds like you want to design a function that takes in a string of
genetic code, and calculates the
"number_I_dont_understand_because_Im_not_a_biologist", or for short, the
"magical number".

So let's at least start off by writing a template:

def calculate_magical_number(codon):
    """calculate_magical_number: string -> number

       Calculates some magical number given a single codon.
    """
    ## ... [fill me in]

Let's leave it as that for the moment.

> However, I want them to be read 3 base pairs at a time (to analyse them
> codon by codon) and find the value that each codon has a value assigned
> to it. An example of this is below:
> ** If the three base pairs were UUU the value assigned to it (from the
> codon value table) would be 0.296

Ok, that sounds like we can make a few test examples.  Let's imagine for
the moment that we did have this program.  We'd like to be able to say

    calculate_magical_number("UUU") == 0.296

Let's keep that test case in mind.
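
As a rough sketch of where that test case might lead, assuming the codon
value table ends up as a plain dictionary: the name CODON_VALUES is made up
for illustration, and it holds only the single UUU value quoted above.  The
rest would have to come from the real table.

```python
# Hypothetical lookup table.  Only the UUU entry comes from the question;
# every other codon would be filled in from the real codon value table.
CODON_VALUES = {
    "UUU": 0.296,
}


def calculate_magical_number(codon):
    """calculate_magical_number: string -> number

    Looks up the value assigned to a single uppercase codon.
    Raises KeyError if the codon is not in the table.
    """
    return CODON_VALUES[codon]


# The test case from above, written as an assertion.
assert calculate_magical_number("UUU") == 0.296
```

If the assertion passes silently, the one known test case holds.  More
entries and more assertions can be added one at a time.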

Can we think of other simple test examples?  Write them down.

Would you be able to write calculate_magical_number() at this point?  Can
you think of a better name than 'calculate_magical_number'?

> However, to make things even more complicated, the notebook sequences
> are in lowercase and the codon value table is in uppercase, so the
> sequences need to be converted into uppercase.

So simplify the problem.  Close your eyes and ignore the other
requirements entirely.  *grin*

But seriously, don't look at those requirements at all yet. Fix the input
to something simple.  If you can't get calculate_magical_number() working,
there's really no point.  Put in a positive light, if you can get
calculate_magical_number() right, you're much closer to success.

The simple subproblem above completely ignores the fact that you want to
do this on whole sequences, not just single codons, but that's ok: you can
tackle that problem later.  And you can always write helper functions to
help sanitize the ugly, messy input and turn it into something that
calculate_magical_number can handle.
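
When the time comes, such helpers might look something like the sketch
below.  The names clean_sequence and split_into_codons are hypothetical, and
dropping a trailing incomplete codon is an assumption on my part, not
something the original question specifies.

```python
def clean_sequence(raw):
    """clean_sequence: string -> string

    Removes all whitespace (including newlines) and uppercases the rest.
    """
    return "".join(raw.split()).upper()


def split_into_codons(sequence):
    """split_into_codons: string -> list of strings

    Cuts the sequence into consecutive chunks of three letters.
    Assumption: leftover letters at the end (fewer than three) are dropped.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


# Small test cases, in the same spirit as the single-codon one.
assert clean_sequence("atg gct\naaa") == "ATGGCTAAA"
assert split_into_codons("ATGGCTAAACT") == ["ATG", "GCT", "AAA"]
```

Each helper can be tested on its own before it is combined with the
single-codon function.  Note that the sketch does nothing about the sample
data using "t" where the table example uses "U"; that would be one more
small helper to write and test later.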

The point is: you must try to write your program iteratively:  get a
simple, correct subprogram working and tested.  Once you have this, you
can then start adding more features and helper functions to capture more
of the real problem.  You end up making progress at all steps, rather than
crossing your fingers and hoping that it all just fits together at the end.

But if you try building it all at once, without any kind of scaffolding or
minimal testing, you're bound to have the whole structure fall apart, and
you probably won't have a good sense of what part is breaking.

> I've tried various ways of doing this but keep coming unstuck along the
> way. Has anyone got any suggestions for how they would tackle this
> problem?

The recommendations presented here come from material in:


If you have time to read through it, I'd highly recommend it.  It's not
Python, but it does talk about program design.

Good luck to you.
