Need feedback on ORF-extracting code

Thu Aug 13 02:02:45 EDT 2009

gb345 wrote:
> 
> Hi everyone.  I'm relatively new to Python, and could use your
> feedback on the code below.
> 
> First some nomenclature.  A "nucleotide" is one of A, C, G, T, or
> U.  (In practice, a sequence of such nucleotides never contains
> both T and U, but this fact is not important in what follows.)  A
> "codon" is a sequence of 3 of these.  A "stop codon" is any one of
> UAA, UAG, UGA, TAA, TAG, or TGA.  Two codons are said to be "in
> frame" in a containing sequence of nucleotides if their positions
> differ by a multiple of 3.  An "open reading frame," or ORF, is
> defined as a maximal subsequence of nucleotides whose length is a
> multiple of 3, begins with either AUG or ATG, terminates right
> before a stop codon in the original sequence, and contains no stop
> codons that are in frame with the initial codon (AUG or ATG).
> 
> The fact that ORFs have lengths that are multiples of 3 means that
> there are three possible "registers" for ORFs (depending the modulo
> 3 of their starting positions), and that ORFs in different registers
> can overlap.  I'll refer to these registers as 0, 1, and 2, because
> they contain ORFs that are in frame with the codons at positions
> 0, 1, and 2 of the original sequence, respectively.
> 
> In the code below, routine extract_orfs takes as input a string,
> assumed to be a sequence of nucleotides, and returns a tuple of
> tuples, describing ORFs.  These ORFs can overlap.
> 
> The helper routine _xo takes as input a string (a sequence of
> nucleotides), and an offset, and returns a tuple of tuples, again
> representing ORFs, but this time all in the same register, since
> they are all in frame with the position passed as the second argument
> to the function.
> 
> I would appreciate your comments on this code.  I feel that it is
> not as clear as it could be, but I don't know how to make it any
> clearer.
> 
> (NOTE: I realize that, in all likelihood, publicly available Python
> code already exists for this.  At the moment I'm more interested
> in improving my Python skills.)
> 
> 
> Many thanks in advance!
> 
> Gabe
> 
> 
> 
> # BEGINNING OF CODE
> import sys
> import re
> 
> _start = r'A[TU]G'
> _stop = r'(?:[TU]A[AG]|[TU]GA)'
> _nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
> _codon = r'(?:[TUCAG]{3})'
> _orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
>                      _nonstop + '*)(' + _stop + ')', flags=re.I)
> _lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)
> 
> def _xo(seq, pos):
>     """
>     Helper routine that finds all the non-overlapping in-frame orfs
>     starting from a specific position in the input sequence.
> 
>     input: a string of letters in the set 'tucagTUCAG', and a starting
>            position;
>     output: a tuple of tuples; each internal tuple consists of a
>             starting position, an orf, and the stop codon that
>             terminates it.
>     """
>     
>     ret = []
>     while True:
>         m = _orf_re.match(seq, pos)
>         if not m:
>             break
>         orf = m.group(2)
>         stop = m.group(3)
>         assert len(orf) % 3 == 0

I am not currently familiar with the re module so I cannot comment in 
detail. The assert should be a claim that the re should *never* match 
anything other than a multiple of 3, so that it is a program bug if it 
ever do so.  If this is not true, you should use an if statement.

tjr

>         ret.append((m.start() + len(m.group(1)), orf, stop))
>         pos = m.end()
>     return ret
> 
> def extract_orfs(seq):
>     """
>     Extracts all (possibly overlapping) maximal open reading frames,
>     defined as sequences beginning with AUG (or ATG), ending with an
>     in-frame stop codon, and containing no in-frame stop codons
>     in-between.
> 
>     input: a string of letters in the set 'tucagTUCAG';
>     output: a tuple of tuples; each internal tuple consists of a
>             starting position, an orf, and the stop codon that
>             terminates it.
>     """
> 
>     m = _lead_re.match(seq)
>     if not m:
>         return ()
>     pos = m.start()
>     ret = []
> 
>     for i in range(min(3, len(seq))):
>         ret.extend(_xo(seq, pos + i))
>     ret.sort()
>     return tuple(ret)
> 
> # END OF CODE