Need feedback on ORF-extracting code

Wed Aug 12 18:57:33 EDT 2009

Hi everyone.  I'm relatively new to Python, and could use your
feedback on the code below.

First some nomenclature.  A "nucleotide" is one of A, C, G, T, or
U.  (In practice, a sequence of such nucleotides never contains
both T and U, but this fact is not important in what follows.)  A
"codon" is a sequence of 3 of these.  A "stop codon" is any one of
UAA, UAG, UGA, TAA, TAG, or TGA.  Two codons are said to be "in
frame" in a containing sequence of nucleotides if their positions
differ by a multiple of 3.  An "open reading frame," or ORF, is
defined as a maximal subsequence of nucleotides whose length is a
multiple of 3, begins with either AUG or ATG, terminates right
before a stop codon in the original sequence, and contains no stop
codons that are in frame with the initial codon (AUG or ATG).

The fact that ORFs have lengths that are multiples of 3 means that
there are three possible "registers" for ORFs (depending the modulo
3 of their starting positions), and that ORFs in different registers
can overlap.  I'll refer to these registers as 0, 1, and 2, because
they contain ORFs that are in frame with the codons at positions
0, 1, and 2 of the original sequence, respectively.

In the code below, routine extract_orfs takes as input a string,
assumed to be a sequence of nucleotides, and returns a tuple of
tuples, describing ORFs.  These ORFs can overlap.

The helper routine _xo takes as input a string (a sequence of
nucleotides), and an offset, and returns a tuple of tuples, again
representing ORFs, but this time all in the same register, since
they are all in frame with the position passed as the second argument
to the function.

I would appreciate your comments on this code.  I feel that it is
not as clear as it could be, but I don't know how to make it any
clearer.

(NOTE: I realize that, in all likelihood, publicly available Python
code already exists for this.  At the moment I'm more interested
in improving my Python skills.)

Many thanks in advance!

Gabe

# BEGINNING OF CODE
import sys
import re

_start = r'A[TU]G'
_stop = r'(?:[TU]A[AG]|[TU]GA)'
_nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
_codon = r'(?:[TUCAG]{3})'
_orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
                     _nonstop + '*)(' + _stop + ')', flags=re.I)
_lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

def _xo(seq, pos):
    """
    Helper routine that finds all the non-overlapping in-frame orfs
    starting from a specific position in the input sequence.

    input: a string of letters in the set 'tucagTUCAG', and a starting
           position;
    output: a tuple of tuples; each internal tuple consists of a
            starting position, an orf, and the stop codon that
            terminates it.
    """

    ret = []
    while True:
        m = _orf_re.match(seq, pos)
        if not m:
            break
        orf = m.group(2)
        stop = m.group(3)
        assert len(orf) % 3 == 0
        ret.append((m.start() + len(m.group(1)), orf, stop))
        pos = m.end()
    return ret

def extract_orfs(seq):
    """
    Extracts all (possibly overlapping) maximal open reading frames,
    defined as sequences beginning with AUG (or ATG), ending with an
    in-frame stop codon, and containing no in-frame stop codons
    in-between.

    input: a string of letters in the set 'tucagTUCAG';
    output: a tuple of tuples; each internal tuple consists of a
            starting position, an orf, and the stop codon that
            terminates it.
    """

    m = _lead_re.match(seq)
    if not m:
        return ()
    pos = m.start()
    ret = []

    for i in range(min(3, len(seq))):
        ret.extend(_xo(seq, pos + i))
    ret.sort()
    return tuple(ret)

# END OF CODE