Need feedback on ORF-extracting code
Terry Reedy
tjreedy at udel.edu
Thu Aug 13 02:02:45 EDT 2009
gb345 wrote:
>
> Hi everyone. I'm relatively new to Python, and could use your
> feedback on the code below.
>
> First some nomenclature. A "nucleotide" is one of A, C, G, T, or
> U. (In practice, a sequence of such nucleotides never contains
> both T and U, but this fact is not important in what follows.) A
> "codon" is a sequence of 3 of these. A "stop codon" is any one of
> UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
> frame" in a containing sequence of nucleotides if their positions
> differ by a multiple of 3. An "open reading frame," or ORF, is
> defined as a maximal subsequence of nucleotides whose length is a
> multiple of 3, begins with either AUG or ATG, terminates right
> before a stop codon in the original sequence, and contains no stop
> codons that are in frame with the initial codon (AUG or ATG).
>
> The fact that ORFs have lengths that are multiples of 3 means that
> there are three possible "registers" for ORFs (depending the modulo
> 3 of their starting positions), and that ORFs in different registers
> can overlap. I'll refer to these registers as 0, 1, and 2, because
> they contain ORFs that are in frame with the codons at positions
> 0, 1, and 2 of the original sequence, respectively.
>
> In the code below, routine extract_orfs takes as input a string,
> assumed to be a sequence of nucleotides, and returns a tuple of
> tuples, describing ORFs. These ORFs can overlap.
>
> The helper routine _xo takes as input a string (a sequence of
> nucleotides), and an offset, and returns a tuple of tuples, again
> representing ORFs, but this time all in the same register, since
> they are all in frame with the position passed as the second argument
> to the function.
>
> I would appreciate your comments on this code. I feel that it is
> not as clear as it could be, but I don't know how to make it any
> clearer.
>
> (NOTE: I realize that, in all likelihood, publicly available Python
> code already exists for this. At the moment I'm more interested
> in improving my Python skills.)
>
>
> Many thanks in advance!
>
> Gabe
>
>
>
> # BEGINNING OF CODE
> import sys
> import re
>
> _start = r'A[TU]G'
> _stop = r'(?:[TU]A[AG]|[TU]GA)'
> _nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
> _codon = r'(?:[TUCAG]{3})'
> _orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
> _nonstop + '*)(' + _stop + ')', flags=re.I)
> _lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)
>
> def _xo(seq, pos):
> """
> Helper routine that finds all the non-overlapping in-frame orfs
> starting from a specific position in the input sequence.
>
> input: a string of letters in the set 'tucagTUCAG', and a starting
> position;
> output: a tuple of tuples; each internal tuple consists of a
> starting position, an orf, and the stop codon that
> terminates it.
> """
>
> ret = []
> while True:
> m = _orf_re.match(seq, pos)
> if not m:
> break
> orf = m.group(2)
> stop = m.group(3)
> assert len(orf) % 3 == 0
I am not currently familiar with the re module so I cannot comment in
detail. The assert should be a claim that the re should *never* match
anything other than a multiple of 3, so that it is a program bug if it
ever do so. If this is not true, you should use an if statement.
tjr
> ret.append((m.start() + len(m.group(1)), orf, stop))
> pos = m.end()
> return ret
>
> def extract_orfs(seq):
> """
> Extracts all (possibly overlapping) maximal open reading frames,
> defined as sequences beginning with AUG (or ATG), ending with an
> in-frame stop codon, and containing no in-frame stop codons
> in-between.
>
> input: a string of letters in the set 'tucagTUCAG';
> output: a tuple of tuples; each internal tuple consists of a
> starting position, an orf, and the stop codon that
> terminates it.
> """
>
> m = _lead_re.match(seq)
> if not m:
> return ()
> pos = m.start()
> ret = []
>
> for i in range(min(3, len(seq))):
> ret.extend(_xo(seq, pos + i))
> ret.sort()
> return tuple(ret)
>
> # END OF CODE
More information about the Python-list
mailing list