[Tutor] Finding all locations of a sequence

Fri Jun 15 00:06:26 CEST 2007

On Thu, 14 Jun 2007, Lauren wrote:

> Subseq  AAAAAU can bind to UUUUUA (which is normal) and UUUUUG (not so
> normal) and I want to know where UUUUUA, and UUUUUG are in the large
> RNA sequence, and the locations to show up as one...thing.
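
The two targets can be derived from the subsequence rather than typed by hand. Here is a rough sketch of that idea. The pairing table is an assumption inferred from the example above (A binds U; U binds A or, less commonly, G); the C and G entries and the name binding_targets are hypothetical additions for illustration. It also assumes the targets are written in the same left-to-right order as in the question, with no reversal.

```python
from itertools import product

# Assumed pairing table, inferred from the example in the question:
# A binds U; U binds A (normal) or G (not so normal).
# The C and G entries are hypothetical extensions, not from the question.
PAIRS = {"A": "U", "U": "AG", "C": "G", "G": "CU"}

def binding_targets(subseq):
    """Return every string that can pair position-by-position with subseq."""
    # one string of allowed partner bases per position of subseq
    choices = [PAIRS[base] for base in subseq]
    # product() enumerates every combination of one partner per position
    return ["".join(combo) for combo in product(*choices)]

print(binding_targets("AAAAAU"))
```

The resulting list can be passed directly as the targets argument of a search function.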

How about something like this?

========================================================================

def seqsearch(seq, targets):
   """
   return a list of match objects, each of which identifies where any of
   the targets are found in the string seq
    seq: string to be searched
    targets: list or tuple of alternate targets to be searched

   note: re.findall is not used, because it won't catch overlapping matches
   """

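   import re
   resultlist=[]
   pos=0
   # escape each target so it is matched literally, then OR them together
   regex_text = "|".join(re.escape(t) for t in targets)
   regex = re.compile(regex_text)
   while True:
      # look for the next match at or after position pos
      result = regex.search(seq, pos)
      if result is None:
         break
      resultlist.append(result)
      # resume one character past the start of this match (not past its
      # end), so that a match overlapping this one is still found
      pos = result.start()+1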
   return resultlist

targets = ["UUUUUA", "UUUUUG"]
sequence="UUCAAUUUGATACCAUUUUUAGCUUCCGUUUUUGCGATACCAUUUUAGCGU"
#                        ++++++       ++++++
#         0         1         2         3         4         5
#         012345678901234567890123456789012345678901234567890
# note: matches at 15 & 28
matches = seqsearch(sequence, targets)
for m in matches:
   # parenthesized form runs under both Python 2 and Python 3
   print("match %s found at location %s" % (m.group(), m.start()))
========================================================================

This prints, as expected:

match UUUUUA found at location 15
match UUUUUG found at location 28
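
To see what the note about re.findall means in practice, here is a small sketch with test data made up so that two matches overlap. It also shows an alternative way to catch overlaps, using a lookahead group with finditer instead of the search loop; find_overlapping is a hypothetical name, and this is only one possible approach. One limitation: if two targets match at the same starting position, only the first alternative is reported.

```python
import re

def find_overlapping(seq, targets):
    """Return (start, matched_text) pairs for every occurrence of any
    target in seq, including occurrences that overlap one another."""
    # re.escape guards against targets containing regex metacharacters
    alternation = "|".join(re.escape(t) for t in targets)
    # a lookahead consumes no characters, so the scan moves forward one
    # position at a time and overlapping occurrences are all reported
    pattern = re.compile("(?=(%s))" % alternation)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]

# made-up data: UUUUUU at 0 overlaps UUUUUA at 1
seq = "UUUUUUA"
targets = ["UUUUUU", "UUUUUA"]

print(re.findall("|".join(targets), seq))   # ['UUUUUU']
print(find_overlapping(seq, targets))       # [(0, 'UUUUUU'), (1, 'UUUUUA')]
```

re.findall resumes scanning after the end of each match, so the second occurrence, which starts inside the first, is skipped.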