[Tutor] Splitting text

Terry Carroll carroll at tjc.com
Thu Jun 29 23:30:25 CEST 2006


On Thu, 29 Jun 2006, Apparao Anakapalli wrote:

> pattern = 'ATTTA'
> 
> I want to find the pattern in the sequence and count.
> 
> For instance in 'a' there are two 'ATTTA's. 

Use re.findall:

>>> import re
>>> pat = "ATTTA"
>>> rexp=re.compile(pat)
>>> a = "TCCCTGCGGCGCATGAGTGACTGGCGTATTTAGCCCGTCACATTTA"
>>> print(len(re.findall(rexp,a)))
2
>>> b = "CCTGCGGCGCATGAGTGACTGGCGTATTTAGCCCGTCACAATTTAA"
>>> print(len(re.findall(rexp,b)))
2

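If the pattern is a fixed string like this one, with no regular-expression metacharacters, the built-in str.count method should give the same non-overlapping count without the re module. A minimal sketch, assuming that is your case (the names count_a and count_b are just for illustration):

```python
# Assumption: the pattern is a plain literal string, not a regex.
pat = "ATTTA"
a = "TCCCTGCGGCGCATGAGTGACTGGCGTATTTAGCCCGTCACATTTA"
b = "CCTGCGGCGCATGAGTGACTGGCGTATTTAGCCCGTCACAATTTAA"

# str.count counts non-overlapping occurrences of the substring.
count_a = a.count(pat)
count_b = b.count(pat)

print(count_a)
print(count_b)
```

Like findall, str.count does not count overlapping occurrences.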

Be aware, though, that findall finds only non-overlapping occurrences. If 
overlapping occurrences are important to you, it will undercount:

>>> c = "ATTTATTTA"
>>> print(len(re.findall(rexp,c)))
1

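To see why only one match is reported, it can help to look at where the match falls. This small sketch uses finditer to list the (start, end) span of each match; the name spans is just for illustration:

```python
import re

rexp = re.compile("ATTTA")
c = "ATTTATTTA"

# Each match object reports the (start, end) slice it covers.
spans = [m.span() for m in rexp.finditer(c)]

print(spans)
```

The only span is (0, 5). The scan resumes at index 5, past the "A" at index 4 where the second "ATTTA" begins, so that occurrence is never seen.
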
The following function will count all occurrences, even if they overlap:

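def findall_overlap(regex, seq):
   """Return every match of the compiled regex in seq, overlaps included."""
   resultlist=[]
   pos=0

   while True:
      # search for the next match starting at or after pos
      result = regex.search(seq, pos)
      if result is None:
         break
      resultlist.append(result.group())
      # resume one character past the start of this match, not past
      # its end, so that overlapping matches are found too
      pos = result.start()+1
   return resultlist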

For example:

>>> print(len(findall_overlap(rexp,c)))
2

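Another way to get overlapping matches, offered only as an alternative sketch, is to wrap the pattern in a zero-width lookahead with a capturing group. The lookahead consumes no characters, so findall tries every position in turn. The name overlap_rexp is just for illustration, and this assumes the pattern is a valid regular expression on its own:

```python
import re

c = "ATTTATTTA"

# (?=...) is a lookahead: it matches at a position without consuming
# any text. The inner group captures what the pattern would match.
overlap_rexp = re.compile("(?=(ATTTA))")

# With one capturing group, findall returns the group's text per match.
matches = overlap_rexp.findall(c)

print(len(matches))
```

Here matches holds the two overlapping 'ATTTA' strings, the same count the explicit loop produces.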


