[Tutor] Finding all locations of a sequence
Alan Gauld
alan.gauld at btinternet.com
Fri Jun 15 01:08:27 CEST 2007
"Lauren" <laurenb01 at gmail.com> wrote
Caveat: I am not into the realms of DNA sequencing so this may
not be viable but...
> Say I have chicken and I want to know where it occurs in a string of
> words, but I want it to match to both chicken and poultry and have
> the output of:
>
> chicken (locations of chicken and poultry in the string)
When searching for more than one pattern at a time I'd go
for a regex. A simple string search is faster on its own but
a single regex search will typically be faster than a repeated
string search.
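For comparison, here is a minimal sketch of the repeated string search approach. The helper name find_all and the sample string are my own, used only for illustration. Note that it walks the whole string once per search word:

```python
def find_all(text, word):
    """Return every start index of word in text, using plain str.find."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Resume one character past the last hit.
        start = text.find(word, start + 1)
    return positions

s = "there are a lot of chickens in my poultry farm"

# One full pass over the string for each word searched.
hits = {word: find_all(s, word) for word in ("chicken", "poultry")}
print(hits)
```

With two words that is two passes; with dozens of words the cost grows accordingly, which is where a single regex pass is likely to win.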
For the simple case above a search for (chicken)|(poultry)
should work:
>>> import re
>>> s = ''' there are a lot of chickens in my poultry farm but
... very few could be called a spring chicken'''
>>> regex = '(chicken)|(poultry)'
>>> r = re.compile(regex)
>>> r.findall(s)
[('chicken', ''), ('', 'poultry'), ('chicken', '')]
>>> [match for match in r.finditer(s)]
[<_sre.SRE_Match object at 0x01E75920>, <_sre.SRE_Match object at
0x01E758D8>, <_sre.SRE_Match object at 0x01E75968>]
>>>
The match objects will let you find the location in the original
string, which I suspect you will need: each one has start(), end()
and span() methods that return the matched positions.
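As a rough sketch of how the locations could be pulled out and reported under the single label "chicken", assuming the same sample string as the session above (the variable names pattern and locations are just for illustration):

```python
import re

# Same text as the session above, including the leading space.
s = (" there are a lot of chickens in my poultry farm but\n"
     "very few could be called a spring chicken")

pattern = re.compile("chicken|poultry")

# m.group() is the text that matched, m.start() the index where it begins.
locations = [(m.group(), m.start()) for m in pattern.finditer(s)]

# Report every hit under the one label, whichever word actually matched.
print("chicken", [start for _, start in locations])
```

The groups from the original pattern are dropped here because m.group() already says which word matched. Keep them if you want to test m.group(1) and m.group(2) separately.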
> The string I'm dealing with is really large, so whatever will get
> through it the fastest is ideal for me.
Again, I expect a regex to be fastest for multiple search criteria
over a single pass. Now, what your regex will look like for
R/DNA sequences I have no idea, but if you can describe it I'm
sure somebody here can help formulate a suitable pattern.
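If the things you are searching for turn out to be plain literal sequences, one possible starting point is to build the alternation from a list. This is only a sketch: the helper build_pattern, the motifs and the sample sequence are all made up for illustration and are not real data.

```python
import re

def build_pattern(sequences):
    """Compile one alternation regex from a list of literal strings."""
    # Longest first, so a short sequence cannot shadow a longer one
    # that begins with the same characters.
    ordered = sorted(sequences, key=len, reverse=True)
    return re.compile("|".join(re.escape(seq) for seq in ordered))

# Hypothetical motifs and sequence, invented for this example.
motifs = ["GATTACA", "TATA"]
dna = "CCTATAGGGATTACATT"

pattern = build_pattern(motifs)
hits = [(m.group(), m.start()) for m in pattern.finditer(dna)]
print(hits)
```

One thing to watch: finditer only reports non-overlapping matches, so if your sequences can overlap each other in the string this will miss some of them and the pattern would need rethinking.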
--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld.