[Tutor] Finding all locations of a sequence

Danny Yoo dyoo at cs.wpi.edu
Mon Jun 18 18:17:37 CEST 2007



> Ok, what I have is a RNA sequence (about 5 million nucleotides 
> [characters] long) and have (4100) subsequences (from another sequence) 
> and the sub-sequences are 6 characters long, that I want to find in it.

Hi Lauren,

I am positive that this problem has been tackled before.  Have you talked 
to any of your fellow bioinformaticists?  It might not be effective to ask 
on Python-tutor because, for typical problems, regexes are sufficient. 
For your particular scenario, regexes may not be, so you may want to ask 
specialists like those on the Biopython mailing lists:

     http://biopython.org/wiki/Mailing_lists

It may seem like a simple extension of regular expression search, but as 
you may have noticed, feeding 4100 regex searches across that RNA sequence 
is going to take some time.  And trying to feed all of those subsequences 
as a single regular expression (using ORs) is probably not going to work 
too well either: the engine has limitations on how large the pattern can 
be before it starts behaving very badly.


To scale to this problem, we need to do something different.  What you're 
probably looking for is more typically known as the keyword matching 
problem, and there are algorithms specifically used for keyword matching. 
For example, take a look at Nicholas Lehuen's Pytst package, which 
implements ternary search trees:

     http://nicolas.lehuen.com/index.php/category/Pytst

I've also written a different package that uses the "Aho-Corasick" 
automata matcher, but to tell the truth I've let the code stale a bit, and 
can't support it (too busy with other work).  But if you're interested in 
tinkering with it, it's here: 
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/.


If you're going to do more string-matching stuff, I'd recommend a look 
into "Algorithms on Strings, Trees, and Sequences".

     http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198

It's one of the better books on bioinformatics string algorithms, and it 
specifically covers this class of sequence-searching problems.


More information about the Tutor mailing list