[Tutor] Finding all locations of a sequence
Danny Yoo
dyoo at cs.wpi.edu
Mon Jun 18 18:17:37 CEST 2007
> Ok, what I have is a RNA sequence (about 5 million nucleotides
> [characters] long) and have (4100) subsequences (from another sequence)
> and the sub-sequences are 6 characters long, that I want to find in it.
Hi Lauren,
I am positive that this problem has been tackled before. Have you talked
to any of your fellow bioinformaticists? It might not be effective to ask
on Python-tutor because, for typical problems, regexes are sufficient.
For your particular scenario, regexes may not be, so you may want to ask
specialists like those on the Biopython mailing lists:
http://biopython.org/wiki/Mailing_lists
It may seem like a simple extension of regular expression search, but as
you may have noticed, feeding 4100 regex searches across that RNA sequence
is going to take some time. And trying to feed all of those subsequences
as a single regular expression (using ORs) is probably not going to work
too well either: the engine has limitations on how large the pattern can
be before it starts behaving very badly.
To scale to this problem, we need to do something different. What you're
probably looking for is more typically known as the keyword matching
problem, and there are algorithms specifically used for keyword matching.
For example, take a look at Nicholas Lehuen's Pytst package, which
implements ternary search trees:
http://nicolas.lehuen.com/index.php/category/Pytst
I've also written a different package that uses the "Aho-Corasick"
automata matcher, but to tell the truth I've let the code stale a bit, and
can't support it (too busy with other work). But if you're interested in
tinkering with it, it's here:
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/.
If you're going to do more string-matching stuff, I'd recommend a look
into "Algorithms on Strings, Trees, and Sequences".
http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198
It's one of the better books on bioinformatics string algorithms, and it
specifically covers this class of sequence-searching problems.
More information about the Tutor
mailing list