[Tutor] Finding all locations of a sequence (fwd)

Mon Jun 18 22:54:53 CEST 2007

Hi everyone,

Can someone help Lauren?  I apologize for this, but I am very time 
constrained at this moment, and won't be able to reply back for at least a 
few hours.  His question is below.  Thanks!

---------- Forwarded message ----------
Date: Mon, 18 Jun 2007 16:41:39 -0400
From: Lauren <laurenb01 at gmail.com>
To: Danny Yoo <dyoo at cs.wpi.edu>
Subject: Re: [Tutor] Finding all locations of a sequence

Ok, these may be useful. I do have a potentially embarrassing problem,
however I will preface this with I'm practically computer illiterate. Ok
after I download the one you wrote (which I think I understand better than
the one listed previous...famous last words, I'm sure), but when I ask to
import the ahocorasick module, it says it can't find it :( Is it possible to
get step by step instructions on how to open the module? Do I need something
other than the latest download for it?
Again, I'm not good at this programming thing, so I'm sure I'm missing
something obvious

Thank you for your help,

Lauren

On 18/06/07, Danny Yoo <dyoo at cs.wpi.edu> wrote:
>
>
>
>> Ok, what I have is a RNA sequence (about 5 million nucleotides
>> [characters] long) and have (4100) subsequences (from another sequence)
>> and the sub-sequences are 6 characters long, that I want to find in it.
>
> Hi Lauren,
>
> I am positive that this problem has been tackled before.  Have you talked
> to any of your fellow bioinformaticists?  It might not be effective to ask
> on Python-tutor because, for typical problems, regexes are sufficient.
> For your particular scenario, regexes may not be, so you may want to ask
> specialists like those on the Biopython mailing lists:
>
>      http://biopython.org/wiki/Mailing_lists
>
> It may seem like a simple extension of regular expression search, but as
> you may have noticed, feeding 4100 regex searches across that RNA sequence
> is going to take some time.  And trying to feed all of those subsequences
> as a single regular expression (using ORs) is probably not going to work
> too well either: the engine has limitations on how large the pattern can
> be before it starts behaving very badly.
>
>
> To scale to this problem, we need to do something different.  What you're
> probably looking for is more typically known as the keyword matching
> problem, and there are algorithms specifically used for keyword matching.
> For example, take a look at Nicholas Lehuen's Pytst package, which
> implements ternary search trees:
>
>      http://nicolas.lehuen.com/index.php/category/Pytst
>
> I've also written a different package that uses the "Aho-Corasick"
> automata matcher, but to tell the truth I've let the code stale a bit, and
> can't support it (too busy with other work).  But if you're interested in
> tinkering with it, it's here:
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/<http://hkn.eecs.berkeley.edu/%7Edyoo/python/ahocorasick/>
> .
>
>
> If you're going to do more string-matching stuff, I'd recommend a look
> into "Algorithms on Strings, Trees, and Sequences".
>
>      http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521585198
>
> It's one of the better books on bioinformatics string algorithms, and it
> specifically covers this class of sequence-searching problems.
>

-- 
Lauren

Laurenb01 at gmail.com