Wildcard String Comparisons: Set Pattern to a Wildcard Source

Tue Oct 5 16:55:24 EDT 2010

"chaoticcranium at gmail.com" <chaoticcranium at gmail.com> writes:

> So, I have a rather tricky string comparison problem: I want to search
> for a set pattern in a variable source.
>
> To give you the context, I am searching for set primer sequences
> within a variable gene sequence. In addition to the non-degenerate A/G/
> C/T, the gene sequence could have degenerate bases that could encode
> for more than one base (for example, R means A or G, N means A or G or
> C or T). One brute force way to do it would be to generate every
> single non-degenerate sequence the degenerate sequence could mean and
> do my comparison with all of those, but that would of course be very
> space and time inefficient.
>
> For the sake of simplicity, let's say I replace each degenerate base
> with a single wildcard character "?". We can do this because there are
> so many more non-degenerate bases that the probability of a degenerate
> mismatch is low if the nondegenerates in a primer match up.
>
> So, my goal is to search for a small, set pattern (the primer) inside
> a large source with single wildcard characters (my degenerate gene).
>
> The first thing that comes to my mind are regular expressions, but I'm
> rather n00bish when it comes to using them and I've only been able to
> find help online where the smaller search pattern has wildcards and
> the source is constant, such as here:
> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html
>
> Of course, that's the reverse of my situation and the proposed
> solutions there won't work for me. So, could you help me out, oh great
> Python masters? *bows*

Knuth-Morris-Pratt to the rescue! This is a well known algorithm for
searching for occurences of a substring in a text. Just change the
search algorithm slightly so that it takes wildcards in the text to
search into account.  Here is a quick-and-dirty implementation in
Python:

--------------------------------------------------
def build_table(word):
    """build the Knuth-Morris-Pratt partial match table"""
    table = [-1, 0]
    wpos = 0
    while len(table) < len(word):
        if word[len(table) - 1] == word[wpos]:
            wpos += 1
            table.append(wpos)
        elif wpos > 0:
            wpos = table[wpos]
        else:
            table.append(0)
    return table

def search(word, text, wildcard="?"):
    """Perform Knuth-Morris-Pratt search with wildcard for text"""
    table = build_table(word)
    ti, wi = 0, 0
    while ti <= len(text) - len(word):
        c = text[ti + wi]
        if c == wildcard or c == word[wi]:
            if wi + 1 == len(word):
                return ti
            wi += 1
        else:
            ti += wi - table[wi]
            if wi:
                wi = table[wi]
    return None

examples = [
    ("spam", "sa?ps?a?mp"),
    ("ababc", "??aba?ba??b??a")
    ]

for word, text in examples:
    i = search(word, text)
    found = text[i:i+len(word)]
    print "found [%s] in [%s] at %s: %s" % (word, text, i, found)
--------------------------------------------------

Output:

found [spam] in [sa?ps?a?mp] at 4: s?a?
found [ababc] in [??aba?ba??b??a] at 7: a??b?

You could even change the line:

    if c == wildcard or c == word[wi]

to something that checks that e.g. when c == "N", word[wi] == "A" or
"G", etc...

Knuth-Morris-Pratt is such that the main loop in "search()" is
guaranteed to run no more than 2*n times, where n is the length of
"text".

OTOH Python is not the best language to implement this.  If it was me,
I'd do it in C.  It wouldn't take long at all and it would probably be a
lot faster (maybe 100x, that's a pure guess).

Warning: I didn't really check my code more than the provided
examples...

HTH

-- 
Arnaud