Wildcard String Comparisons: Set Pattern to a Wildcard Source

Tue Oct 5 16:22:38 EDT 2010

On 10/05/10 15:06, chaoticcranium at gmail.com wrote:
> On Oct 5, 3:38 pm, MRAB<pyt... at mrabarnett.plus.com>  wrote:
>> On 05/10/2010 20:03, chaoticcran... at gmail.com wrote:
>>
>>> So, I have a rather tricky string comparison problem: I want to search
>>> for a set pattern in a variable source.
>>
>>> To give you the context, I am searching for set primer sequences
>>> within a variable gene sequence. In addition to the non-degenerate
>>> A/G/C/T, the gene sequence could have degenerate bases that each stand
>>> for more than one base (for example, R means A or G; N means A, G, C,
>>> or T). One brute-force way to do it would be to generate every
>>> non-degenerate sequence the degenerate sequence could represent and
>>> compare against all of those, but that would of course be very
>>> inefficient in both space and time.
>>
>>> For the sake of simplicity, let's say I replace each degenerate base
>>> with a single wildcard character "?". We can do this because there are
>>> so many more non-degenerate bases that the probability of a degenerate
>>> mismatch is low if the non-degenerate bases in a primer match up.
>>
>>> So, my goal is to search for a small, set pattern (the primer) inside
>>> a large source with single wildcard characters (my degenerate gene).
>>
>>> The first thing that comes to my mind is regular expressions, but I'm
>>> rather n00bish when it comes to using them and I've only been able to
>>> find help online where the smaller search pattern has wildcards and
>>> the source is constant, such as here:
>>> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...
>>
>>> Of course, that's the reverse of my situation and the proposed
>>> solutions there won't work for me. So, could you help me out, oh great
>>> Python masters? *bows*
>>
>> Stand back, I'm going to try regex. :-)
>>
>> Both "A" and "R" in the variable sequence should match "A" in the
>> primer sequence, so "A" in the primer sequence should be replaced by
>> the character set "[AR]". The other bases should be replaced similarly.
>>
>> Use a simple dict lookup:
>>
>> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
>>
>> and create the regex for the primer sequence:
>>
>> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
>>
>> Would that work?
>
>
> Thank you for your response, MRAB.
>
> That's a rather clever way to do this sort of matching, but I actually
> forgot one other crucial thing in my problem description (and I'm
> hitting myself on the head for forgetting it!) - I need to know at
> what position in my gene the primer was found.

If you use the primer_pattern.search() method (which scans the
string for a match at any offset) instead of .match() (which only
matches at the beginning of the string), it should return a match
object that has a .start() method to let you know the offset:

   m = primer_pattern.search(my_data)
   if m is None:
     print("Not found")
   else:
     # m.start() is the 0-based offset of the match in my_data
     print("Found at %i" % m.start())
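
If you need every position rather than just the first, here is a
rough sketch that puts MRAB's dict and the search together. The
function name find_primer and the sample strings are made up for
illustration, and the table only covers R and N, as in MRAB's
message. Wrapping the pattern in a lookahead is my own addition so
that overlapping hits are reported too; plain finditer() would skip
them.

```python
import re

# Mapping from MRAB's message: each primer base also matches the
# degenerate codes that can stand for it (only R and N are covered).
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

def find_primer(primer, gene):
    """Return the 0-based offsets in gene where primer could match."""
    body = "".join(wildcards[c] for c in primer)
    # The lookahead makes each match zero-width, so finditer() tries
    # every offset and overlapping hits are not skipped.
    pattern = re.compile("(?=%s)" % body)
    return [m.start() for m in pattern.finditer(gene)]

# Made-up example data, not real sequences.
gene = "TTAGRCNTAGGC"
primer = "AGGC"
print(find_primer(primer, gene))  # prints [2, 8]
```

The hit at offset 2 is "AGRC", where the R in the gene is accepted
for the second G of the primer; the hit at offset 8 is an exact match.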

-tkc