Wildcard String Comparisons: Set Pattern to a Wildcard Source

Tue Oct 5 16:23:56 EDT 2010

On 05/10/2010 21:06, chaoticcranium at gmail.com wrote:
> On Oct 5, 3:38 pm, MRAB<pyt... at mrabarnett.plus.com>  wrote:
>> On 05/10/2010 20:03, chaoticcran... at gmail.com wrote:
>>
>>> So, I have a rather tricky string comparison problem: I want to search
>>> for a set pattern in a variable source.
>>
>>> To give you the context, I am searching for set primer sequences
>>> within a variable gene sequence. In addition to the non-degenerate A/G/
>>> C/T, the gene sequence could have degenerate bases that could encode
>>> for more than one base (for example, R means A or G, N means A or G or
>>> C or T). One brute force way to do it would be to generate every
>>> single non-degenerate sequence the degenerate sequence could mean and
>>> do my comparison with all of those, but that would of course be very
>>> space and time inefficient.
>>
>>> For the sake of simplicity, let's say I replace each degenerate base
>>> with a single wildcard character "?". We can do this because there are
>>> so many more non-degenerate bases that the probability of a degenerate
>>> mismatch is low if the nondegenerates in a primer match up.
>>
>>> So, my goal is to search for a small, set pattern (the primer) inside
>>> a large source with single wildcard characters (my degenerate gene).
>>
>>> The first thing that comes to my mind is regular expressions, but I'm
>>> rather n00bish when it comes to using them and I've only been able to
>>> find help online where the smaller search pattern has wildcards and
>>> the source is constant, such as here:
>>> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...
>>
>>> Of course, that's the reverse of my situation and the proposed
>>> solutions there won't work for me. So, could you help me out, oh great
>>> Python masters? *bows*
>>
>> Stand back, I'm going to try regex. :-)
>>
>> Both "A" and "R" in the variable sequence should match "A" in the
>> primer sequence, so "A" in the primer sequence should be replaced by
>> the character set "[AR]". The other bases should be replaced similarly.
>>
>> Use a simple dict lookup:
>>
>> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
>>
>> and create the regex for the primer sequence (this needs "import re"):
>>
>> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
>>
>> Would that work?
>
>
> Thank you for your response, MRAB.
>
> That's a rather clever way to do this sort of matching, but I actually
> forgot one other crucial thing in my problem description (and I'm
> hitting myself on the head for forgetting it!) - I need to know at
> what position in my gene the primer was found.
>
> As far as I know (and I'm a regex n00b, so please tell me if I'm
> wrong), you can't use string's find() on a regex and regex's match()
> does not return a position in the regex. I understand there are
> elements of regular expressions that expand to variable numbers of
> characters so a "position number" in a regular expression is often a
> meaningless concept. Here, however, my regular expression has a 1 to 1
> correspondence since each degenerate base should occupy only one
> wildcard slot. In this particular case, a position number is
> meaningful AND I need to know it for my program.
>
> Now. . .is there anything we can do about that?

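A successful search (the compiled pattern's .search() method, rather
than .match(), which only tries at the start of the string) returns a
match object; an unsuccessful one returns None. The match object has
methods including .start(), which returns the start position of the
match in the string that was searched. It's all in the documentation
for the re module.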
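
Putting the pieces of this thread together, here is a minimal sketch, not a definitive implementation. It reuses the wildcards dict and primer_pattern idea from earlier in the thread; the function names find_primer and find_all_primer_positions and the sample sequences are hypothetical, made up for illustration. The dict only covers R and N, so any other degenerate code would have to be added to it. The second function wraps the pattern in a lookahead, which is an assumption on my part about what is wanted: it reports overlapping matches too, which a plain finditer() would skip.

```python
import re

# Each primer base maps to the set of gene characters allowed to match it.
# Only the degenerate codes R (A or G) and N (any base) are handled here,
# as in the dict quoted above; other codes are deliberately left out.
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}


def find_primer(primer, gene):
    """Return the 0-based start of the first match of primer in gene, or None."""
    primer_pattern = re.compile("".join(wildcards[c] for c in primer))
    match = primer_pattern.search(gene)  # search() scans the whole string
    return match.start() if match else None


def find_all_primer_positions(primer, gene):
    """Return the start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets finditer() report matches that overlap.
    lookahead = re.compile("(?=" + "".join(wildcards[c] for c in primer) + ")")
    return [m.start() for m in lookahead.finditer(gene)]


if __name__ == "__main__":
    # Hypothetical sample data: R and N in the gene stand in for A and G.
    print(find_primer("ACGT", "TTRCNTAG"))
    print(find_all_primer_positions("AA", "ARNA"))
```

Since every character class in the pattern consumes exactly one character, the value from .start() is directly the index into the gene string.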