Match First Sequence in Regular Expression?

Dave Hansen iddw at hotmail.com
Thu Jan 26 12:01:02 EST 2006


On Thu, 26 Jan 2006 16:26:57 GMT in comp.lang.python, "Roger L.
Cauvin" <roger at deadspam.com> wrote:

>"Christos Georgiou" <tzot at sil-tec.gr> wrote in message 
>news:boqht19rs7946mtk5s64hqrieq44he5aq7 at 4ax.com...
[...]
>> Is this what you mean?
>>
>> ^[^a]*(a{3})(?:[^a].*)?$
>
>Close, but the pattern should allow "arbitrary sequence of characters" that 
>precede the alternating a's and b's to contain the letter 'a'.  In other 
>words, the pattern should accept:
>
>"xayz123aaabbab"
>
>since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
>

I don't know an RE is the best solution to this problem.  If I
understand the problem correctly, building a state machine to solve
this is trivial.  The following took about 5 minutes of coding:

---begin included file
# Define our states.  
#   state[0] is next state if character is 'a'
#   state[1] is next state if character is 'b'
#   state[2] is next state for any other character

# Accept state means we've found a match
Accept = []
for i in range(3):
    Accept.append(Accept)

# Reject state means the string cannot match
Reject = []
for i in range(3):
    Reject.append(Reject)

# Remaining states: Start, 'a' found, 'aa', 'aaa', and 'aaaa'
Start = [0,1,2]
a1 = [0,1,2]
a2 = [0,1,2]
a3 = [0,1,2]
a4 = [0,1,2]

# Start: looking for first 'a'
Start[0] = a1
Start[1] = Start
Start[2] = Start

# a1: 1 'a' found so far
a1[0] = a2
a1[1] = Reject
a1[2] = Start

# a2: 'aa' found
a2[0] = a3
a2[1] = Reject
a2[2] = Start

# a3: 'aaa' found
a3[0] = a4
a3[1] = Accept
a3[2] = Start

# a4: four or more 'a' in a row
a4[0] = a4
a4[1] = Reject
a4[2] = Start

def detect(s):
    """
    Return 1 if first substring aa*b has exactly 3 a's
    Return 0 otherwise
    """
    state = Start
    for c in s:
        if c == 'a':
            state = state[0]
        elif c == 'b':
            state = state[1]
        else:
            state = state[2]
        if state is Accept:
            return 1
    return 0

print detect("xyza123abc")
print detect("xyzaaa123aabc")
print detect("xyzaa123aaabc")
print detect("xyza123aaaabc")

--- end included file ---

And I'm pretty sure it does what you need, though it's pretty naive.
Note that if '3' isn't a magic number, states a1, a2, a3, and a4 could
be re-implemented as a single state with a counter, but the logic
inside detect gets a little hairier.

I haven't timed it, but it's not doing anything other than simple
comparisons and assignments.  It's a little (OK, a lot) more code than
a simple RE, but I know it works.

HTH,
                                        -=Dave

-- 
Change is inevitable, progress is not.



More information about the Python-list mailing list