Match First Sequence in Regular Expression?
Dave Hansen
iddw at hotmail.com
Thu Jan 26 18:01:02 CET 2006
On Thu, 26 Jan 2006 16:26:57 GMT in comp.lang.python, "Roger L.
Cauvin" <roger at deadspam.com> wrote:
>"Christos Georgiou" <tzot at sil-tec.gr> wrote in message
>news:boqht19rs7946mtk5s64hqrieq44he5aq7 at 4ax.com...
[...]
>> Is this what you mean?
>>
>> ^[^a]*(a{3})(?:[^a].*)?$
>
>Close, but the pattern should allow "arbitrary sequence of characters" that
>precede the alternating a's and b's to contain the letter 'a'. In other
>words, the pattern should accept:
>
>"xayz123aaabbab"
>
>since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
>
I don't know an RE is the best solution to this problem. If I
understand the problem correctly, building a state machine to solve
this is trivial. The following took about 5 minutes of coding:
---begin included file
# Define our states.
# state[0] is next state if character is 'a'
# state[1] is next state if character is 'b'
# state[2] is next state for any other character
# Accept state means we've found a match
Accept = []
for i in range(3):
Accept.append(Accept)
# Reject state means the string cannot match
Reject = []
for i in range(3):
Reject.append(Reject)
# Remaining states: Start, 'a' found, 'aa', 'aaa', and 'aaaa'
Start = [0,1,2]
a1 = [0,1,2]
a2 = [0,1,2]
a3 = [0,1,2]
a4 = [0,1,2]
# Start: looking for first 'a'
Start[0] = a1
Start[1] = Start
Start[2] = Start
# a1: 1 'a' found so far
a1[0] = a2
a1[1] = Reject
a1[2] = Start
# a2: 'aa' found
a2[0] = a3
a2[1] = Reject
a2[2] = Start
# a3: 'aaa' found
a3[0] = a4
a3[1] = Accept
a3[2] = Start
# a4: four or more 'a' in a row
a4[0] = a4
a4[1] = Reject
a4[2] = Start
def detect(s):
"""
Return 1 if first substring aa*b has exactly 3 a's
Return 0 otherwise
"""
state = Start
for c in s:
if c == 'a':
state = state[0]
elif c == 'b':
state = state[1]
else:
state = state[2]
if state is Accept:
return 1
return 0
print detect("xyza123abc")
print detect("xyzaaa123aabc")
print detect("xyzaa123aaabc")
print detect("xyza123aaaabc")
--- end included file ---
And I'm pretty sure it does what you need, though it's pretty naive.
Note that if '3' isn't a magic number, states a1, a2, a3, and a4 could
be re-implemented as a single state with a counter, but the logic
inside detect gets a little hairier.
I haven't timed it, but it's not doing anything other than simple
comparisons and assignments. It's a little (OK, a lot) more code than
a simple RE, but I know it works.
HTH,
-=Dave
--
Change is inevitable, progress is not.
More information about the Python-list
mailing list