# Match First Sequence in Regular Expression?

Dave Hansen iddw at hotmail.com
Thu Jan 26 18:01:02 CET 2006

```On Thu, 26 Jan 2006 16:26:57 GMT in comp.lang.python, "Roger L.
Cauvin" <roger at deadspam.com> wrote:

>"Christos Georgiou" <tzot at sil-tec.gr> wrote in message
>news:boqht19rs7946mtk5s64hqrieq44he5aq7 at 4ax.com...
[...]
>> Is this what you mean?
>>
>> ^[^a]*(a{3})(?:[^a].*)?\$
>
>Close, but the pattern should allow "arbitrary sequence of characters" that
>precede the alternating a's and b's to contain the letter 'a'.  In other
>words, the pattern should accept:
>
>"xayz123aaabbab"
>
>since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.
>

I don't know an RE is the best solution to this problem.  If I
understand the problem correctly, building a state machine to solve
this is trivial.  The following took about 5 minutes of coding:

---begin included file
# Define our states.
#   state is next state if character is 'a'
#   state is next state if character is 'b'
#   state is next state for any other character

# Accept state means we've found a match
Accept = []
for i in range(3):
Accept.append(Accept)

# Reject state means the string cannot match
Reject = []
for i in range(3):
Reject.append(Reject)

# Remaining states: Start, 'a' found, 'aa', 'aaa', and 'aaaa'
Start = [0,1,2]
a1 = [0,1,2]
a2 = [0,1,2]
a3 = [0,1,2]
a4 = [0,1,2]

# Start: looking for first 'a'
Start = a1
Start = Start
Start = Start

# a1: 1 'a' found so far
a1 = a2
a1 = Reject
a1 = Start

# a2: 'aa' found
a2 = a3
a2 = Reject
a2 = Start

# a3: 'aaa' found
a3 = a4
a3 = Accept
a3 = Start

# a4: four or more 'a' in a row
a4 = a4
a4 = Reject
a4 = Start

def detect(s):
"""
Return 1 if first substring aa*b has exactly 3 a's
Return 0 otherwise
"""
state = Start
for c in s:
if c == 'a':
state = state
elif c == 'b':
state = state
else:
state = state
if state is Accept:
return 1
return 0

print detect("xyza123abc")
print detect("xyzaaa123aabc")
print detect("xyzaa123aaabc")
print detect("xyza123aaaabc")

--- end included file ---

And I'm pretty sure it does what you need, though it's pretty naive.
Note that if '3' isn't a magic number, states a1, a2, a3, and a4 could
be re-implemented as a single state with a counter, but the logic
inside detect gets a little hairier.

I haven't timed it, but it's not doing anything other than simple
comparisons and assignments.  It's a little (OK, a lot) more code than
a simple RE, but I know it works.

HTH,
-=Dave

--
Change is inevitable, progress is not.

```

More information about the Python-list mailing list