Regex Speed

John Machin sjmachin at lexicon.net
Fri Feb 23 19:54:33 EST 2007


On Feb 24, 10:15 am, garri... at gmail.com wrote:
> On Feb 21, 10:34 am, garri... at gmail.com wrote:
>
> > On Feb 20, 6:14 pm, Pop User <popu... at christest2.dc.k12us.com> wrote:
> > >http://swtch.com/~rsc/regexp/regexp1.html
>
> Going back a bit on a tangent, the author of this citation states that
> any regex can be expressed as a  DFA machine. However, while
> investigating this more I appear to have found one example of a regex
> which breaks this assumption.
>
> "ab+c|abd"
>
> Am I correct?

No.

> Can you think of a deterministic method of computing
> this expression?

Firstly rewrite a bit:

ab+c|abd
a(b+c|bd)
a(bb*c|bd)
ab(b*c|d)
ab(b+c|c|d)

Here's a DFA that recognises that:
State 0:
    a -> 1
State 1:
    b -> 2
State 2:
    b -> 3 # start of the b+c branch
    c -> 4 # the c branch
    d -> 4 # the d branch
State 3:
    b -> 3
    c -> 4
State 4:
    accepting state

> It would be easier with a NFA machine,

What is "It"?

> but given that
> the Python method of computing RE's involves pre-compiling a re
> object,

AFAIK all serious regex engines precompile a regex into an internal
representation which can be saved for reuse.

> optimizing the matching engine

What does that mean?

> would make the most sense to
> me.

Compared to what alternatives?

>
> Here's what I have so far:

[big snip]

> >>> a = compile("ab+c|abd")
[snip]
> >>> a.match("abbd")
> Match!

Bzzzt. Neither "ab+c" nor "abd" match a prefix of "abbd".

HTH,
John




More information about the Python-list mailing list