split hangs

Tim Peters tim.one at home.com
Wed Jan 3 22:21:50 EST 2001


[Tim, tweaks Davide]
> Then you had better start reading better stuff <wink>.  I'll repeat
> the original recommendation:  Jeffrey [note: I erroneously said
> Jonathan before] Friedl's excellent book "Mastering Regular
> Expressions" (O'Reilly).

[davide at mercurio.localdomain]
> Sorry, but the dragon book (Aho Sethi, Ullman), Introduction to
> automata theory, languages and computation (Hopcroft, Ullman) and
> thousand of others books on language theory are surely a better
> reference <wink>.

Those are excellent references for the Theory of Regular Languages, but are
pretty much irrelevant to the practice of regular expressions.  They're not
going to teach you anything about the pragmatics you need to learn if you
want to use regexps effectively in Python, Perl, Emacs, etc.  The
"theoretical best" algorithms rarely get implemented in real life.  For
example, take a construction that promises marvelous big-O improvement, but
at the up-front cost of doing work proportional to the alphabet size
(which-- for any actual alphabet --is a fixed and usually <wink> finite
constant, and so doesn't contribute to big-O).  In a Unicode world (w/ a 64K
alphabet), the result is likely of no practical use.

That doesn't mean I despise theory (quite the contrary); it means that if
you're asking (as you did) why some regexp took exponential time in Python,
your notion of "better" is inverted if it prevents you from discovering the
answer.  The books you favor aren't going to tell you; Friedl will (I
cheerfully agree that Aho et alia will tell you a world of stuff Friedl
won't though -- just not stuff that will answer your question).
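
A small illustration of the effect (the pattern here is a made-up example of
my own, not the one from the original thread): nested quantifiers like (a+)+
give a backtracking engine exponentially many ways to carve up a run of 'a's
before it can conclude there is no match.

    import re
    import time

    # Made-up example: nested quantifiers force a backtracking engine
    # (like Python's re module) to try exponentially many ways to split
    # the 'a's before it can report failure; the trailing 'b' guarantees
    # there is no match.
    pattern = re.compile(r'^(a+)+$')

    for n in range(14, 23, 2):
        text = 'a' * n + 'b'
        start = time.time()
        pattern.match(text)
        print(n, round(time.time() - start, 3))
    # Each extra pair of 'a's roughly quadruples the running time.

That's exactly the kind of pragmatics Friedl spends pages on and the theory
books don't.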

>> There are two "obvious" ways to implement a regexp engine.

> Indeed, regexp matching is quite simple.

No, Regexp Matching is simple.  The matching of regexps, on the other hand,
is a god-awful engineering nightmare.  Pick your favorite language and study
its regexp-matching source code!  Study its bug history.  Try to improve it.
Ponder why no two regexp implementations even support the same notation.
Besides being fun, it will give you a healthy skepticism about "better"
<wink>.

>> One extreme Friedl calls "the NFA" way (although he's using "NFA" in a
>> sense that differs from its technical meaning in the literature).

> Yes, reading your post I understand that the reason is that these are not
> NFAs: "backreferences" are not part of regular expressions, and so we are
> not talking about regular languages.

I had more in mind that an NFA is formally defined in various ways by
various authors (like with or without epsilon transitions, with or without
multiple start states, with or without multiple same-symbol outgoing
transitions from a single state, with or without expressions (as opposed to
single symbols) on transitions, ...), but that in the implemented engines
Friedl calls "NFA" you'll rarely find any of that stuff.  At least not as
such.  The typical "NFA engine" in practice is really a general-purpose
graph-crawling backtracking engine, as powerful as the engine in SNOBOL4 --
and about as hard to predict in all cases.  While regexp notations rarely
let you get at all that power (not to mention debugging headaches), there's
been a steady trend in the Perl language to expose more of it over time
(e.g., to match nested bracket constructs, which is in the realm of
context-free languages, and so beyond what an NFA (in the non-Friedl sense)
can do).  Backreferences are just part of that; the point is that Regexp
Theory doesn't even *inform* most people who write a non-DFA regexp engine,
and that's why Regexp Theory has no useful point of contact with such
engines.
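
To make that concrete, here's a small sketch of what a backreference buys
you (the pattern is my own illustrative choice, not anything from the
thread).  The language of "some word followed by an exact copy of itself"
is not regular -- it isn't even context-free -- yet a backtracking engine
handles it directly:

    import re

    # Illustrative example: (\w+)\1 matches a "word" immediately followed
    # by an exact copy of itself -- the classic non-regular language {ww}.
    doubled = re.compile(r'^(\w+)\1$')

    print(bool(doubled.match('abcabc')))   # True:  'abc' + 'abc'
    print(bool(doubled.match('abcabd')))   # False: second half differs

No DFA can do that, which is part of why Friedl's "NFA" engines are really
something else entirely.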

they're-only-called-regexps-they're-really-luxury-yachts-ly y'rs  - tim




