re vs. sgmllib (was: Moving from Perl to Python)

Tim Peters tim_one at email.msn.com
Sun Sep 26 22:01:01 EDT 1999


[Jon Fernquest]
> Since regular expressions are just a short-hand way of specifying
> (practically only *some*) regular languages whereas finite state
> machines can specify any regular language the next logical step
> would be a set of finite state tools like those that Xerox sells
> (for several thousands of dollars I might add).

Then I hope we can skip the next logical step and leapfrog illogically to a
real parser <wink>.  Part of the problem here is that what people want to
parse these days -- from programming language fragments to SGML -- isn't
regular.  That doesn't stop them from trying to do it with regexps, and
input-sensitive bug-ridden code is the result.  Heck, most people find it a
challenge to write a correct regexp to match a Python string -- or even a C
/**/ comment.  Not that regular languages aren't useful, but I expect their
appropriate non-trivial applications will always be a wizard art.
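
To make the C comment case concrete, here is a minimal sketch using Python's
re module.  The names NAIVE and CAREFUL and the sample strings are made up for
illustration.  The careful pattern is the well-known "runs of stars" form, and
even it ignores string literals (it would happily match inside
"/* not a comment */"), which is the kind of input-sensitive trouble described
above.

```python
import re

# Naive attempt: '.*' is greedy, and '.' does not match a newline by default.
NAIVE = re.compile(r"/\*.*\*/")

# More careful pattern: after "/*", skip non-stars, then a run of stars;
# repeat until a run of stars is immediately followed by "/".
# Assumption: the input has no string literals containing "/*".
CAREFUL = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

one_line = "/* a */ x /* b */"
multi_line = "a = 1; /* one */ b = 2; /* two\n lines **/ c = 3;"

# The naive pattern swallows the code between two comments on one line ...
naive_one = NAIVE.findall(one_line)
# ... and misses a comment that spans a line break.
naive_multi = NAIVE.findall(multi_line)

# The careful pattern finds each comment separately in both samples.
careful_one = CAREFUL.findall(one_line)
careful_multi = CAREFUL.findall(multi_line)

print(naive_one, naive_multi)
print(careful_one, careful_multi)
```

On these two samples the naive pattern fails in both directions, once by
matching too much and once by matching too little, while the careful one
returns each comment on its own.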

> Perl's adoption of regular expressions was sort of a revolution
> I guess,

Na, Perl grew up in the Unix zoo, where at least a dozen popular tools used
their own flavor of regexps before it.  Awk in particular pioneered tight
integration of regexps with a programming language, and Perl didn't add much
essential to what Awk did with them (indeed, Awk is still more convenient
for some kinds of text-crunching tasks!).  What Perl did do is combine the
best features of all the preceding regexp notations, toss the worst, add a
few nice twists of its own, and make it all dance.  Perl had some real
innovations, but wrt regexps it was mostly a nice synthesis of prior art.

> but there's another revolution looming on the horizon for the
> language that incorporates generalized finite state technology.

Python will be happy to accept a module <wink>.

> Finite state technology is really great for dealing with
> non-roman character sets,

Free Unicode regexp packages already exist, e.g.

    http://ourworld.compuserve.com/homepages/John_Maddock/regexpp.htm

is a nice one for C++; and more are on the way.  Since a million programmers
have already been deluded <0.6 wink> into thinking regexps are "the answer",
they're going to want more of the same.

> [more advocacy, and cool references, elided]
> ...
> The little language Gema also has language "acceptor" objects and
> also some recursive pattern matching capability which can be used
> to parse.
> http://www.telerama.com/~mundie/index.html

Here's example 1 from http://www.telerama.com/~mundie/Gema/GemaGems.html:

    Example 1
    Take a tab-delimited text file and make an HTML table out of it

    \n\n*\n\n=<table border>\n$1</table>
    \L<U>=\t<tr>\n@makerow{$0}\t</tr>
    makerow:<P>=\t\t<td>$1</td>\n;?=

It certainly appeals to the Perl eye <wink>.
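
For comparison, a rough Python sketch of the same task.  The function name
tsv_to_html_table is hypothetical, and the sketch simplifies the Gema rules in
two ways that are assumptions rather than a faithful translation: it treats
the whole input as one table instead of looking for a block set off by blank
lines, and it HTML-escapes cell text, which the Gema rules above do not do.

```python
from html import escape


def tsv_to_html_table(text):
    """Turn tab-delimited lines into an HTML table (illustrative sketch)."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines instead of using them as delimiters
        # One <td> per tab-separated field; escaping is an addition here.
        cells = "".join(
            f"\t\t<td>{escape(cell)}</td>\n" for cell in line.split("\t")
        )
        rows.append(f"\t<tr>\n{cells}\t</tr>\n")
    return "<table border>\n" + "".join(rows) + "</table>"


print(tsv_to_html_table("a\tb\nc\td"))
```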

Note that Perl is in the process of adding recursive "regexps".

the-mind-boggles-ly y'rs  - tim





