re vs. sgmllib (was: Moving from Perl to Python)

Sun Oct 3 20:04:59 EDT 1999

Pipe:

http://starship.skyport.net/~lemburg/mxTextTools.html
http://members.home.com/mcfletch/programming/simpleparse/simpleparse.html

Simpleparse generates fairly naive state machines for mxTextTools (i.e. I
don't do any gross optimisation, not even fast-map-type optimisations of
leftmost productions), and it only handles a very limited set of grammars.
The most notable limitation is that it will not handle the following
correctly if you are expecting regex-style parsing:

v := t*, s

In essence, once a production (t*) has matched, backtracking does not cause
it to be re-tried with a smaller match.  This could be "fixed", but I
haven't had any need to do so myself.  Having the generator build the state
machine so that this will work even with complex multi-production
backtracking is a little beyond anything in which I'm interested.
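
To make the limitation concrete, here is a minimal Python sketch. It is not
Simpleparse or mxTextTools code, only an illustration of the behaviour. It
assumes, purely for the example, that t is any lowercase ASCII letter and s
is the literal "z"; the function name match_no_backtrack is made up.

```python
import re


def match_no_backtrack(text):
    """Match v := t*, s without ever retrying t* with a shorter match.

    Illustrative assumptions: t := [a-z], s := 'z'.
    Returns the end position of the match, or None on failure.
    """
    pos = 0
    # t*: consume as many lowercase letters as possible, then commit.
    while pos < len(text) and "a" <= text[pos] <= "z":
        pos += 1
    # s: must match at the committed position; t* is never shortened.
    if text.startswith("z", pos):
        return pos + 1
    return None


# A backtracking regex engine gives back the final 'z' so that s can match.
regex_result = re.match(r"[a-z]*z", "abcz")

# The committed matcher lets t* swallow the 'z' too, so s has nothing left.
committed_result = match_no_backtrack("abcz")
```

With the input "abcz" the regex succeeds, while the committed matcher fails
because t* has already eaten the trailing "z". A grammar where t and s cannot
start with the same character never runs into this.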

As a minor note, using simpleparse and/or mxTextTools to define a tokeniser
might give a decently performing tokeniser engine on which the Earley system
could run (though there'd still be the basic overhead of the resolution
phase).  Since tokenisation is normally LL(1), the limitations of
simpleparse shouldn't stand in the way.  No need for it myself, but it might
interest someone somewhere.
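
As a rough picture of why one-character lookahead is enough for a typical
tokeniser, here is a small hand-written sketch in plain Python. It uses
neither simpleparse nor mxTextTools, and the token kinds (NAME, NUMBER, OP)
and the function name tokenise are assumptions made up for the example.

```python
def tokenise(text):
    """Split text into (kind, value) pairs using one character of lookahead.

    The token kind is decided from the first character alone, and a token,
    once started, is consumed greedily and never re-tried.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1  # skip whitespace between tokens
            continue
        start = pos
        if ch.isdigit():
            kind = "NUMBER"
            while pos < len(text) and text[pos].isdigit():
                pos += 1
        elif ch.isalpha() or ch == "_":
            kind = "NAME"
            while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
                pos += 1
        else:
            kind = "OP"  # any other single character
            pos += 1
        tokens.append((kind, text[start:pos]))
    return tokens


example_tokens = tokenise("x1 = 42+y")
```

Because each branch is chosen by the first character and the greedy loops
never need to hand characters back, the lack of backtracking described above
does not matter for this kind of grammar.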

Cheers,
Mike

-----Original Message-----
Tim:
...
I rarely save URLs, so you'll have to trudge thru DejaNews to find these (or
maybe their authors will pipe up):
...
Least conventional, & fastest:  Marc-Andre Lemburg's mxTextTools.  You
basically build your own state machine out of Python tuples, which are
executed by a C extension module; very fast, very delicate.  MikeF put a
more conventional layer on top of it, in his own unconventional <wink> way.
...