Martel-0.3 - a regexp engine generating SAX events
Mon, 9 Oct 2000 06:19:17 -0600
Martel-0.3 is now available from http://www.biopython.org/~dalke/Martel/
Martel is a scanner generator for regular languages which uses SAX events
to send the parse tree information back to the caller. Its goal is to
provide a transition from the many existing flat and semi-structured file
formats (especially those needed for biopython :) to XML. More details
can be found from a recent conference poster available at
o the format description can be a combination of regular expression
strings and Python functions/objects:
- strings are parsed with a modified version of /F's sre_parse
- Python functions are based on Greg Ewing's Plex
o the regular expression is turned into a tag table for Marc Andre
Lemburg's mxTextTools, which does the actual parsing.
o the parser acts like a SAX parser and the resultant tag list is
turned into SAX events for the registered handlers.
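To illustrate the idea of a regular expression driving SAX events, here is a
minimal sketch. It is not Martel's API: it assumes a modern Python 3 and uses
only the standard "re" and "xml.sax" modules instead of mxTextTools, and the
names RecordingHandler and emit_sax_events are made up for this example. It
handles only non-nested named groups, reporting each one as an element.

```python
import re
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class RecordingHandler(ContentHandler):
    """Hypothetical handler that stores the SAX events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def emit_sax_events(pattern, text, handler):
    """Match text against pattern and send each named group as an element.

    Assumption: the named groups do not nest or overlap.
    """
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError("text does not match the format description")

    # Collect (start, end, name) for every named group that took part
    # in the match, ordered by position in the text.
    spans = sorted(
        (match.start(name), match.end(name), name)
        for name in match.re.groupindex
        if match.group(name) is not None
    )

    handler.startDocument()
    pos = 0
    for start, end, name in spans:
        if start > pos:
            # Text between groups is passed through as plain characters.
            handler.characters(text[pos:start])
        handler.startElement(name, AttributesImpl({}))
        handler.characters(text[start:end])
        handler.endElement(name)
        pos = end
    if pos < len(text):
        handler.characters(text[pos:])
    handler.endDocument()


handler = RecordingHandler()
emit_sax_events(r"ID   (?P<entry_name>\w+)\n", "ID   100K_RAT\n", handler)
```

Any standard ContentHandler subclass could be registered the same way, so the
caller sees a flat-file record as if it were an XML document.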
This is the last version of Martel for Python 1.5.2.
Changes from 0.2 to 0.3:
- Added documentation on the internals and on how to write a parser.
- Renamed and moved Generate.StateTable to Parser.Parser
- Renamed the various "ContentHandler" to "DocumentHandler."
ContentHandler was flat out the wrong method name for SAX.
- The parser and exceptions now inherit from the xml.sax.saxlib classes.
- To parse a string, use the "parseString" method. The old "parse"
method now takes a system identifier string. A system identifier is
the SAX way of saying URL. (Note: this will change again with Python
2.0 and the InputSource class.)
- The "generate_*" commands now manipulate lists directly instead of
passing around 'Parser' objects.
- Added parsers which can read a record at a time (ParseRecord and the
- Added the optimize module, which does some limited regular expression
cleanups and optimizations. Haven't tested the performance
- Consistent naming schemes to distinguish between a regexp written as a
string (a "pattern"), a parse tree (an "expression") or an mxTextTools
table (a "tagtable").
- The "Subpattern" Node was renamed to "Group" for naming consistency
with Plex. "Any" was renamed "Dot". "In" was renamed "Any".
- Fixed several bugs when translating from an expression tree back to a
- Added docstrings and comments.
- Added type check for the external Plex-like functions, since I was
getting annoyed that the error for doing 'Opt("text")' instead of
'Opt(Str("text"))' occurred during tagtable generation and was hard to
- Moved self test code from the modules into the test/ directory.
- Changed the regression code to raise an AssertionError when there was
a problem rather than just printing the error and continuing.
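The type check for the Plex-like functions mentioned in the changes above can
be sketched roughly as follows. This is an assumption-laden illustration in
modern Python 3, not Martel's actual code: the Expression base class and the
bodies of Str and Opt are hypothetical, and only the names Str and Opt come
from the announcement. The point is that the error is raised when the
expression is built rather than later during tagtable generation.

```python
class Expression:
    """Hypothetical base class for nodes in an expression tree."""


class Str(Expression):
    """Hypothetical node matching a literal string."""

    def __init__(self, text):
        if not isinstance(text, str):
            raise TypeError("Str() expects a string, got %r" % (text,))
        self.text = text


class Opt(Expression):
    """Hypothetical node making a sub-expression optional."""

    def __init__(self, expression):
        # Fail immediately with a clear message if a bare string
        # (or anything else that is not an Expression) is passed in.
        if not isinstance(expression, Expression):
            raise TypeError(
                "Opt() expects an expression such as Str('text'), got %r"
                % (expression,)
            )
        self.expression = expression


ok = Opt(Str("text"))      # accepted: the argument is an Expression

try:
    Opt("text")            # rejected at construction time
except TypeError as err:
    message = str(err)
```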