Martel-0.3 - a regexp engine generating SAX events

Andrew Dalke dalke@acm.org
Mon, 9 Oct 2000 06:19:17 -0600


Martel-0.3 is now available from http://www.biopython.org/~dalke/Martel/

Martel is a scanner generator for regular languages which uses SAX events
to send the parse tree information back to the caller.  Its goal is to
provide a transition from the many existing flat and semi-structured file
formats (especially those needed for biopython :) to XML.  More details
can be found in a recent conference poster available at
http://www.biopython.org/~dalke/Martel/BOSC2000.poster/ .

Implementation details:
  o the format description can be a combination of regular expression
    strings and Python functions/objects:
      - strings are parsed with a modified version of /F's sre_parse
      - Python functions are based on Greg Ewing's Plex

  o the regular expression is turned into a tag table for Marc Andre
    Lemburg's mxTextTools, which does the actual parsing.

  o the parser acts like a SAX parser and the resultant tag list is
    turned into SAX events for the registered handlers.
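
To make the idea concrete, here is a rough sketch of "regular expression
in, SAX events out".  It is not Martel's code or API: it uses only the
standard library's re and xml.sax modules (present-day Python, where the
handler class is ContentHandler), it assumes the named groups are flat
rather than nested, and the names emit_sax_events and Recorder as well as
the sample record are made up for this illustration.

```python
import re
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class Recorder(ContentHandler):
    """Hypothetical handler that records the SAX events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def emit_sax_events(pattern, text, handler):
    """Match text against pattern and report each named group as an element.

    Assumption: named groups do not nest, so they can be replayed in
    order of their start position.
    """
    match = re.match(pattern, text)
    if match is None:
        raise ValueError("text does not match the format description")

    # Sort the named groups by where they matched; skip unused groups.
    spans = sorted(
        (match.span(name), name)
        for name in match.re.groupindex
        if match.start(name) != -1
    )

    handler.startDocument()
    pos = 0
    for (start, end), name in spans:
        if start > pos:
            # Text between groups is passed through as plain characters.
            handler.characters(text[pos:start])
        handler.startElement(name, AttributesImpl({}))
        handler.characters(text[start:end])
        handler.endElement(name)
        pos = end
    if pos < match.end():
        handler.characters(text[pos:match.end()])
    handler.endDocument()


# Usage: a made-up one-line record with two named fields.
recorder = Recorder()
emit_sax_events(
    r"ID\s+(?P<entry_name>\w+);\s+(?P<length>\d+) AA\.",
    "ID   CYC_BOVIN;   104 AA.",
    recorder,
)
for event in recorder.events:
    print(event)
```

The real package builds an mxTextTools tag table instead of calling re,
which is what lets it handle nested groups and large inputs.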

This is the last version of Martel for Python 1.5.2.

Changes from 0.2 to 0.3:
 - Added documentation on the internals and on how to write a parser.
 - Renamed and moved Generate.StateTable to Parser.Parser
 - Renamed the various "ContentHandler" to "DocumentHandler."
   ContentHandler was flat out the wrong method name for SAX.
 - The parser and exceptions now inherit from the xml.sax.saxlib classes.
 - To parse a string, use the "parseString" method.  The old "parse"
   method now takes a system identifier string.  A system identifier is
   the SAX way of saying URL.  (Note: this will change again with Python
   2.0 and the InputSource class.)
 - The "generate_*" commands now manipulate lists directly instead of
   passing around 'Parser' objects.
 - Added parsers which can read a record at a time (ParseRecord and the
   RecordReader classes.)
 - Added the optimize module, which does some limited regular expression
   cleanups and optimizations.  I haven't tested the performance
   differences yet.
 - Adopted consistent naming to distinguish between a regexp written as a
   string (a "pattern"), a parse tree (an "expression") and an mxTextTools
   table (a "tagtable").
 - The "Subpattern" Node was renamed to "Group" for naming consistency
   with Plex.  "Any" was renamed "Dot".  "In" was renamed "Any".
 - Fixed several bugs when translating from an expression tree back to a
   pattern string.
 - Added docstrings and comments.
 - Added type check for the external Plex-like functions, since I was
   getting annoyed that the error for doing 'Opt("text")' instead of
   'Opt(Str("text"))' occurred during tagtable generation and was hard to
   track down.
 - Moved self test code from the modules into the test/ directory.
 - Changed the regression code to raise an AssertionError when there was
   a problem rather than just printing the error and continuing.
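
The type check mentioned above can be pictured roughly as follows.  This
is only a sketch of the idea, not the code in the release: the classes
Expression, Str and Opt here are simplified stand-ins written for this
note, and the helper _require_expression is hypothetical.

```python
class Expression:
    """Stand-in base class for nodes of an expression tree."""


def _require_expression(value, where):
    """Fail at construction time if value is not an expression node."""
    if not isinstance(value, Expression):
        raise TypeError(
            "%s expects an Expression, got %s: %r"
            % (where, type(value).__name__, value)
        )
    return value


class Str(Expression):
    """Matches a literal string."""

    def __init__(self, text):
        if not isinstance(text, str):
            raise TypeError("Str expects a string, got %r" % (text,))
        self.text = text


class Opt(Expression):
    """Matches its sub-expression zero or one times."""

    def __init__(self, expr):
        # Checking here reports the mistake where it was made, instead of
        # much later when the tree is turned into a tagtable.
        self.expr = _require_expression(expr, "Opt")


ok = Opt(Str("text"))       # fine: the argument is an expression node

try:
    Opt("text")             # the mistake: a bare string, not Str("text")
except TypeError as err:
    print("caught:", err)
```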


                    Andrew Dalke
                    dalke@acm.org