Martel-0.3 is now available from http://www.biopython.org/%7Edalke/Martel/
Martel is a scanner generator for regular languages that uses SAX events to send the parse tree information back to the caller. Its goal is to provide a transition from many existing flat and semi-structured file formats (especially those needed for biopython :) to XML. More details can be found in a recent conference poster available at http://www.biopython.org/%7Edalke/Martel/BOSC2000.poster/ .
Implementation details:
o the format description can be a combination of regular expression strings and Python functions/objects:
   - strings are parsed with a modified version of /F's sre_parse
   - Python functions are based on Greg Ewing's Plex
o the regular expression is turned into a tag table for Marc Andre Lemburg's mxTextTools, which does the actual parsing.
o the parser acts like a SAX parser and the resultant tag list is turned into SAX events for the registered handlers.
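The last step above, turning a tag list into SAX events, can be sketched roughly as follows. This is an illustration only and not Martel's code: it is written for a current Python rather than 1.5.2, it assumes the mxTextTools convention that each tag list entry is a (name, left, right, subtags) tuple, and emit_events and Recorder are hypothetical names made up for the example.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def emit_events(text, taglist, handler, start, end):
    """Walk a nested tag list and fire SAX events on `handler`.

    Assumes each entry is (name, left, right, subtags), where
    text[left:right] is the matched region and subtags is a nested
    tag list or None. Text not covered by any tag is passed through
    as plain character data.
    """
    pos = start
    for name, left, right, subtags in taglist:
        if pos < left:
            # Untagged text between the previous tag and this one.
            handler.characters(text[pos:left])
        handler.startElement(name, AttributesImpl({}))
        if subtags:
            emit_events(text, subtags, handler, left, right)
        else:
            handler.characters(text[left:right])
        handler.endElement(name)
        pos = right
    if pos < end:
        # Trailing untagged text inside the enclosing region.
        handler.characters(text[pos:end])


class Recorder(ContentHandler):
    """Hypothetical handler that just records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))


text = "ID 42\n"
taglist = [("record", 0, 6, [("key", 0, 2, None), ("value", 3, 5, None)])]
recorder = Recorder()
emit_events(text, taglist, recorder, 0, len(text))
for event in recorder.events:
    print(event)
```

Any handler with the same three methods could be registered in place of Recorder, which is the point of routing the parse results through SAX.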
This is the last version of Martel for Python 1.5.2.
Changes from 0.2 to 0.3:
- Added documentation on the internals and on how to write a parser.
- Renamed and moved Generate.StateTable to Parser.Parser
- Renamed the various "ContentHandler" to "DocumentHandler." ContentHandler was flat out the wrong method name for SAX.
- The parser and exceptions now inherit from the xml.sax.saxlib classes.
- To parse a string, use the "parseString" method. The old "parse" method now takes a system identifier string. A system identifier is the SAX way of saying URL. (Note: this will change again with Python 2.0 and the InputSource class.)
- The "generate_*" commands now manipulate lists directly instead of passing around 'Parser' objects.
- Added parsers which can read a record at a time (ParseRecord and the RecordReader classes.)
- Added the optimize module, which does some limited regexp cleanups and optimizations. Haven't tested the performance differences yet.
- Consistent naming schemes to distinguish between a regexp written as a string (a "pattern"), a parse tree (an "expression") or an mxTextTools table (a "tagtable").
- The "Subpattern" Node was renamed to "Group" for naming consistency with Plex. "Any" was renamed "Dot". "In" was renamed "Any".
- Fixed several bugs when translating from an expression tree back to a pattern string.
- Added docstrings and comments.
- Added type check for the external Plex-like functions, since I was getting annoyed that the error for doing 'Opt("text")' instead of 'Opt(Str("text"))' occurred during tagtable generation and was hard to track down.
- Moved self test code from the modules into the test/ directory.
- Changed the regression code to raise an AssertionError when there was a problem rather than just printing the error and continuing.
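The type check mentioned in the change list can be pictured along these lines. This is a minimal sketch in current Python, not the code shipped in Martel: Opt and Str are the names used above, but the Expression base class, the attribute names and the error message are assumptions made for the example.

```python
class Expression:
    """Hypothetical base class for parse-tree nodes."""


class Str(Expression):
    """Matches a literal string."""

    def __init__(self, text):
        if not isinstance(text, str):
            raise TypeError("Str() expects a string, got %r" % (text,))
        self.text = text


class Opt(Expression):
    """Makes the wrapped expression optional."""

    def __init__(self, expression):
        # Check at construction time, so the mistake is reported where
        # it was made and not later during tagtable generation.
        if not isinstance(expression, Expression):
            raise TypeError(
                "Opt() expects an Expression, got %r; did you mean "
                "Opt(Str(%r))?" % (expression, expression)
            )
        self.expression = expression


ok = Opt(Str("text"))  # accepted

try:
    Opt("text")  # a bare string is rejected immediately
except TypeError as err:
    print(err)
```

Failing in the constructor keeps the traceback pointing at the line that built the bad expression.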
Andrew Dalke email@example.com