partial parsing?

Andrew Dalke dalke at acm.org
Mon Apr 24 04:33:38 EDT 2000


It looks like I wasn't clear in my post, so let me start over.
This time I'll explain the problem domain first, and end up with
the question part.

I need to parse files of two broad classes: database files and
output files.  The first contains many sets of data records and was
designed to be parsed by hand-coded parsers written by non-CS
people (no context-free grammars here).  These files range from 10K
up to several hundred MB.  The second class, output from executables,
wasn't really designed to be machine parsable, but since it is
machine generated, it can be parsed.

In both cases, the formats contain more information than is usually
needed.  I often just want the record identifier and one or two other
fields, and not the author's name, the publication data, the date
of submission, the ... I think you get it.  And in some cases, I just
want the start/end position of the record to make an index.

Additionally, all of the formats are line oriented.  Nearly all are
machine generated, so there's a good likelihood that the data is in
the right format.  The lexical elements are quite stateful; the word
'sequence' may mean three or more things in the same record, depending
on the line and position.

On the simplification side, I believe I can get away with parsing
regular grammars only.  Some of the formats, about 25%, are non-regular.
Indeed, some expect three lines in a row to have the same length, which
is a context-sensitive constraint, but the assertion that they are in
the right format means a regular grammar will work.  (E.g., a[n]b[n]c[n]
cannot be described by a regular grammar, but it can be parsed with
a*b*c* if the input is guaranteed to be correct.)
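
For instance (a toy case, not one of the real formats):

  import re

  # a[n]b[n]c[n] is not regular, but if the input is guaranteed to be
  # well formed, the regular approximation a*b*c* splits it correctly;
  # it simply never verifies that the three runs have equal length.
  m = re.match(r"(a*)(b*)(c*)$", "aaabbbccc")
  a_run, b_run, c_run = m.groups()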

Examples of the database files I need to parse are:
 220MB at ftp://ftp.expasy.ch/databases/swiss-prot/release/sprot38.dat
 21MB (compressed) at ftp://ncbi.nlm.nih.gov/genbank/gbmam.seq.Z
 55K at http://www.expasy.ch/cgi-bin/get-pdb-entry.pl?9pti
HTML marked-up records from the first two formats are at:
 http://www.expasy.ch/cgi-bin/get-sprot-entry?P80222

 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=3822546&dopt=GenBank
Links from the "DR" lines of the SwissProt record (the first one) give
pointers to another dozen database formats.

Examples of executable output I need to parse are at:
 490K at ftp://bio.perl.org/pub/katel/Blast/Nucleic/wisteria.txt
 3K (a small one!) at ftp://bio.perl.org/pub/katel/Blast/Nucleic/lupine.txt

I usually write event driven parsers, in the style of SAX for XML
parsing.  This works well if I want to parse everything, but there's
a lot of overhead if I only need a couple of fields - I end up
ignoring most of the events.  Indeed, most people write specialized
parsers to extract just the needed data.  Since the formats are line
oriented, that means something like:

  find the line starting with 'ID'; read the first word,
  find the lines starting with 'SQ'; get rest of the data in the line,
  record ends at the line equal to '//'.

In perl, this is ridiculously easy:
  $id = ''; $sq = '';
  while (<>) {
    if (/^ID (\w+)/) { $id = $1; next; }   # record identifier
    if (/^SQ (.*)/)  { $sq .= $1; next; }  # accumulate sequence data
    if ($_ eq "//\n") {                    # end of record
       #&do_something($id, $sq)
       $id = ''; $sq = '';
    }
  }
That's one reason why perl is popular in this field.
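
For comparison, a rough Python translation of that loop (the regexps
and the "//" terminator are taken straight from the perl version) is
about the same amount of code:

  import re, sys

  rec_id = ''; sq = ''
  for line in sys.stdin:
    m = re.match(r"ID (\w+)", line)
    if m:
      rec_id = m.group(1)
      continue
    m = re.match(r"SQ (.*)", line)
    if m:
      sq = sq + m.group(1)
      continue
    if line == "//\n":
      #do_something(rec_id, sq)
      rec_id = ''; sq = ''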

So I want some way to start with the full grammar for the format,
then generate a parser which sends events for only those terms
I'm interested in, and which allows optimizations that assume the
format is correct, so it can do things like skip some bytes or read
a few lines without checking what's on them.

Say, perhaps, something like:

  import sys, ParserGen
  from Formats import GivenFormat

  # Only generate events for the three tags I care about; the rest of
  # the format is skipped.
  parser = ParserGen.LenientParser(GivenFormat,
                           tags = ("identifier", "sq_data", "end") )

  class SQHandler:
    # Collects the identifier and sequence data for one record, then
    # calls func with them when the record ends.
    def __init__(self, func):
      self.reset()
      self.func = func
    def reset(self):
      self.id = self.sq = ''
    def parse(self, tag, text):
      if tag == "identifier":
        self.id = text
      elif tag == "sq_data":
        self.sq = self.sq + text
      elif tag == "end":
        self.func(self.id, self.sq)
        self.reset()

  def do_something(id, seq):
    pass

  parser.set_handler(SQHandler(do_something))
  parser.parse(sys.stdin)

I know, it looks more complicated than the perl version, but it
allows reads from any file-like object, and once created, the
SQHandler object can be reused, so the amortized cost is quite
small.  Plus, the same handler can work with related formats, perhaps
with the right adapter.
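
For instance, still using the hypothetical ParserGen API above, the
same handler instance can be handed files, pipes or in-memory
StringIO objects, and reused from parse to parse since the "end"
event resets it:

  from StringIO import StringIO   # io.StringIO in later Pythons

  handler = SQHandler(do_something)
  parser.set_handler(handler)
  parser.parse(open("sprot38.dat"))
  parser.parse(StringIO("ID FAKE_ENTRY\nSQ MKTAYIAK\n//\n"))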

I've been looking at existing parser generation systems, such as
lex/yacc, SPARK and Plex.  They all seem to be designed to verify
every character in the file, so I lose the performance I could
otherwise gain from the guarantee of a valid format.

They also prefer having a good context-free grammar, as compared
to the stateful one I have.  While they do allow the lexer to have
different states, the specification for doing so isn't very natural
(IMHO).  Or to put it another way, the author of the O'Reilly
lex/yacc book says he did a FORTRAN parser purely in lex because
yacc didn't work very well.  I think that's relevant, since the
formats I deal with have the same sort of feel I expect a FORTRAN
parser to have.

Indeed, what I'm envisioning is more like a super-scanner than a
parser, since the formats I need this system for are almost all
regular in nature, rather than context free.


So!  Since what I think I'm looking for doesn't seem to exist in
the toolsets I'm used to using, I'm going to have to write something
(most likely targeting mxTextTools or Plex for the back end).
Since I doubt I'm working on a unique situation, I figure there
must be existing code that does this which I haven't heard about,
or at least papers describing it.  I don't know exactly what I'm
looking for, so any pointers would be welcome.  BTW, the implementation
for what I'm thinking about feels pretty straightforward, but I
think it's best to compare it with what others have done.
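
To make the "straightforward" part concrete, here is a very rough
sketch (all names and rule syntax made up, and regexps standing in
for a real back end like mxTextTools or Plex): the format is given
as (tag, line regexp) rules, and the scanner only checks lines
against the rules for the requested tags, skipping everything else
without validating it.

  import re

  class LineScanner:
    def __init__(self, rules, tags):
      # keep only the rules whose tags were requested
      self.rules = [(tag, re.compile(pattern))
                    for (tag, pattern) in rules if tag in tags]
      self.handler = None
    def set_handler(self, handler):
      self.handler = handler
    def parse(self, infile):
      for line in infile:
        for tag, pattern in self.rules:
          m = pattern.match(line)
          if m:
            self.handler.parse(tag, m.group(1))
            break

  # made-up rules for the ID/SQ/'//' example above
  sprot_rules = [("identifier", r"ID (\w+)"),
                 ("sq_data",    r"SQ (.*)"),
                 ("end",        r"(//)$")]
  scanner = LineScanner(sprot_rules, tags=("identifier", "sq_data", "end"))

A real version would compile the rule list down to something the
back end can run directly instead of a regexp loop.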

"It", to reiterate, is a generator for event-based parsers which
are optimized to create events only for selected subsets of a given
(regular?) grammar and which can make the assumption that the input
file is in the correct format.

                    Andrew Dalke
                    dalke at acm.org





