XML parsing besides SAX and DOM

Fri Dec 14 16:13:06 EST 2001

Me:
>  I've been doing XML processing of a record-oriented
>data set where in a record:
>
>  - the data isn't deep (pretty much flat)
>  - all the text is used from between the start and end tags
>  - the text is short (easily fits into memory)
>
>I developed another XML API to simplify this case, and I'm
>curious about other similar systems (eg, I may want to
>scrap my work).  I know about SAX and DOM and about RAX.
>What others are there?

Someone asked me in personal email why I thought the existing
systems were too complicated.  Since I think my response
describes one of the things I feel is clunky about working
with XML, I'm posting it here.

   ===
Suppose you have a file containing lines formatted with one
of three forms:
1) name = value
  example: color = light blue
2) blank (only spaces)
3) comments indicated by a '#' in the first column

The standard way to parse that (using Python 2.2 iterators) is
something like this:

for line in file:
  if line[:1] == "#":
    continue
  if line.strip() == "":
    continue
  pos = line.find("=")
  if pos == -1:
    raise TypeError("Cannot find the '=' in %s" % repr(line))
  left = line[:pos].strip()
  right = line[pos+1:].strip()
  print repr(left), repr(right)

Consider the equivalent XML file with entries of the form

<entry><left>color</left><right>light blue</right></entry>

Parsing this with SAX requires something akin to

import xml.sax.handler

class ReadEntries(handler.ContentHandler):
  def startDocument(self):
    self.capture = 0
    self.text = None
    self.data = None
  def startElement(self, tag, attrs):
    if tag in ["left", "right"]:
      self.capture = 1
      self.data = {}
      self.text = ""
  def characters(self, s):
    if self.capture:
      self.text += s
  def endElement(self, tag):
    if tag in ["left", "right"]:
      self.data[tag] = self.text
      self.text = None
      self.capture = 0
    elif tag == "entry":
      print self.data["left"], self.data["right"]
      self.data = None

import xml.sax
parser = xml.sax.make_parser()
parser.setContentHandler(ReadEntries())
parser.parseFile(...)

Doing the same with DOM is less work, but I don't recall
the way to do it -- plus, the DOM implementations for Python want
to store the whole document in memory, and some of my documents
are much bigger than physical RAM.

The project I'm working on is a way to tokenize flat files
as if they are in XML, so I can make

color = light blue

generate SAX2 events to make it look like

<entry><left>color </left>=<right>light blue</right>

This can be done with

import Martel, SimpleFields

comment = Martel.Group("comment",
                       Martel.Str("#") + Martel.ToEol())
entry = Martel.Group("entry",
         Martel.Re(r"(?P<left>[^=]+)=(?P<right>[^\R]+)\R"))
blank = Martel.Re(r"(?P<blank>[ \t]*\R)")
format = Martel.Rep(comment | entry | blank)

iterator = format.make_iterator("entry")
for record in iterator.iterateString("color = light blue\n",
                             SimpleFields.SimpleFields()):
  print record.groups["left"][0].strip(),
  print record.groups["right"][0].strip()

which is only a couple lines more than the easiest flat-file
solution.  Yet it's all based on XML parsing techniques.
(It also requires an iterator object that can turn the SAX
callback into iterable chunks, which I've done.)

Since I can't believe I'm the first to develop a technique
like this (indeed, I got some of the ideas for it based on
RAX), I'm looking to see what I can learn from related
approaches.

                    Andrew
                    dalke at dalkescientific.com