XML parsing besides SAX and DOM
Andrew Dalke
dalke at dalkescientific.com
Fri Dec 14 16:13:06 EST 2001
Me:
> I've been doing XML processing of a record-oriented
>data set where in a record:
>
> - the data isn't deep (pretty much flat)
> - all the text is used from between the start and end tags
> - the text is short (easily fits into memory)
>
>I developed another XML API to simplify this case, and I'm
>curious about other similar systems (eg, I may want to
>scrap my work). I know about SAX and DOM and about RAX.
>What others are there?
Someone asked me in personal email why I thought the existing
systems were too complicated. Since I think my response
describes one of the things I feel is clunky about working
with XML, I'm posting it here.
===
Suppose you have a file containing lines formatted with one
of three forms:
1) name = value
example: color = light blue
2) blank (only spaces)
3) comments indicated by a '#' in the first column
The standard way to parse that (using Python 2.2 iterators) is
something like this:
for line in file:
if line[:1] == "#":
continue
if line.strip() == "":
continue
pos = line.find("=")
if pos == -1:
raise TypeError("Cannot find the '=' in %s" % repr(line))
left = line[:pos].strip()
right = line[pos+1:].strip()
print repr(left), repr(right)
Consider the equivalent XML file with entries of the form
<entry><left>color</left><right>light blue</right></entry>
Parsing this with SAX requires something akin to
import xml.sax.handler
class ReadEntries(handler.ContentHandler):
def startDocument(self):
self.capture = 0
self.text = None
self.data = None
def startElement(self, tag, attrs):
if tag in ["left", "right"]:
self.capture = 1
self.data = {}
self.text = ""
def characters(self, s):
if self.capture:
self.text += s
def endElement(self, tag):
if tag in ["left", "right"]:
self.data[tag] = self.text
self.text = None
self.capture = 0
elif tag == "entry":
print self.data["left"], self.data["right"]
self.data = None
import xml.sax
parser = xml.sax.make_parser()
parser.setContentHandler(ReadEntries())
parser.parseFile(...)
Doing the same with DOM is less work, but I don't recall
the way to do it -- plus, the DOM implementations for Python want
to store the whole document in memory, and some of my documents
are much bigger than physical RAM.
The project I'm working on is a way to tokenize flat files
as if they are in XML, so I can make
color = light blue
generate SAX2 events to make it look like
<entry><left>color </left>=<right>light blue</right>
This can be done with
import Martel, SimpleFields
comment = Martel.Group("comment",
Martel.Str("#") + Martel.ToEol())
entry = Martel.Group("entry",
Martel.Re(r"(?P<left>[^=]+)=(?P<right>[^\R]+)\R"))
blank = Martel.Re(r"(?P<blank>[ \t]*\R)")
format = Martel.Rep(comment | entry | blank)
iterator = format.make_iterator("entry")
for record in iterator.iterateString("color = light blue\n",
SimpleFields.SimpleFields()):
print record.groups["left"][0].strip(),
print record.groups["right"][0].strip()
which is only a couple lines more than the easiest flat-file
solution. Yet it's all based on XML parsing techniques.
(It also requires an iterator object that can turn the SAX
callback into iterable chunks, which I've done.)
Since I can't believe I'm the first to develop a technique
like this (indeed, I got some of the ideas for it based on
RAX), I'm looking to see what I can learn from related
approaches.
Andrew
dalke at dalkescientific.com
More information about the Python-list
mailing list