[XML-SIG] dom building, sax, and namespaces
Andrew Dalke <dalke@dalkescientific.com>
Wed, 23 Jan 2002 01:40:22 -0700
Hey all,
I'm trying to work with XML namespaces the backwards way
around, and I'm getting lost. I'm trying to figure out
what I need to do to build a DOM via SAX events with
proper namespace support so I can do XPath queries on
the resulting document.
I'm lost because I can't figure out what's SAX1 compared
to SAX2, how namespaces change things (eg, what's the
difference between startElement and startElementNS?),
and how to use XPath for namespace'ed queries.
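A minimal sketch of the startElement / startElementNS difference, assuming Python's standard xml.sax package and its default expat-based reader. The Recorder class, the record() helper and the one-element sample input are made up for illustration. The same handler is run twice: once with the namespaces feature off and once with it on.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

URI = "http://biopython.org/bioformat"
XML = b'<bf:dbid xmlns:bf="http://biopython.org/bioformat" type="primary">100K_RAT</bf:dbid>'


class Recorder(ContentHandler):
    """Hypothetical handler that only records which start events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startPrefixMapping(self, prefix, uri):
        self.events.append(("startPrefixMapping", prefix, uri))

    def startElement(self, name, attrs):
        # Non-namespace mode: name is the raw qualified name as a string.
        self.events.append(("startElement", name))

    def startElementNS(self, name, qname, attrs):
        # Namespace mode: name is a (uri, localname) tuple.
        self.events.append(("startElementNS", name))


def record(namespaces):
    """Parse XML with the namespaces feature on or off; return the recorded events."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces)
    handler = Recorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(XML))
    return handler.events


plain = record(False)
aware = record(True)
print(plain)
print(aware)
```

With the feature off, the handler should see only startElement with the string "bf:dbid". With it on, it should see a startPrefixMapping for "bf" followed by startElementNS with the (uri, localname) pair.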
I ordered the Jones&Drake "Python&XML" book but only found
1/2 page on namespaces. I found documentation on the 4Suite
site which gives me enough to do a namespace'd XPath if I
read the XML from a file - I needed a Context with the
proper namespace defined. But I couldn't figure out how
to make that DOM "by hand."
Here's some background to know where I'm coming from.
I've written a parser engine called Martel (some details at
http://www.dalkescientific.com/ ). This uses a regular
expression grammar to generate a parser, which parses a
file and produces SAX events. I wrote it as a way to transition
between existing flat-/semi-structured formats and XML.
Here's an example use. Suppose you have a simple "key = value"
file format where lines in the file look like
# This is a comment
name = Andrew
city = Santa Fe
With Martel this could be parsed with
import Martel
skip_line = Martel.Re(r"#[^\n]*\n| *\n")
kv = Martel.Re(r"(?P<key>\w+) *= *(?P<value>[^\n]*)\n")
# Group is the same as r"(?P<entry>....)"
entry = Martel.Group("entry", kv)
# Can have 0 or more repeats of these two line definitions
format = Martel.Rep(skip_line | entry)
# Make the parser
parser = format.make_parser()
# Now show the XML output
from xml.sax import saxutils
parser.setContentHandler(saxutils.XMLGenerator())
parser.parse(open("file.dat"))
With the above example text, the output is
<?xml ....>
# This is a comment
<entry><key>name</key> = <value>Andrew</value>
</entry><entry><key>city</key> = <value>Santa Fe</value>
</entry>
which makes it easy to pick out those key/value fields
using standard XML techniques.
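As one illustration of "standard XML techniques", here is a rough sketch that collects the key/value pairs with a plain SAX ContentHandler from the standard library. The sample text is the output above wrapped in a hypothetical <file> root element (an assumption, so that it parses as a standalone document), and KeyValueCollector is an illustrative name, not part of Martel.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# The output shown above, wrapped in a hypothetical <file> root element.
SAMPLE = b"""<file># This is a comment
<entry><key>name</key> = <value>Andrew</value>
</entry><entry><key>city</key> = <value>Santa Fe</value>
</entry></file>"""


class KeyValueCollector(ContentHandler):
    """Collect the text of <key> and <value> inside each <entry> into a dict."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._field = None      # "key", "value", or None when outside both
        self._current = {}

    def startElement(self, name, attrs):
        if name == "entry":
            self._current = {"key": "", "value": ""}
        elif name in ("key", "value"):
            self._field = name

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._field is not None:
            self._current[self._field] += content

    def endElement(self, name):
        if name in ("key", "value"):
            self._field = None
        elif name == "entry":
            self.pairs[self._current["key"]] = self._current["value"]


collector = KeyValueCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.pairs)
```

The comment line and the " = " separators are ignored because they fall outside the key and value elements.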
I want to come up with a set of tag names which can be
shared across different format definitions. I thought
the best solution was to put them in their own namespace,
as in:
<bioformat:dataset>
<bioformat:record>
ID <bioformat:dbid type="primary">100K_RAT</bioformat:dbid> ...
AC <bioformat:dbid type="accession">P126943</bioformat:dbid> ...
(FYI, attrs are encoded in the regular expression as
(?P<bioformat:dbid?type=primary>expression)
)
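To make that encoding concrete, a small sketch of how such a group name could be split into a tag name and an attribute dict. This is not Martel's actual code: split_group_name is a hypothetical helper, and the use of "&" to separate multiple attributes is an assumption that the example above does not confirm.

```python
def split_group_name(name):
    """Split 'tag?attr=value' into (tag, {attr: value}).

    Hypothetical helper for illustration only; not taken from Martel.
    """
    tag, _, query = name.partition("?")
    attrs = {}
    if query:
        # Assumption: several attributes would be joined with '&'.
        for pair in query.split("&"):
            key, _, value = pair.partition("=")
            attrs[key] = value
    return tag, attrs


tag, attrs = split_group_name("bioformat:dbid?type=primary")
print(tag, attrs)
```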
This is all done with my limited understanding of namespaces
and SAX, so the ContentHandler gets the following events
startDocument()
startElement("bioformat:dataset", {})
startElement("bioformat:record", {})
characters("ID ")
startElement("bioformat:dbid", {"type": "primary"})
characters("100K_RAT")
endElement("bioformat:dbid")
..
characters("AC ")
...
(The {}'s are proper Attribute objects, and not simple {}s)
I can stick a pulldom.SAX2DOM on the parser as the ContentHandler,
and it produces a document. However, it seems that I can't
get access to the namespaced terms via XPath. I know I'm
not far wrong because I can use XPath to get the non-namespaced
fields, and if I save the text to a file then read it in then
I can also do namespace'ed XPath queries. (Umm, though to do
that I need to change the input text to include a namespace declaration:
<bioformat:dataset xmlns:bioformat="http://biopython.org/bioformat">
)
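A sketch of that read-from-text case, using the standard library's xml.etree.ElementTree as a stand-in for 4Suite (an assumption: ElementTree supports only a limited XPath subset, and its prefix-to-URI dict plays the role of the 4Suite Context). The sample text is abbreviated from the fragment above, and the "bf" prefix is chosen arbitrarily to show that the query prefix need not match the one in the document.

```python
import xml.etree.ElementTree as ET

TEXT = """<bioformat:dataset xmlns:bioformat="http://biopython.org/bioformat">
<bioformat:record>
ID <bioformat:dbid type="primary">100K_RAT</bioformat:dbid>
AC <bioformat:dbid type="accession">P126943</bioformat:dbid>
</bioformat:record>
</bioformat:dataset>"""

root = ET.fromstring(TEXT)

# The query prefix is bound to the namespace URI here, independently of
# whatever prefix the document itself uses.
NSMAP = {"bf": "http://biopython.org/bioformat"}

accessions = [
    elem.text
    for elem in root.findall("bf:record/bf:dbid[@type='accession']", NSMAP)
]
print(accessions)
```

Without the xmlns:bioformat declaration on the root element, the text is not namespace-well-formed and the parser rejects the unbound prefix, which matches the need to add the declaration before reading it back in.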
So I suspect that my understanding of SAX is incorrect.
Can someone here show me how to generate SAX2 events by hand
and put the results in a DOM, then show how to do an XPath
query on that DOM?
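One possible shape for this, sketched with only the Python standard library: the events are sent as startPrefixMapping / startElementNS / endElementNS calls to xml.dom.pulldom.SAX2DOM, which builds a minidom document. The ns_attrs and emit helpers are hypothetical names, the event sequence is abbreviated from the one listed above, and the query step uses minidom's getElementsByTagNameNS plus a round trip through ElementTree's limited XPath subset rather than 4Suite (an assumption, since 4Suite's own API is not shown here).

```python
import xml.etree.ElementTree as ET
from xml.dom.pulldom import SAX2DOM
from xml.sax.xmlreader import AttributesNSImpl

BIOFORMAT_NS = "http://biopython.org/bioformat"


def ns_attrs(plain=None):
    """Build a fresh AttributesNSImpl from a dict of un-namespaced attributes."""
    plain = plain or {}
    values = {(None, key): value for key, value in plain.items()}
    qnames = {(None, key): key for key in plain}
    return AttributesNSImpl(values, qnames)


def emit(handler):
    """Send namespace-aware SAX2 events for a small bioformat document."""
    handler.startDocument()
    # Declare the prefix before the element that carries the declaration.
    handler.startPrefixMapping("bioformat", BIOFORMAT_NS)
    handler.startElementNS((BIOFORMAT_NS, "dataset"), "bioformat:dataset", ns_attrs())
    handler.startElementNS((BIOFORMAT_NS, "record"), "bioformat:record", ns_attrs())
    for label, kind, value in (("ID ", "primary", "100K_RAT"),
                               ("AC ", "accession", "P126943")):
        handler.characters(label)
        handler.startElementNS((BIOFORMAT_NS, "dbid"), "bioformat:dbid",
                               ns_attrs({"type": kind}))
        handler.characters(value)
        handler.endElementNS((BIOFORMAT_NS, "dbid"), "bioformat:dbid")
        handler.characters("\n")
    handler.endElementNS((BIOFORMAT_NS, "record"), "bioformat:record")
    handler.endElementNS((BIOFORMAT_NS, "dataset"), "bioformat:dataset")
    handler.endPrefixMapping("bioformat")
    handler.endDocument()


handler = SAX2DOM()
emit(handler)
doc = handler.document

# Namespace-aware lookup directly on the DOM.
ids = [node.firstChild.data
       for node in doc.getElementsByTagNameNS(BIOFORMAT_NS, "dbid")]

# XPath-style query after serializing the DOM and re-reading it.
root = ET.fromstring(doc.toxml())
primary = [elem.text
           for elem in root.findall(".//bf:dbid[@type='primary']",
                                    {"bf": BIOFORMAT_NS})]
print(ids, primary)
```

The key differences from the startElement events above are that each element name is a (namespace URI, local name) pair and that the prefix is announced separately through startPrefixMapping, so the DOM builder can record the xmlns:bioformat declaration on the root element.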
I also don't know how to deal with parser features, but that's
something that can wait. I'll be at the conference, with code,
and wanting to bug people in person. :)
Thanks!
Andrew
dalke@dalkescientific.com