[XML-SIG] dom building, sax, and namespaces

Wed, 23 Jan 2002 01:40:22 -0700

Hey all,

  I'm trying to work with XML namespaces the backwards way
around, and I'm getting lost.  I'm trying to figure out
what I need to do to build a DOM via SAX events with
proper namespace support so I can do XPath queries on
the resulting document.

  I'm lost because I can't figure out what's SAX1 compared
to SAX2, how namespaces change things (e.g., what's the
difference between startElement and startElementNS?),
and how to use XPath for namespace'ed queries.

  I ordered the Jones&Drake "Python&XML" book but only found
1/2 page on namespaces.  I found documentation on the 4Suite
site which gives me enough to do a namespace'd XPath if I
read the XML from a file - I needed a Context with the
proper namespace defined.  But I couldn't figure out how
to make that DOM "by hand."

  Here's some background to know where I'm coming from.

  I've written a parser engine called Martel (some details at
http://www.dalkescientific.com/  ).  This uses a regular
expression grammar to generate a parser, which parses a
file and produces SAX events.  I wrote it as a way to transition
between existing flat-/semi-structured formats and XML.

Here's an example use.  Suppose you have a simple "key = value"
file format where lines in the file look like

# This is a comment
name = Andrew
city = Santa Fe

With Martel this could be parsed with

import Martel
skip_line = Martel.Re(r"#[^\n]*\n| *\n")
# The newline sits outside the "value" group, matching the output below
kv = Martel.Re(r"(?P<key>\w+) *= *(?P<value>[^\n]*)\n")
# Group is the same as r"(?P<entry>....)"
entry = Martel.Group("entry", kv)

# Can have 0 or more repeats of these two line definitions
format = Martel.Rep(skip_line | entry)

# Make the parser
parser = format.make_parser()

# Now show the XML output
from xml.sax import saxutils
parser.setContentHandler(saxutils.XMLGenerator())
parser.parse(open("file.dat"))

With the above example text, the output is

<?xml ....>
# This is a comment
<entry><key>name</key> = <value>Andrew</value>
</entry><entry><key>city</key> = <value>Santa Fe</value>
</entry>

which makes it easy to pick out those key/value fields
using standard XML techniques.
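
As a sketch of what "standard XML techniques" could look like here,
a small ContentHandler can collect the key/value pairs.  The
KeyValueHandler class and the <doc> wrapper element are made up for
this example; the wrapper only exists so the standard-library parser
can replay the output shown above.  With Martel the handler would
presumably be passed to parser.setContentHandler() instead.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class KeyValueHandler(ContentHandler):
    """Collect the text of <key> and <value> inside each <entry>."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished (key, value) tuples
        self._field = None     # "key", "value", or None when outside both
        self._text = {"key": "", "value": ""}

    def startElement(self, name, attrs):
        if name == "entry":
            self._text = {"key": "", "value": ""}
        elif name in ("key", "value"):
            self._field = name

    def characters(self, content):
        # Text outside <key>/<value> (comments, " = ", newlines) is ignored
        if self._field is not None:
            self._text[self._field] += content

    def endElement(self, name):
        if name in ("key", "value"):
            self._field = None
        elif name == "entry":
            self.pairs.append((self._text["key"], self._text["value"]))


# The output from above, wrapped in a hypothetical <doc> root element
sample = (
    "<doc># This is a comment\n"
    "<entry><key>name</key> = <value>Andrew</value>\n"
    "</entry><entry><key>city</key> = <value>Santa Fe</value>\n"
    "</entry></doc>"
)

handler = KeyValueHandler()
xml.sax.parseString(sample.encode("utf-8"), handler)
print(handler.pairs)
```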

I want to come up with a set of tag names which can be
shared across different format definitions.  I thought
the best solution was to put them in their own namespace,
as in:

  <bioformat:dataset>
    <bioformat:record>
  ID   <bioformat:dbid type="primary">100K_RAT</bioformat:dbid> ...
  AC   <bioformat:dbid type="accession">P126943</bioformat:dbid> ...

(FYI, attrs are encoded in the regular expression as
   (?P<bioformat:dbid?type=primary>expression)
)
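
To illustrate the encoding, here is a rough sketch of how such a
group name could be split back into a tag name and attributes.  The
split_tag helper is hypothetical and is not Martel's own code; it
assumes the part after "?" follows URL query-string rules.

```python
from urllib.parse import parse_qsl


def split_tag(group_name):
    """Split 'tag?attr=value' into (tag, {attr: value}).

    Hypothetical helper: assumes query-string style attribute encoding.
    """
    name, _, query = group_name.partition("?")
    return name, dict(parse_qsl(query))


print(split_tag("bioformat:dbid?type=primary"))
```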

This is all done with my limited understanding of namespaces
and SAX, so the ContentHandler gets the following events

startDocument()
startElement("bioformat:dataset", {})
startElement("bioformat:record", {})
characters("ID   ")
startElement("bioformat:dbid", {"type": "primary"})
characters("100K_RAT")
endElement("bioformat:dbid")
 ..
characters("AC   ")
 ...

(The {}'s are proper Attribute objects, and not simple {}s)

I can stick a pulldom.SAX2DOM on the parser as the ContentHandler,
and it produces a document.  However, it seems that I can't
get access to the namespaced terms via XPath.  I know I'm
not far wrong because I can use XPath to get the non-namespaced
fields, and if I save the text to a file then read it in then
I can also do namespace'ed XPath queries.  (Umm, though to do
that I need to change the input text to declare the namespace, as in
  <bioformat:dataset xmlns:bioformat="http://biopython.org/bioformat">
)

So I suspect that my understanding of SAX is incorrect.
Can someone here show me how to generate SAX2 events by hand
and put the results in a DOM, then show how to do an XPath
query on that DOM?
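
To make the question concrete, here is a rough sketch of what such
an answer might look like, using only the Python standard library.
It assumes the namespace-aware calls are startPrefixMapping() and
startElementNS() with (uri, localname) tuples and AttributesNSImpl
attributes, in place of the plain startElement() calls listed
earlier.  The standard library's minidom has no XPath engine, so
getElementsByTagNameNS() and ElementTree's limited path syntax
stand in for the real XPath query; the 4Suite equivalent is left out.

```python
import xml.etree.ElementTree as ET
from xml.dom import pulldom
from xml.sax.xmlreader import AttributesNSImpl

NS = "http://biopython.org/bioformat"


def ns_attrs(plain=None):
    """Build namespace-aware attributes from {name: value} (no attr namespace)."""
    plain = plain or {}
    values = {(None, k): v for k, v in plain.items()}
    qnames = {(None, k): k for k in plain}
    return AttributesNSImpl(values, qnames)


handler = pulldom.SAX2DOM()
handler.startDocument()
# Declare the prefix before the element that carries it
handler.startPrefixMapping("bioformat", NS)
handler.startElementNS((NS, "dataset"), "bioformat:dataset", ns_attrs())
handler.startElementNS((NS, "record"), "bioformat:record", ns_attrs())
handler.characters("ID   ")
handler.startElementNS((NS, "dbid"), "bioformat:dbid",
                       ns_attrs({"type": "primary"}))
handler.characters("100K_RAT")
handler.endElementNS((NS, "dbid"), "bioformat:dbid")
handler.endElementNS((NS, "record"), "bioformat:record")
handler.endElementNS((NS, "dataset"), "bioformat:dataset")
handler.endPrefixMapping("bioformat")
handler.endDocument()

doc = handler.document

# DOM lookup by (namespace URI, local name); the prefix itself is irrelevant
hits = doc.getElementsByTagNameNS(NS, "dbid")
print([(node.getAttribute("type"), node.firstChild.data) for node in hits])

# Namespaced path query: the "bf" prefix here is arbitrary and is bound
# to the same URI through the mapping passed to findall()
root = ET.fromstring(doc.toxml())
found = root.findall(".//bf:dbid", {"bf": NS})
print([elem.text for elem in found])
```

The point of the sketch is that the query side matches on the
namespace URI, so the prefix used in the path expression need not
be the one used in the document.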

I also don't know how to deal with parser features, but that's
something that can wait.  I'll be at the conference, with code,
and wanting to bug people in person.  :)

Thanks!

                    Andrew
                    dalke@dalkescientific.com