[XML-SIG] dom building, sax, and namespaces

Wed, 23 Jan 2002 11:09:54 +0100 (CET)

On Wed, 23 Jan 2002, Andrew Dalke wrote:

> Hey all,
> 
>   I'm trying to work with XML namespaces the backwards way
> around, and I'm getting lost.  I'm trying to figure out
> what I need to do to build a DOM via SAX events with
> proper namespace support so I can do XPath queries on
> the resulting document.
> 
>   I'm lost because I can't figure out what's SAX1 compared
> to SAX2, how namespaces changes things (eg, what's the
> difference between startElement and startElementNS?),
> and how to use XPath for namespace'ed queries.

First, startElement and startElementNS are both part of SAX2 (SAX1 is
deprecated). Which one the parser calls depends on the
'feature_namespaces' feature. With feature_namespaces on, the parser
calls the *NS methods and does part of the namespace handling job.
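
To make the difference concrete, here is a small sketch using the
standard library's xml.sax with its default (expat) driver. The Recorder
class, the record() helper and the sample document are made up for
illustration; only the feature name and the handler methods come from
SAX2 itself.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

NS = "http://biopython.org/bioformat"
# Hypothetical sample input, just large enough to show both call styles.
DOC = b'<b:dataset xmlns:b="http://biopython.org/bioformat"><b:record/></b:dataset>'


class Recorder(ContentHandler):
    """Remember which start-element callback the parser used."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called when feature_namespaces is off: name is the raw qname.
        self.events.append(("startElement", name))

    def startElementNS(self, name, qname, attrs):
        # Called when feature_namespaces is on: name is (uri, localname).
        # qname is ignored here.
        self.events.append(("startElementNS", name))


def record(namespaces):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces)
    handler = Recorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return handler.events


print(record(False))
print(record(True))
```

With the feature off the handler only ever sees prefixed strings such as
"b:dataset"; with it on, the prefix has already been resolved to the
namespace URI.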

>   I ordered the Jones&Drake "Python&XML" book but only found
> 1/2 page on namespaces.  I found documentation on the 4Suite
> site which gives me enough to do a namespace'd XPath if I
> read the XML from a file - I needed a Context with the
> proper namespace defined.  But I couldn't figure out how
> to make that DOM "by hand."
> 
>   Here's some background to know where I'm coming from.
> 
>   I've written a parser engine called Martel (some details at
> http://www.dalkescientific.com/  ).  This uses a regular
> expression grammar to generate a parser, which parses a
> file and produces SAX events.  I wrote it as a way to transition
> between existing flat-/semi-structured formats and XML.
> 
> Here's an example use.  Suppose you have a simple "key = value"
> file format where lines in the file look like
> 
> # This is a comment
> name = Andrew
> city = Santa Fe
> 
> With Martel this could be parsed with
> 
> import Martel
> skip_line = Martel.Re(r"#[^\n]*\n| *\n")
> kv = Martel.Re(r"(?P<key>\w+) *= *(?P<value>[^\n]*)\n")
>   # Group is the same as r"(?P<entry>....)"
> entry = Martel.Group("entry", kv)
> 
> # Can have 0 or more repeats of these two line definitions
> format = Martel.Rep(skip_line | entry)
> 
> # Make the parser
> parser = format.make_parser()
> 
> # Now show the XML output
> from xml.sax import saxutils
> parser.setContentHandler(saxutils.XMLGenerator())
> parser.parse(open("file.dat"))
> 
> With the above example text, the output is
> 
> <?xml ....>
> # This is a comment
> <entry><key>name</key> = <value>Andrew</value>
> </entry><entry><key>city</key> = <value>Santa Fe</value>
> </entry>

Hmm, I guess this is only part of your XML output, since this sample
isn't a well-formed XML document (no root element).

> which makes it easy to pick out those key/value fields
> using standard XML techniques.
> 
> I want to come up with a set of tag names which can be
> shared across different format definitions.  I thought
> the best solution was to put them in their own namespace,
> as in:
> 
>   <bioformat:dataset>
>     <bioformat:record>
>   ID   <bioformat:dbid type="primary">100K_RAT</bioformat:dbid> ...
>   AC   <bioformat:dbid type="accession">P126943</bioformat:dbid> ...
> 
> (FYI, attrs are encoded in the regular expression as
>    (?P<bioformat:dbid?type=primary>expression)
> )
> 
> This is all done with my limited understanding of namespaces
> and SAX, so the ContentHandler gets the following events
> 
> startDocument()
> startElement("bioformat:dataset", {})
> startElement("bioformat:record", {})
> characters("ID   ")
> startElement("bioformat:dbid", {"type": "primary"})
> characters("100K_RAT")
> endElement("bioformat:dbid")
>  ..
> characters("AC   ")
>  ...
> 
> (The {}'s are proper Attribute objects, and not simple {}s)
> 
> I can stick a pulldom.SAX2DOM on the parser as the ContentHandler,
> and it produces a document.  However, it seems that I can't
> get access to the namespaced terms via XPath.  I know I'm
> not far wrong because I can use XPath to get the non-namespaced
> fields, and if I save the text to a file then read it in then
> I can also do namespace'ed XPath queries.  (Umm, though to do
> that I need to change the input text to include a namespace
>   <bioformat:dataset xmlns:bioformat="http://biopython.org/bioformat">
> )

The namespace declaration should be included in your generated events. I
guess that's why you can't access the namespaced elements via XPath (I
don't know why pulldom.SAX2DOM doesn't complain, but it should).

> So I suspect that my understanding of the SAX is incorrect. 
> Can someone here show me how to generate SAX2 events by-hand
> and put the results in a DOM, then show how to do an XPath
> query on that DOM?

The correct sax events for your example should be:

_ if feature_namespaces == 0

startDocument()
startElement('bioformat:dataset',
             {'xmlns:bioformat': 'http://biopython.org/bioformat'})
startElement("bioformat:record", {}) 
...
endElement('bioformat:dataset')
endDocument()

_ if feature_namespaces == 1

startDocument() 
startPrefixMapping('bioformat', 'http://biopython.org/bioformat')
startElementNS(('http://biopython.org/bioformat','dataset'),
               'bioformat:dataset', {}) 
startElementNS(('http://biopython.org/bioformat', 'record'),
               'bioformat:record', {}) 
characters("ID   ")
startElementNS(('http://biopython.org/bioformat', 'dbid'), 
               'bioformat:dbid', 
               {(EMPTY_NAMESPACE, 'type'): 'primary'})
...
endElementNS(('http://biopython.org/bioformat','dataset'), 
             'bioformat:dataset')
endPrefixMapping('bioformat')
endDocument()
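
As a sketch of how the feature_namespaces == 1 sequence can be produced
by hand, the snippet below feeds those events to the standard library's
xml.sax.saxutils.XMLGenerator and collects the serialized text. The
emit() helper and the NS constant are names chosen for this example, and
None stands in for EMPTY_NAMESPACE.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

NS = "http://biopython.org/bioformat"


def emit(handler):
    """Send hand-written, namespace-aware SAX2 events to any handler."""
    no_attrs = AttributesNSImpl({}, {})
    handler.startDocument()
    # Declare the prefix before the element that carries it.
    handler.startPrefixMapping("bioformat", NS)
    handler.startElementNS((NS, "dataset"), "bioformat:dataset", no_attrs)
    handler.startElementNS((NS, "record"), "bioformat:record", no_attrs)
    handler.characters("ID   ")
    # An unprefixed attribute lives in no namespace, hence the None.
    attrs = AttributesNSImpl({(None, "type"): "primary"},
                             {(None, "type"): "type"})
    handler.startElementNS((NS, "dbid"), "bioformat:dbid", attrs)
    handler.characters("100K_RAT")
    handler.endElementNS((NS, "dbid"), "bioformat:dbid")
    handler.endElementNS((NS, "record"), "bioformat:record")
    handler.endElementNS((NS, "dataset"), "bioformat:dataset")
    handler.endPrefixMapping("bioformat")
    handler.endDocument()


out = io.StringIO()
emit(XMLGenerator(out))
xml_text = out.getvalue()
print(xml_text)
```

The generator writes the xmlns:bioformat declaration on the dataset
element by itself, because it was told about the mapping through
startPrefixMapping.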

Whether namespace declarations are also reported in the attributes
dictionary (BTW, it's an AttributesNSImpl when feature_namespaces is on)
depends on the feature_namespace_prefixes feature.
Look at xml.sax.saxutils.XMLGenerator to see how to handle namespaces and
DOM generation (the main difficulty is maintaining a context stack that
maps declared prefixes to namespaces).

To do an XPath query on the DOM tree, you have to use a context (as you
seem to be aware). Here is an example (where dom_node is your DOM
document):

from xml.xpath import Compile
from xml.xpath.Context import Context
path = Compile('//bioformat:dbid[@type="primary"]')
context = Context(dom_node,
           processorNss={'bioformat' : 'http://biopython.org/bioformat'})
node_set = path.evaluate(context)
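
If xml.xpath is not installed, the same idea (a prefix-to-URI map handed
to the query) can be tried with the standard library's
xml.etree.ElementTree. This is a stand-in, not the API used above: it
works on its own tree type rather than a DOM and supports only a subset
of XPath. The sample document is assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input modelled on the bioformat example.
DOC = (
    '<bioformat:dataset xmlns:bioformat="http://biopython.org/bioformat">'
    '<bioformat:record>'
    'ID   <bioformat:dbid type="primary">100K_RAT</bioformat:dbid>\n'
    'AC   <bioformat:dbid type="accession">P126943</bioformat:dbid>'
    '</bioformat:record>'
    '</bioformat:dataset>'
)

root = ET.fromstring(DOC)

# The prefix used in the query is resolved through this map, not through
# whatever prefix the document itself happens to use.
namespaces = {"bioformat": "http://biopython.org/bioformat"}

primary = root.findall(".//bioformat:dbid[@type='primary']", namespaces)
ids = [element.text for element in primary]
print(ids)
```

The leading ".//" makes the search cover all descendants of the root
element rather than only its direct children.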

> I also don't know how to deal with parser features, but that's
> something that can wait.  I'll be at the conference, with code,
> and wanting to bug people in person.  :)

See http://www.saxproject.org/ for more information.

Hope that helps!

regards

-- 
Sylvain Thenault

  LOGILAB           http://www.logilab.org