[XML-SIG] Python 1.6 XML APIs

Tue, 20 Jun 2000 15:20:06 +0200

We need a little more concentrated coordination on XML in Python 1.6.
I'll do what I can over the next two weeks. I did some thinking about
what I consider a coherent strategy while I was on a plane recently.
Here is what I'm thinking:

====

We know that no single Python processing toolkit can be everything to
everyone. Each must make performance/ease of use trade-offs that will
not always be applicable. Therefore we need more than one XML parsing
API.

I think that there are two main axes where performance and ease of use
are traded off. The one axis is labelled "full tree versus streaming."
DOM and qp_xml are "full tree". SAX is streaming. There is a strong
consensus that we need both full tree and streaming APIs in Python 1.6.
For various quasi-technical reasons, I think that most people expect
those APIs to be SAX 2 (or some subset) and some sort of miniature DOM.

Therefore I am in the process of cleaning up minidom. I am almost done
-- the last step is SAX 2 integration. I think Lars is almost done with
SAX 2 so we are doing pretty well.

The other axis is labelled "friendly XML-specific objects" versus
"primitive Python objects". The DOM uses "friendly XML-specific objects"
whereas qp uses primitive objects. I think that both options are
important so I favor putting qp into Python 1.6 if it can be made SAX 2
compatible and properly documented in time. (I am willing to work on
this)

tree/primitive objs = qp
tree/XML objs = minidom
streaming/primitive objs = SAX
streaming/XML objs = ???

In the fourth quadrant are libraries like my EventDOM which are
streaming but use friendly objects. Right now, EventDOM is way too
heavyweight for the standard distribution because its dispatcher is so
sophisticated (and slow!!)

Nevertheless, in only 150 lines I have implemented a streaming API that
uses friendly DOM objects. I call it PullDOM. It has the following
characteristics:

 * it builds heavily on minidom, which is why it is so small

 * minidom itself is only 600 lines (it might grow by a third once we
add convenience functions and other such junk)

 * it uses a "pull" methodology which is a little more flexible and
easier to learn than the traditional "push". In the documentation we
can describe how to build a ten-line dispatch engine. (see below)

 * the API is brain-dead simple (see below)

 * every node knows its parent nodes so context-based checking is easy

 * any node can easily be expanded into a "subtree" -- you get
some of the benefits of a tree-API with much less overhead

 * processes Hamlet in 2 seconds on P3/450

 * simple! simple! simple! convenient! convenient! convenient!

In general, I think that it is a really nice simplicity/performance
middle ground. Much, much, much easier to use than straight SAX and
much, much, much more performant (esp. for large documents) than DOM.

Right now the API consists of basically two functions and one class with
one method and one protocol.

Functions:

parse( stream_or_filename_or_url)
parseXML( string )

Each of these returns a DOMEventStream object. It can be used in one of
two ways:

1.

for (token_type, node) in pulldom.parse( "hamlet.xml" ):
    print token_type, node

2.

events=pulldom.parse( "hamlet.xml" )

token=events.getEvent( )
while token:
    (token_type, node)=token
    print token_type, node
    token=events.getEvent( )

token_types are: ("START_ELEMENT", "END_ELEMENT", 
                        "COMMENT", "START_DOCUMENT", "END_DOCUMENT",
                        "PROCESSING_INSTRUCTION", 
                        "IGNORABLE_WHITESPACE", "CHARACTERS")
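
The two consumption styles can be checked against each other. What
follows is only a sketch, not the code described in this message: it
assumes a module shaped like the xml.dom.pulldom module in the
present-day Python standard library, where the string entry point is
called parseString (not parseXML) and the token types are module-level
constants. The sample document and the helper names are made up for
illustration.

```python
from xml.dom import pulldom

SAMPLE = "<a><b>text</b></a>"  # hypothetical sample document


def tokens_by_iteration(xml_text):
    """Style 1: let a for loop pull the (token_type, node) pairs."""
    return [(token_type, node.nodeName)
            for token_type, node in pulldom.parseString(xml_text)]


def tokens_by_get_event(xml_text):
    """Style 2: call getEvent() until it returns None."""
    events = pulldom.parseString(xml_text)
    result = []
    token = events.getEvent()
    while token:
        token_type, node = token
        result.append((token_type, node.nodeName))
        token = events.getEvent()
    return result


by_loop = tokens_by_iteration(SAMPLE)
by_call = tokens_by_get_event(SAMPLE)

# Element names in the order their START_ELEMENT tokens arrived.
start_tags = [name for token_type, name in by_loop
              if token_type == pulldom.START_ELEMENT]
```

Both helpers walk the same stream, so the two lists should match token
for token; the for loop is just the shorter way to write it.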

At any point you can build a subtree:

if token_type=="START_ELEMENT" and node.tagName=="TABLE" \
        and node.namespaceURI=="http://www.w3.org/...":
    events.expandNode( node )
    print node.childNodes

Now the node has children. The next call to getEvent (or __getitem__)
returns the node that follows this one, not a child node.
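
Here is a small runnable sketch of that behavior. It is an illustration
under assumptions, not the PullDOM code itself: it uses the
xml.dom.pulldom module from the present-day standard library (with
parseString and module-level token constants), and the sample document
and function name are invented.

```python
from xml.dom import pulldom

# Hypothetical sample document, standing in for a large file.
SAMPLE = ("<play><title>Hamlet</title>"
          "<act><scene>Elsinore</scene></act>"
          "<epilogue/></play>")


def expand_acts(xml_text):
    """Return (tags seen as START_ELEMENT, serialized expanded subtrees)."""
    events = pulldom.parseString(xml_text)
    seen, subtrees = [], []
    for token_type, node in events:
        if token_type != pulldom.START_ELEMENT:
            continue
        seen.append(node.tagName)
        if node.tagName == "act":
            # Pull the rest of this element's events into a real subtree.
            events.expandNode(node)
            subtrees.append(node.toxml())
    return seen, subtrees


seen, subtrees = expand_acts(SAMPLE)
```

The loop never sees a START_ELEMENT for "scene": expandNode consumed
it while building the subtree, and the stream resumes at "epilogue".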

====

Why didn't I put in a dispatcher? I've been down this path many times
before. First you want to dispatch on node types. Then element types.
Then namespace-qualified element types. Then namespaces with no element
type and element types with no namespaces. Then context. Then attribute
values. Then context AND attribute values. Eventually you end up
reinventing XSLT in Python syntax. I am totally in favor of reinventing
XSLT in Python but *not* as part of the standard distribution (at least
not yet).

Therefore, I will write documentation that *demonstrates* a few of these
dispatching strategies and let the user use their imagination. Using
simple DOM commands you can get from children to parents, check
attributes, check namespaces, etc. You don't need to learn some kind of
addressing "sublanguage" -- just use the same old DOM properties.
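
One such strategy, dispatching on token type plus element name, might
look like the sketch below. This is one possible shape, not a proposed
standard API: the dispatch function, the handler signature and the
sample document are all hypothetical, and the event stream comes from
the xml.dom.pulldom module in the present-day standard library.

```python
from xml.dom import pulldom

# Hypothetical sample document.
SAMPLE = ("<PLAY><SPEECH><SPEAKER>HAMLET</SPEAKER>"
          "<LINE>To be, or not to be</LINE></SPEECH>"
          "<SPEECH><SPEAKER>OPHELIA</SPEAKER>"
          "<LINE>Good my lord</LINE></SPEECH></PLAY>")


def dispatch(events, handlers):
    """Call handlers[(token_type, tagName)](events, node) when one exists."""
    for token_type, node in events:
        if token_type in (pulldom.START_ELEMENT, pulldom.END_ELEMENT):
            handler = handlers.get((token_type, node.tagName))
            if handler is not None:
                handler(events, node)


speakers = []
speech_count = [0]  # one-element list so the handler can update it


def on_speaker(events, node):
    # Expand the element so its text is available as child nodes.
    events.expandNode(node)
    speakers.append("".join(child.data for child in node.childNodes
                            if child.nodeType == child.TEXT_NODE))


def on_speech_end(events, node):
    speech_count[0] += 1


dispatch(pulldom.parseString(SAMPLE), {
    (pulldom.START_ELEMENT, "SPEAKER"): on_speaker,
    (pulldom.END_ELEMENT, "SPEECH"): on_speech_end,
})
```

Passing the event stream to each handler lets a handler decide for
itself whether to expand a node. Dispatching on namespaces, context or
attribute values would mean changing the dictionary key or adding
ordinary if statements inside the handlers.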

====

I would like the same two parse methods to be available in minidom, qp,
and pulldom. So 

minidom.parse("hamlet.xml") gives you a DOM
qp.parse( "hamlet.xml" ) gives you a qp data structure
pulldom.parse( "hamlet.xml"  ) gives you a DOM event stream
sax.parse( "hamlet.xml" ) doesn't really return anything, but it
processes your document with your document handler.

Under the covers, all APIs use SAX, so behavior should be extremely
consistent between all modules.
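
To make the consistency point concrete, here is a sketch that counts
the elements of one document through three of these entry points. It
assumes the module layout of the present-day standard library
(xml.dom.minidom, xml.dom.pulldom, xml.sax) and uses their parseString
functions so that it needs no file on disk; qp is left out because I
cannot assume an importable module for it. The handler class and
helper names are hypothetical.

```python
import xml.sax
from xml.dom import minidom, pulldom

SAMPLE = "<doc><p>one</p><p>two</p></doc>"  # hypothetical sample document


class CountingHandler(xml.sax.ContentHandler):
    """SAX document handler that only counts element starts."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_with_minidom(text):
    # Full tree: parse everything, then query the DOM.
    return len(minidom.parseString(text).getElementsByTagName("*"))


def count_with_pulldom(text):
    # Streaming with DOM nodes: count START_ELEMENT tokens.
    return sum(1 for token_type, _node in pulldom.parseString(text)
               if token_type == pulldom.START_ELEMENT)


def count_with_sax(text):
    # Streaming with callbacks: the parse call itself returns nothing.
    handler = CountingHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


counts = (count_with_minidom(SAMPLE),
          count_with_pulldom(SAMPLE),
          count_with_sax(SAMPLE))
```

All three counts should agree, since each API is looking at the same
underlying parse events.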

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
"Music is the stuff between the notes." - Claude Debussy