[XML-SIG] SAX Namespaces

Paul Prescod paul@prescod.net
Mon, 03 Jul 2000 15:33:24 -0500


[still catching up]

> ...
> I think this is very wrong, and that we shouldn't do it. I've looked
> through the SAX code in the Python CVS tree and this is the only thing
> I'm really unhappy about.
> 
> Let me enumerate the reasons:
> 
>  - the rawname is not really part of the element name, so it does not
>    feel right to have it in the tuple like that

The rawname *is* the element type name. 

>  - it makes name comparison (the most common operation on names!)
>    really ugly:
>
>      if name[0] == xslt_ns and \
>         name[1] == 'template':
>         # do something useful

#1. I don't fundamentally believe that people will be doing this at the
SAX level. SAX is pretty painful for this sort of thing and I would like
to see people move to something with a stack and tree mode (whether
pulldom, Pyxie, whatever) soon. 

#2. The clean way to do what you describe is:

def startElement( self, (URI, name, rawname), attrs ):
    if (URI, name)==(xslt_ns, 'template'):
        # do something useful

or:

def startElement( self, name, attrs ):
    (URI, localname, rawname)=name
    if (URI, localname)==(xslt_ns, 'template'):
        # do something useful

>  - dispatching also becomes ugly:
> 
>      self._element_handlers[(name[0], name[1])] ()

Same deal.
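
For the dispatch case, here is a minimal sketch, assuming names arrive
as (URI, localname, rawname) triples. The Dispatcher class, its handler
table and the XSLT_NS constant are illustrative names only, not part of
any SAX API:

```python
# Hypothetical sketch: element names are assumed to arrive as
# (URI, localname, rawname) triples.
XSLT_NS = "http://www.w3.org/1999/XSL/Transform"


class Dispatcher:
    def __init__(self):
        self.seen = []
        # Handlers are keyed on (URI, localname) only; the rawname
        # (the prefixed form) plays no part in the lookup.
        self._element_handlers = {
            (XSLT_NS, "template"): self.start_template,
        }

    def start_template(self, rawname, attrs):
        # Record the rawname so the caller can see which events matched.
        self.seen.append(rawname)

    def startElement(self, name, attrs):
        # Unpack once, then dispatch on the (URI, localname) pair.
        uri, localname, rawname = name
        handler = self._element_handlers.get((uri, localname))
        if handler is not None:
            handler(rawname, attrs)


d = Dispatcher()
d.startElement((XSLT_NS, "template", "xsl:template"), {})
d.startElement((XSLT_NS, "template", "t:template"), {})
d.startElement((None, "template", "template"), {})
```

The two differently prefixed events reach the same handler, and the
un-namespaced "template" is ignored, so the prefix never takes part in
the comparison.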

>  - it makes it harder to make people understand that the prefix used
>    in the XML document is not part of the element name

Namespaces are complicated and nasty. The old SAX API did not change
that.

>  - I don't see anything that is made easier or faster by this
>    representation

The DOM, XPath and XSLT-based APIs will need the URI, localname, rawname
triple (or at least URI/rawname). It would be nice to pass the same
tuple from Pyexpat->SAX->... with no rebundling. In fact, I hope to
optimize the SAX layer away altogether. (by making PyExpat a SAX parser
and minidom et. al. SAX consumers).

>  - we already discussed this and decided for something else; I think
>    we should not change our minds without good reason

Agreed. We can go back to the way things were easily at this point. My
reason for changing is that I now have a clearer picture of the needs of
higher layers. Requiring all three is not the exception: it is the rule.

> I don't fully understand this argument. What benefits do you see over
> the (uri, localname), qname representation? Also, don't you think any
> speed gains will be lost here in the cost from all the method calls
> and object instantiations?

It is precisely method calls and object instantiations that I'm trying
to avoid.

> However, after benchmarking the speed of a real application
> (Shakespeare to HTML converter) which created a new instance for each
> call against one that recycled the old instance (but updated the
> internal attributes) I gave this up. I think the speed increase was
> from 99 to 96 seconds for converting all of Shakespeare's plays.

First, would it have complicated your life so much to say:

	attrs=AttributesImpl( attrs )

Second, it isn't fair to compare building new attribute lists versus
using existing ones when I am talking about not building them at all.
Consider applications that do not deal with domain-specific attribute
names at all: canonicalizing, normalizing, PYX generation, DOM generation
and so forth. For these apps, the AttributeList API is expensive both
because of the object creations and because it isn't designed for
sequential access (versus name-based access).
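
To make the sequential-access case concrete, here is a rough sketch of
a PYX-style start-tag writer. It assumes, purely for illustration, that
attributes arrive as a plain dictionary keyed by (URI, localname,
rawname) triples; pyx_start is a made-up helper, not an existing
function:

```python
def pyx_start(name, attrs):
    """Return PYX-style lines for one start tag.

    Assumes `name` is a (URI, localname, rawname) triple and `attrs`
    is a plain dict mapping such triples to string values.
    """
    uri, localname, rawname = name
    lines = ["(" + rawname]
    # Walk the attributes in order; no wrapper object is built and
    # no name-based lookup is performed.
    for (attr_uri, attr_local, attr_raw), value in attrs.items():
        lines.append("A" + attr_raw + " " + value)
    return lines


lines = pyx_start(
    (None, "a", "a"),
    {(None, "href", "href"): "x.html"},
)
```

A consumer like this only ever iterates, so anything the parser does to
support lookup by name is pure overhead for it.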

Third, the API has no facilities for looking things up based on (URI,*)
or (*,URI) (needed for XPath). That means that XPath-type applications
may need to copy the data out into another data structure. Notice also
that .items() doesn't return the rawnames needed for XPath and so forth,
so this copying is actually pretty expensive.
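
As a sketch of the kind of copying I mean, here is what a (URI,*)
lookup might look like if the attributes were handed over as a plain
dictionary keyed by (URI, localname, rawname) triples. Both that
representation and the index_by_uri helper are assumptions made for the
example:

```python
from collections import defaultdict

XLINK_NS = "http://www.w3.org/1999/xlink"


def index_by_uri(attrs):
    """Group attributes by namespace URI so (URI, *) lookups are cheap.

    Assumes `attrs` maps (URI, localname, rawname) triples to values.
    """
    by_uri = defaultdict(list)
    for (uri, localname, rawname), value in attrs.items():
        # The rawname travels with the data, so nothing has to be
        # reconstructed from prefixes afterwards.
        by_uri[uri].append((localname, rawname, value))
    return by_uri


attrs = {
    (XLINK_NS, "href", "xlink:href"): "a.xml",
    (XLINK_NS, "type", "xlink:type"): "simple",
    (None, "id", "id"): "n1",
}
by_uri = index_by_uri(attrs)
xlink_attrs = by_uri[XLINK_NS]
```

With the triples available, one pass builds the index; without the
rawnames, the application would need a second source for them.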

Overall, I think it tries to be "friendly" at too low a level in the
application stack. It can't know what the app programmer wants "down
there". Of course I think this way because I don't think that
"application" programmers should really work at the SAX layer because it
is inconvenient on many counts. It should be the most efficient possible
API for parsers and that's it. I am also trying to make SAX as efficient
as Expat's native API so that we do not have to support two APIs
forever.

I don't know if you read my description of the different sorts of APIs,
but I categorized them into four quadrants along two axes: tree-building
versus not, and object-building versus primitive-object-using. If we let
that guide us we come up with DOM, PullDOM, QPXML and SAX. SAX, then,
shouldn't be wasting time building objects. Leave that to EasySAX,
PullDOM, EventDOM, or whatever else.

SAX has always had multiple personalities in this way. It isn't really
simple. It isn't as efficient as it could be. It certainly isn't as
usable as it could be. I'm partial to more clear-cut, single-purpose
designs.

> I'm sorry to write an email that may seem so harsh, but I am really
> convinced that what you are proposing is really really wrong and that
> it is very important that we don't do this. The end result will, I
> think, be to make SAX a real pain to use, and I think the speed
> benefit is likely to be less than 5% for a reasonable application.

Let's say the speedup is "only" 5%, but 97% of all SAX-using programs
never use the SAX API directly anyhow? Then that speedup comes "for
free" from the point of view of those programmers. It's like a tweak in
the bowels of Python that makes Guido's job a little bit harder but
makes Python run faster for everyone. It's like rewriting a Python
module in C for the sake of all of the apps built on top of it. That's
where I think we are going. Heck, I wouldn't be surprised if we see Python
SAX producers and consumers both written in C soon. I think of it as a
device driver API.

> Hoping this is not too late...

No, it isn't too late and I consider SAX your domain. I just did what
seemed to me to be the best thing for performance when it comes to names
and I haven't touched AttributeList yet.

-- 
 Paul Prescod - Not encumbered by corporate consensus
The calculus and the rich body of mathematical analysis to which it 
gave rise made modern science possible, but it was the algorithm that 
made the modern world possible.
	- The Advent of the Algorithm (pending), by David Berlinski