[XML-SIG] Xalan and Xerces...

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 18 Oct 2000 01:22:45 +0200


> I understand that the fourthought offering might well be a good one, but I
> am a little confused as to what the fourthought offering might bring to
> Python that Xalan and Xerces would not, and why the fourthought offering has
> been blessed by the SIG for pyXML. [genuine pondering, not rhetorical]

I haven't used Xalan or Xerces - but how exactly would you integrate them
into PyXML? I think that is technically not feasible, at least not in
a 99% pure Python approach.

In addition, PyXML supports a number of parsers - validating ones,
fast ones, and super fast ones - I'd see no point in adding another
parser, unless it provides features not found in any of the existing
parsers.

> Windows developers are in the fortunate position of having a
> reference parser, MSXML to develop for. Any product aimed at the
> Windows platform can reasonably stipulate a requirement for MSXML.

Python, with Python 2, is also in the fortunate position of having a
reference parser - xml.parsers.expat. With PyXML, you get xmlproc and
sgmlop in addition to that.
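
For anyone who has not used it, here is a minimal sketch of driving
xml.parsers.expat directly, without SAX. The helper name count_elements
and the SAMPLE document are made up for illustration; only ParserCreate,
StartElementHandler and Parse come from the expat module itself.

```python
import xml.parsers.expat

# Illustrative sample document (not the personal.xml from Xerces).
SAMPLE = b'<personnel><person id="a"/><person id="b"/></personnel>'

def count_elements(data):
    """Count start tags and attributes using expat callbacks."""
    counts = {"elems": 0, "attrs": 0}

    def start_element(name, attrs):
        # By default expat passes the attributes as a dictionary.
        counts["elems"] += 1
        counts["attrs"] += len(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(data, True)  # True marks this as the final chunk
    return counts
```

The SAX interface used in the script at the end of this message sits on
top of the same parser, so the direct route is mainly useful when the
callback overhead of SAX matters.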

> Xalan and Xerces I would assert are the most likely candidates to become the
> cross platform (or indeed Unix focussed with Windows availability)
> reflection of MSXML. 

Nah, can't be :-) Python 2 (shipping today) already provides
cross-platform XML parsing - for Python. There is nothing wrong with
Xerces providing the same thing for Java - although I'd prefer a
parser running in compiled code any time.

> A stable, well documented, widely distributed XML parser and XSL
> processing engine.

For the PyXML parsers, I think pretty much the same can be said.

> Also given that C++ and Java versions of Xalan and Xerces are available,
> this would have to me at least seemed a perfect fit for Python and JPython
> both.

I can't really comment on the quality of the C++ version of Xerces - I
can't imagine it is completely compatible with the Java version,
though. Even if it were, arranging the same *Python* interface to both
might be a challenge.

> Why has fourthought's offering been chosen over Xalan and Xerces?
> [again genuine question, not rhetorical]

Please understand that PyXML is *not* a Fourthought offering. They
have provided the DOM implementation, and they will provide the XSLT
implementation - the parsers come from many other sources.

Being confronted with Xerces for the first time, I took the
opportunity to port their SAXCount example to PyXML, which took me
half an hour (plus or minus five minutes), including installing Xerces.

On my system (AMD K6, 350MHz, JDK 1.3.0beta-b07) I got the following
results:

Xerces with no options:
data/personal.xml: 903 ms (37 elems, 18 attrs, 26 spaces, 242 chars)
Xerces with -w (i.e. parse the file once, then measure the time for a second run):
data/personal.xml: 85 ms (37 elems, 18 attrs, 26 spaces, 242 chars)
PyXML 0.6.1, expat as the parser:
data/personal.xml: 0.0128449s (37 elems, 12 attrs,0 spaces, 268 chars)

First, you'll notice that Python beats Java by an order of magnitude
even in the "fast" Java case. I'm not really surprised - expat is a
fast parser, and it is written in C.

Next, you'll notice that expat does not report ignorableWhitespace;
instead, the spaces are reported as character data. I'm not sure which
one is right here (or whether both are acceptable) - both parsers
operate in a non-validating mode. Could somebody clarify?
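
The whitespace behaviour can be reproduced on a tiny document. This is
only a sketch: the WhitespaceCounter class, the DOC string and the
count() helper are invented for the illustration, and it assumes the
default SAX parser is the expat driver.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Illustrative document: whitespace between elements plus two characters.
DOC = b"<root>\n  <item>ab</item>\n</root>"

class WhitespaceCounter(ContentHandler):
    """Tally what arrives via characters() and ignorableWhitespace()."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chars = 0
        self.spaces = 0

    def characters(self, content):
        self.chars += len(content)

    def ignorableWhitespace(self, whitespace):
        self.spaces += len(whitespace)

def count(doc=DOC):
    handler = WhitespaceCounter()
    xml.sax.parseString(doc, handler)
    return handler.chars, handler.spaces
```

With the expat driver, the inter-element whitespace is added to chars
and spaces stays at zero, which matches the numbers above.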

The difference in number of attributes apparently comes from Xerces
passing the default value for an implied attribute from the DTD,
whereas expat doesn't.
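
Whether a default is reported may depend on where the declaration
lives, so it is worth checking for your own setup. The sketch below
declares a default in the internal DTD subset and counts the attributes
seen by a SAX handler. The document, the contr attribute and the helper
names are made up; it does not reproduce the personal.xml setup, and it
assumes the default SAX parser is the expat driver.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Illustrative document: "contr" is declared with a default value in the
# internal subset and is not written on the element itself.
DOC_WITH_DEFAULT = b"""<!DOCTYPE person [
<!ATTLIST person contr CDATA "false">
]>
<person id="a"/>"""

class AttrCounter(ContentHandler):
    """Record the attribute names passed to startElement()."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.extend(attrs.getNames())

def attribute_names(doc=DOC_WITH_DEFAULT):
    handler = AttrCounter()
    xml.sax.parseString(doc, handler)
    return sorted(handler.names)
```

If attribute_names() returns both "contr" and "id", the parser applies
defaults from the internal subset, and a difference in counts would
then point at declarations it did not read.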

See below for the source of that ported example.

> If Python had production quality XML/XSL support and a core Apache
> module (I realise there are two or more such modules existing, but
> again IMHO they are not well focused by the community, and of
> unverified / unproven strength) then Python could capitalise on a
> cross-platform web-development role.

I think Python does capitalise on a cross-platform web-development
role. However, if you think more needs to be done - just go ahead and
do it :-)

> In an ideal world a Python DOM/XPath/XSLT wrapper that could mask
> either a Xalan/Xerces or MSXML core, with an automatic switch
> dependant upon platform and availability might start to qualify for
> the term "full XML support".

I can imagine using MSXML when that is available, and Xerces when it
is available (i.e. in JPython). That should be as simple to support as
adding SAX drivers. However, that is not strictly necessary - Python
has "full XML support" right now.
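
To illustrate the driver idea: make_parser accepts a list of driver
module names to try before its built-in defaults. The module name below
is purely hypothetical (no MSXML or Xerces driver is assumed to exist),
so this sketch simply falls back to whatever parser is installed.

```python
from xml.sax import make_parser
from xml.sax.xmlreader import XMLReader

# Hypothetical driver module name, used only to show the fallback; a real
# driver would be a module exposing a create_parser() function.
PREFERRED_DRIVERS = ["drv_msxml_hypothetical"]

def pick_parser(preferred=PREFERRED_DRIVERS):
    """Return the first available SAX parser, preferring the given drivers."""
    # Drivers that cannot be imported are skipped, and the default
    # parser list is tried afterwards.
    return make_parser(preferred)
```

Application code written against the SAX handler interfaces would not
need to change when a different driver is picked up.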

> My own personal view is that such a Web development niche focused upon ease
> of XML development is essential for Pythons long term viability as a
> development language (as opposed to a spare wrench in the toolbox).

That is a little too much marketing speak for me. I will
continue to use Python as long as it is useful for me - regardless of
others considering it viable for something or not.

> I would like to have developers think of Python and XML in the same
> way as they think of Perl and regular expressions.

I don't think spreading FUD that Python currently does not support XML
helps with that, though...

Regards,
Martin

# Example adapted from Xerces' sax.SAXCount
from xml.sax import ContentHandler, make_parser
from xml.sax.handler import feature_namespaces
from time import time

setValidation = 0
setNameSpaces = 1
setSchemaSupport = 1
warmup = 0

class SAXCount(ContentHandler):
    def startDocument(self):
        if warmup:return
        self.elems = 0
        self.attrs = 0
        self.chars = 0
        self.spaces = 0

    def startElementNS(self,name,qname,attrs):
        if warmup:return
        self.elems += 1
        self.attrs += len(attrs)

    def characters(self,chars):
        if warmup:return
        self.chars += len(chars)

    def ignorableWhitespace(self,chars):
        if warmup:return
        self.spaces += len(chars)

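    def printResults(self, uri, elapsed):
        # 'elapsed' is the parse time in seconds; the name avoids
        # shadowing time() imported at module level.
        print "%s: %gs" % (uri, elapsed),
        print "(%(elems)d elems, %(attrs)d attrs,"\
              "%(spaces)d spaces, %(chars)d chars)" %\
              vars(self)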

def printit(uri):
    global warmup
    counter = SAXCount()
    parser = make_parser()
    parser.setContentHandler(counter)
    # not setting error handler
    # parser.setFeature(feature_validation, setValidation)
    parser.setFeature(feature_namespaces, setNameSpaces)
    # parser.setFeature(feature_schema, setSchema)
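    # This parse always runs, so the timed run below is never the
    # first parse of the document, with or without -w.
    parser.parse(uri)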
    if warmup:
        parser.parse(uri)
        parser.reset()
        warmup = 0
    start = time()
    parser.parse(uri)
    counter.printResults(uri,time()-start)

if __name__=='__main__':
    # todo: argument processing
    import sys,getopt
    opts, args = getopt.getopt(sys.argv[1:], "w")
    for opt,val in opts:
        if opt == '-w':
            warmup = 1
    printit(args[0])