[XML-SIG] Xalan and Xerces...

uche.ogbuji@fourthought.com
Tue, 17 Oct 2000 17:43:42 -0600


> Being confronted with Xerces for the first time, I took the
> opportunity to port their SAXCount example to PyXML, which took me
> half an hour (plus minus five minutes), including installing Xerces.
> 
> On my system (AMD K6, 350MHz, JDK 1.3.0beta-b07) I got the following
> results:
> 
> Xerces with no options:
> data/personal.xml: 903 ms (37 elems, 18 attrs, 26 spaces, 242 chars)
> Xerces with -w (i.e. parse the file once, then measure time for second run)
> data/personal.xml: 85 ms (37 elems, 18 attrs, 26 spaces, 242 chars)
> PyXML 0.6.1, expat as the parser:
> data/personal.xml: 0.0128449s (37 elems, 12 attrs,0 spaces, 268 chars)

Good stats to have on hand.  Thanks.

> First, you'll notice that Python beats Java by an order of magnitude
> even in the "fast" java case. I'm not really surprised - expat is a
> fast parser, and it is written in C.
> 
> Next, you'll notice that expat does not report ignorableWhitespace;
> instead, the spaces are reported as character data. I'm not sure which
> one is right here (or whether both are acceptable) - both parsers
> operate in a non-validating mode. Somebody cares to clarify.

There is really no such thing as ignorable whitespace in non-validating mode.
According to XML 1.0, whitespace can only be ignored when it occurs where there
is no corresponding #PCDATA in the content model from the DTD.  Since the DTD
is not used in non-validating mode, the parser _cannot_ assume that the
whitespace is ignorable.

So in this case expat is right and Xerces is wrong.
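
For anyone who wants to check the whitespace behaviour themselves, here is a
minimal sketch of a SAXCount-style handler. It is not the port discussed above:
it assumes the standard-library xml.sax module (which uses expat by default)
rather than PyXML 0.6.1, and the names CountingHandler, count_document and
SAMPLE are made up for illustration. The sample document has no DTD, so every
whitespace run between elements should arrive through characters().

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Tally the same four figures that SAXCount reports."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.attributes = 0
        self.spaces = 0           # whitespace reported as ignorable
        self.characters_seen = 0  # everything reported as character data

    def startElement(self, name, attrs):
        self.elements += 1
        self.attributes += len(attrs)

    def characters(self, content):
        # May be called several times for one text run; only the total matters.
        self.characters_seen += len(content)

    def ignorableWhitespace(self, whitespace):
        self.spaces += len(whitespace)


# Hypothetical sample document, no DTD, Unix line endings.
SAMPLE = b'<people>\n  <person id="a"><name>Ann</name></person>\n</people>'


def count_document(data):
    """Parse `data` (bytes) and return (elems, attrs, spaces, chars)."""
    handler = CountingHandler()
    xml.sax.parseString(data, handler)
    return (handler.elements, handler.attributes,
            handler.spaces, handler.characters_seen)


if __name__ == "__main__":
    print("%d elems, %d attrs, %d spaces, %d chars" % count_document(SAMPLE))
```

With this handler the newlines and indentation between elements are added to
the character count, and the "spaces" figure stays at zero, which matches the
expat line in the quoted results.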


> The difference in number of attributes apparently comes from Xerces
> passing the default value for an implied attribute from the DTD,
> whereas expat doesn't.

Since expat is strictly non-validating, this is quite valid.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python