adding the XML to 2.0 to be a mistake?

Paul Prescod paulp at ActiveState.com
Thu Jan 18 23:41:21 EST 2001


Robert Roy wrote:
> 
> ...
>
> I agree with what you are saying. Another aspect that concerns me is
> that with the addition of the XML tools, xmllib is now deprecated. The
> recommended alternative, SAX, does not offer the level of control that
> xmllib does.

xmllib has a big problem: it doesn't parse XML correctly. No matter what
its API virtues, I consider that a flaw that disqualifies it for the
position of "standard Python parsing library."

At the bottom I'll attach a perfectly valid XML document that the Python
2.0 xmllib will not support without MAJOR hacking on the part of the
application developer.

As to your complaints about SAX: the one constant in the world of XML
and SGML, going back over 10 years, is that people always complain that
their parsers do not give them low enough access to the parse stream.
This is because parsers are optimized for the *average case*. In the
average case, you don't want to do your own entity lookups.

If what you want is an XML tokenizer, then perhaps xmllib is just right.
But even then it isn't perfect because it will return your attributes in
an arbitrary order, not the order they were specified in the document. I
need an XML tokenizer for a project I am working on and I am going to
have to wrap Expat's XMLTok API.

I'm not disputing the usefulness of XML tokenizers -- I'm disputing their
relative utility compared to XML parsers.

>  For several tasks (eg: translation to another DTD/Schema) it is
> desirable not to resolve any character entities including the
> standard XML entity defs. ...

What you want isn't an XML parser. XML parsers are required by the
specification to resolve character entities.
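
A minimal sketch of what that looks like in practice, assuming a modern
Python 3 whose xml.sax uses the Expat-based driver. The names
TextCollector and collect_text are made up for this illustration. The
application only ever sees the resolved characters, never the
references themselves:

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so collect and join later.
        self.chunks.append(content)


def collect_text(document):
    """Parse `document` (bytes) and return its character data as one string."""
    handler = TextCollector()
    xml.sax.parseString(document, handler)
    return "".join(handler.chunks)


# A numeric character reference and a predefined entity reference.
text = collect_text(b"<a>&#65;&amp;B</a>")
print(repr(text))
```

By the time characters() is called, the information that the text was
written as a reference is gone, which is the loss of control being
complained about.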

> Undeclared entities are a problem in SAX but can be handled cleanly
> using the unknown_entityref mechanism in xmllib.

According to the XML specification, *all* entities must be declared. An
XML parser is required to check that.
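
A small sketch of that check, again assuming a modern Python 3 with the
Expat-based xml.sax driver. The helper name parses_cleanly is
hypothetical, and the test document deliberately has no DTD at all,
which is the simplest case in which an undeclared entity is an error:

```python
import xml.sax


def parses_cleanly(document):
    """Return True if `document` (bytes) parses without a SAX parse error."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(document, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


# &undeclared; is never declared, and the document has no DTD.
print(parses_cleanly(b"<a>&undeclared;</a>"))
# &amp; is one of the predefined entities, so no declaration is needed.
print(parses_cleanly(b"<a>&amp;</a>"))
```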

Here's the example XML document:

<!DOCTYPE Element [
<!ELEMENT Element ANY>
<!ENTITY abcdef "<Element/>">
]>
<Element>&abcdef;</Element>

SAX correctly reports two elements. xmllib incorrectly reports one. And
it isn't fair to come back with "Who cares about entities?" because you
were complaining about SAX's entity support. The real question is
whether you would rather have low-level or correct entity handling. The
answer is: a tokenizer should have low-level entity handling, and a
parser should have correct entity handling.
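
The SAX half of that claim can be checked with a short sketch. It
assumes a modern Python 3 with the Expat-based xml.sax driver, and the
names ElementRecorder and element_names are made up for the example.
xmllib is not shown because it was removed from later Python releases:

```python
import xml.sax

# The example document from this message, as bytes.
DOCUMENT = b"""<!DOCTYPE Element [
<!ELEMENT Element ANY>
<!ENTITY abcdef "<Element/>">
]>
<Element>&abcdef;</Element>"""


class ElementRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every start tag it is told about."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(document):
    """Parse `document` and return element names in start-tag order."""
    handler = ElementRecorder()
    xml.sax.parseString(document, handler)
    return handler.names


names = element_names(DOCUMENT)
print(len(names), names)
```

The second start tag comes from the replacement text of &abcdef;, which
the parser expands before reporting events to the application.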

 Paul Prescod



