[XML-SIG] [Baypiggies] News flash: Python possibly guilty in excessive DTD traffic
mike at skew.org
Sun Feb 17 13:36:00 CET 2008
Before looking for a bug, create a test case and verify that the behavior
isn't expected for it.
I mean, of *course* there'll be an attempt to fetch whatever DTD is mentioned
in a DOCTYPE when your XML processor is validating, and it's quite reasonable
to fetch one even when not validating, because there's more info in a DTD than
just what's needed for validation.
AFAICT, the main problem the W3C is talking about is not what happens when a
legitimate DTD request occurs in response to a system ID in a DOCTYPE, but
rather when there really shouldn't be such a request -- that is, when the
URL being fetched is just a namespace ID, not a DOCTYPE system ID.
What evidence is there that Python's standard XML libs are making illegitimate
requests for namespace IDs? I see none in that W3C blog post. Show us a
reproducible example of a namespace ID being subjected to a fetch attempt
while reading in an XML document with standard Python APIs. I don't think it's
happening at all.
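
A test case along those lines might look like the sketch below. It assumes
a current Python 3 with the stock Expat-based SAX driver, where external
entity processing has to be switched on through feature_external_ges (it
is off by default since Python 3.7.1). RecordingResolver and
requested_system_ids are made-up names for illustration, and the resolver
hands back an empty stand-in instead of touching the network.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
NS_URI = "http://www.w3.org/1999/xhtml"

# One document that mentions both kinds of URL: a DOCTYPE system ID
# and a namespace ID.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "%s">\n'
    '<html xmlns="%s"><head><title>t</title></head><body/></html>'
    % (DTD_URL, NS_URI)
).encode("utf-8")


class RecordingResolver(EntityResolver):
    """Hypothetical helper: log every system ID the parser asks for."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        # Return an empty byte stream so no network fetch is attempted.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


def requested_system_ids(document, external_entities=True):
    """Parse `document` and return the system IDs the parser tried to resolve."""
    resolver = RecordingResolver()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, external_entities)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(document))
    return resolver.requested
```

If the namespace ID ever showed up in the list returned by
requested_system_ids(DOC), that would be the evidence being asked for;
seeing only the DOCTYPE system ID there would point to ordinary DTD
processing.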
Apparently there *is* evidence that urllib is ultimately called by something
quite often to grab XHTML DTDs, and the HTTP response may not always be
handled very well. But assuming it's part of normal XML processing, we have no
details about whether it's a legitimate call for a DOCTYPE or an illegit one
for a namespace ID, and whether it's really unreasonable to keep trying to
fetch every time the reference is encountered. These sound like
application-level issues, not misbehavior by Python's SAX or DOM APIs.
That blog author also seems to feel it's unreasonable for an app to seek out
the same network-bound resource repeatedly, which is a sound position in some
document and application contexts, but not others; it really depends on the
situation, doesn't it? Sure, an app developer might be able to configure the
parser to not read external entities, or could cache responses to minimize
that traffic, if necessary, but it's not an obligation or necessarily a bug if
that doesn't happen. And the XML spec is silent on the issue of unfetchable
external entities anyway.
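
As a rough illustration of those two options with the standard xml.sax
API, here is a minimal sketch. CachingResolver, parse_with and the fetch
callable are hypothetical names, not anything shipped with Python; the
caller supplies fetch (for instance a thin wrapper around urllib), and
the cache lives only in memory for the life of the resolver.

```python
import io
import xml.sax
from xml.sax.handler import EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class CachingResolver(EntityResolver):
    """Fetch each system ID at most once, then replay it from memory."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical callable: system ID -> bytes
        self._cache = {}
        self.fetch_count = 0

    def resolveEntity(self, publicId, systemId):
        if systemId not in self._cache:
            self._cache[systemId] = self._fetch(systemId)
            self.fetch_count += 1
        # Hand the parser a fresh stream over the cached bytes.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self._cache[systemId]))
        return source


def parse_with(resolver, document, read_external=True):
    """Parse `document` (bytes); read_external=False skips external entities."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, read_external)
    parser.setEntityResolver(resolver)
    parser.parse(io.BytesIO(document))
```

Reusing one CachingResolver across many parses keeps repeat traffic down
to one fetch per distinct system ID, while passing read_external=False
avoids the fetch altogether at the cost of losing whatever the DTD would
have contributed.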
To answer your question, legitimate DTD processing is probably a feature of
the underlying parser (Expat). I assume it calls back to a urllib-based
resolver. But like I said, there's no bug there; just a lack of features to
encourage application developers to use XML catalogs.
I don't know if this helps... or am I missing something here?
Guido van Rossum wrote:
> On Feb 8, 2008 8:03 PM, Keith Dart <keith at dartworks.biz> wrote:
> > http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
> > This is interesting. I've noticed that when you use Python's XML
> > package in validating mode it does try to fetch the DTD. Be careful
> > when you use that.
> I think this is worth filing a bug, but I'd like to understand better
> where the call is made. I can't find any place in the standard xml
> package that does this -- but I'm not all that familiar with the code.
> Do you know if it's in the base xml package, or in etree, or in the
> separately distributed "XMLplus"? Any details you have would be
> appreciated (like a traceback from the point where the call is made).
> --Guido van Rossum (home page: http://www.python.org/~guido/)