[XML-SIG] [Baypiggies] News flash: Python possibly guilty in excessive DTD traffic

Sun Feb 17 13:36:00 CET 2008

Before looking for a bug, create a test case and verify that the behavior it 
shows isn't simply what's expected.

I mean, of *course* there'll be an attempt to fetch whatever DTD is mentioned 
in a DOCTYPE when your XML processor is validating, and it's quite reasonable 
to fetch one even when not validating, because a DTD carries more than just 
what's needed for validation: entity declarations and attribute defaults, for 
example.

AFAICT, the main problem the W3C is talking about is not what happens when a 
legitimate DTD request occurs in response to a system ID in a DOCTYPE, but 
rather when there really shouldn't be such a request -- that is, when the 
DTD's URL is just a namespace ID.

What evidence is there that Python's standard XML libs are making illegitimate 
requests for namespace IDs? I see none in that W3C blog post. Show us a 
reproducible example of a namespace ID being subjected to a fetch attempt 
while reading in an XML document with standard Python APIs. I don't think it's 
happening at all.
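
For what it's worth, here is the sort of test case I have in mind. It's only a 
sketch, assuming the stock xml.sax Expat driver; RecordingResolver and 
requested_system_ids are names made up for illustration. The resolver hands 
back an empty stand-in DTD, so nothing is downloaded, and the external-entity 
feature is set explicitly rather than relying on the default, which has not 
been the same in every Python version.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

NAMESPACE = "http://www.w3.org/1999/xhtml"
DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

# A document carrying both a DOCTYPE system ID and a namespace ID.
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '    "%s">\n'
    '<html xmlns="%s"><head><title>t</title></head><body/></html>\n'
    % (DTD_URL, NAMESPACE)
).encode("ascii")


class RecordingResolver(EntityResolver):
    """Hypothetical resolver: logs every system ID the parser asks for."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        source = InputSource(systemId)
        # Empty stand-in for the DTD, so no network request is made.
        source.setByteStream(io.BytesIO(b""))
        return source


def requested_system_ids(document, external_entities):
    """Parse `document` and return the system IDs the parser tried to resolve."""
    resolver = RecordingResolver()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, external_entities)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(document))
    return resolver.requested
```

If the namespace ID ever shows up in the list returned by 
requested_system_ids(DOC, True), that would be the evidence I'm asking for. 
The DOCTYPE's system ID showing up there is just the parser doing its job.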

Apparently there *is* evidence that urllib is ultimately called by something 
quite often to grab XHTML DTDs, and the HTTP response may not always be 
handled very well. But assuming it's part of normal XML processing, we have no 
details about whether it's a legitimate call for a DOCTYPE or an illegitimate 
one for a namespace ID, or whether it's really unreasonable to keep trying to 
fetch every time the reference is encountered. These sound like 
application-level issues, not misbehavior by Python's SAX or DOM APIs.

That blog author also seems to feel it's unreasonable for an app to seek out 
the same network-bound resource repeatedly, which is a sound position in some 
document and application contexts, but not others; it really depends on the 
situation, doesn't it? Sure, an app developer might be able to configure the 
parser to not read external entities, or could cache responses to minimize 
that traffic, if necessary, but it's not an obligation or necessarily a bug if 
that doesn't happen. And the XML spec is silent on the issue of unfetchable 
external entities anyway.
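
To make those two application-level options concrete, here is a rough sketch, 
again assuming xml.sax. CachingResolver, fetch_url and parse_with are 
hypothetical names, not anything the standard library ships. The resolver 
fetches each system ID at most once per resolver object, and the `external` 
flag switches external entity processing off altogether.

```python
import io
import urllib.request
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '    "%s">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>\n' % DTD_URL
).encode("ascii")


def fetch_url(system_id):
    """Default fetcher (an assumption): one plain GET through urllib."""
    with urllib.request.urlopen(system_id) as response:
        return response.read()


class CachingResolver(EntityResolver):
    """Hypothetical resolver that fetches each system ID at most once."""

    def __init__(self, fetch=fetch_url):
        self._fetch = fetch
        self._cache = {}

    def resolveEntity(self, publicId, systemId):
        # Only hit the fetcher the first time a system ID is seen.
        if systemId not in self._cache:
            self._cache[systemId] = self._fetch(systemId)
        source = InputSource(systemId)
        source.setPublicId(publicId)
        source.setByteStream(io.BytesIO(self._cache[systemId]))
        return source


def parse_with(resolver, document, external=True):
    """Parse `document`; external=False tells the parser to skip external entities."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, external)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(document))
```

An app that reads many XHTML documents could keep one CachingResolver around 
and pass it to every parse_with() call. Whether that's worth doing is the 
app's decision, which is really my point.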

To answer your question, legitimate DTD processing is probably a feature of 
the underlying parser (Expat). I assume it calls back to a urllib-based 
resolver. But like I said, there's no bug there; just a lack of features to 
encourage application developers to use XML catalogs.
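
A quick way to see the division of labor, as a sketch against the pyexpat 
binding (external_refs is a made-up helper name): Expat does no network I/O 
itself. It reports the external reference to a callback, and whatever 
fetching happens is done by the Python code behind that callback.

```python
import xml.parsers.expat as expat

NAMESPACE = "http://www.w3.org/1999/xhtml"
DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '    "%s">\n'
    '<html xmlns="%s"><body/></html>\n' % (DTD_URL, NAMESPACE)
).encode("ascii")


def external_refs(document):
    """Return the (public ID, system ID) pairs Expat reports for `document`."""
    seen = []
    parser = expat.ParserCreate()
    # Ask Expat to report the external DTD subset and parameter entities.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def on_external_entity(context, base, system_id, public_id):
        seen.append((public_id, system_id))
        return 1  # nonzero: carry on parsing; nothing is loaded here

    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return seen
```

Calling external_refs(DOC) lists only what the DOCTYPE refers to. The xmlns 
value never reaches the callback, since to Expat it is just an attribute 
value.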

I don't know if this helps... or am I missing something here?

Guido van Rossum wrote:
> [+xml-sig]
> 
> On Feb 8, 2008 8:03 PM, Keith Dart <keith at dartworks.biz> wrote:
> >
> > http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
> >
> > This is interesting. I've noticed that when you use Python's XML
> > package in validating mode it does try to fetch the DTD. Be careful
> > when you use that.
> 
> I think this is worth filing a bug, but I'd like to understand better
> where the call is made. I can't find any places in the standard xml
> package that does this -- but I'm not all that familiar with the code.
> Do you know if it's in the base xml package, or in etree, or in the
> separately distributed "XMLplus"? Any details you have would be
> appreciated (like a traceback from the point where the call is made).
> 
> -- 
> --Guido van Rossum (home page: http://www.python.org/~guido/)