[python-advocacy] Python makes the "most wanted list"
Paul Boddie
paul at boddie.org.uk
Tue Feb 12 00:22:26 CET 2008
On Monday 11 February 2008 21:09:23 Michael Foord wrote:
> Laura Creighton wrote:
> > I'm having a hard time understanding what is going on here.
> >
> > Do we have a library which for no reason just calls up the W3C site
> > and pesters them?
Yes.
[...]
> The library doesn't do it deliberately. It is probably caused by
> applications parsing html and following all links (extracted with
> regular expressions). The W3C schema definitions look like links (they
> are URLs) - so these pages get unnecessarily fetched millions of times.
This is not the case, and it's interesting to see how everyone jumped to that
conclusion without bothering to do a search on the standard library. If you
do so, there are two places which stand out:
xml/dom/xmlbuilder.py
xml/sax/saxutils.py
What gives them away as the cause of the described problem is that they are
both fetching things which are given as "system identifiers" - the things
you get in the document type declaration at the top of an XML document which
look like a URL.
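
To make the term concrete, here is a minimal sketch which pulls the public
and system identifiers out of a document type declaration using the
standard library's xml.parsers.expat module. The sample document and the
helper name doctype_identifiers are made up for illustration; a parser
created this way should only report the identifiers and not fetch anything.

```python
import xml.parsers.expat

# Illustrative document: the DOCTYPE carries a public identifier and a
# system identifier, the latter being the thing that looks like a URL.
XHTML_DOC = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello</p></body>
</html>
"""


def doctype_identifiers(document):
    """Return (name, public_id, system_id) from the DOCTYPE, or None."""
    found = []

    # pyexpat calls this with the doctype name, then the system identifier,
    # then the public identifier, then an internal-subset flag.
    def start_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = start_doctype
    parser.Parse(document, True)
    return found[0] if found else None
```

Calling doctype_identifiers(XHTML_DOC) hands back the XHTML 1.1 DTD URL as
the system identifier, which is exactly the kind of value the two modules
above end up opening.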
If you put some trace statements into the code and then try to parse
something using, for example, the xml.sax API, it becomes evident that by
default the parser attempts to fetch lots of DTD-related resources, not
helped by the way that stuff like XHTML is now "modular" and thus employs
lots of separate files in the DTD. This is obvious because you get something
like this printed to the terminal:
saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod
Of course, the "best practice" with APIs like SAX is that you define your own
resolver or handler classes which don't go and fetch DTDs from the W3C all
the time, but this isn't the "out of the box" behaviour. Instead,
implementers have chosen the most convenient behaviour, which arguably
involves the least effort in telling people how to get hold of DTDs so that
they may validate their documents, but which isn't necessarily the "right
thing" in terms of network behaviour. Naturally, since defining specific
resolvers/handlers involves a lot of boilerplate (and you should try it in
Java!), a lot of developers just incur the penalty of having the default
behaviour, instead of considering the finer points of the various W3C
specifications (which is never really any fun).
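
For what it's worth, the boilerplate in Python is fairly small. The
following is only a sketch of such a resolver, not a recommendation of the
one true way: the class names OfflineResolver and ElementCollector, the
helper parse_offline and the sample document are all invented for
illustration. It answers every request for an external entity with an empty
stream, so nothing is fetched, and it records which system identifiers were
asked for. One assumption to flag: as far as I know, recent Python releases
leave external general entity processing switched off in xml.sax by
default, so the sketch turns feature_external_ges on explicitly in order to
show the resolver being consulted.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Illustrative document referring to the XHTML 1.1 DTD.
XHTML_DOC = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello</p></body>
</html>
"""


class OfflineResolver(EntityResolver):
    """Hypothetical resolver: never touches the network."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        # Remember what was asked for, then supply an empty byte stream
        # so that the parser has no reason to open the URL itself.
        self.requested.append(systemId)
        source = InputSource(systemId)
        source.setPublicId(publicId)
        source.setByteStream(io.BytesIO(b""))
        return source


class ElementCollector(ContentHandler):
    """Hypothetical handler: records element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_offline(document):
    """Parse bytes with the offline resolver; return (requested, names)."""
    resolver = OfflineResolver()
    collector = ElementCollector()
    parser = xml.sax.make_parser()
    # Assumption: external entity handling must be enabled explicitly on
    # current Python versions before the resolver is consulted at all.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(document))
    return resolver.requested, collector.names
```

An empty stream means that entities declared only in the DTD stay
undefined, so a real application would more likely hand back a locally
stored copy of the DTD from resolveEntity instead.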
Anyway, I posted a comment saying much the same on the blog referenced at the
start of this thread, but we should be aware that this is default standard
library behaviour, not rogue application developer behaviour.
Paul