[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike

Wed Dec 3 09:20:02 EST 2003

On Wed, 3 Dec 2003, Stuart Langridge wrote:
> John J Lee spoo'd forth:
> > On Tue, 2 Dec 2003, Stuart Langridge wrote:
> >> Simon Willison spoo'd forth:
> >> > Is there any way we could get a DOM tree from invalid HTML using pure
> >> > Python tools? The HTML tools in the Python standard library at the
> >> Presumably we could (the existing things, like HtmlLib or microdom do
> >> it);
> >
> > No, they don't.  There's a whole wonderful world <wink> of invalid HTML
> > out there, that sgmllib and xml.dom.ext.reader.HtmlLib know nothing about.
>
> Really? What sort of thing do they fail to parse?

Hmm, I thought microdom used tidylib, but it seems not.  I haven't tried
microdom yet.  The problem is that tidylib has had a lot of input over many
years from people reporting bugs (where "bug" is very widely defined to
include failing to understand all kinds of bad HTML that one wouldn't
imagine people would write or browsers would put up with).  microdom
hasn't.  But maybe it works well enough.  It's not a full DOM
implementation, though.

BTW, I had thought of tidylib simply as a way of transforming HTML into
valid HTML or XHTML, not as a DOM implementation.  You could just have a
single tidy() function (like mxTidy, IIRC).

Here's some valid HTML that xml.dom.ext.reader.HtmlLib (from PyXML, and
based on sgmlop) fails to parse.

#!/usr/bin/env python

# Example from Martin v. Loewis (PyXML SF bug 409605).
# The missing optional <body> tag is not inferred.
good_html = """
 <html>
 I prefer (all things being equal)
 regularity/orthogonality and logical
 syntax/semantics in a language because there is less to
 have to remember.
 (Of course I know all things are NEVER really
 equal!)
 Guido van Rossum, 6 Dec 91
 The details of that silly code are irrelevant.
 Tim Peters, 4 Mar 92
 &amp; &lt; &gt; &eacute; &ouml; &nbsp;
 </html>
 """

from xml.dom.ext.reader.HtmlLib import FromHtml
from xml.dom.ext import XHtmlPrettyPrint

dom = FromHtml(good_html)
XHtmlPrettyPrint(dom)

That could be fixed.  Nobody has, probably because there are better XML
DOM parsers.
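
The same kind of gap can be seen without PyXML.  What follows is a minimal
sketch, assuming Python 3's html.parser as a stand-in for HtmlLib (it is a
different parser, so this does not reproduce the PyXML bug itself).  The
StartTagCollector class and collect_start_tags() helper are hypothetical
names made up for this illustration.  An event-based parser only reports
the tags that are literally present, so an omitted <body> never shows up
unless the tree builder on top infers it.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: remember every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to the parser and return the start tags it saw."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# Valid HTML with the optional <body> start tag left out.
sample = "<html>\nSome text with no explicit body tag.\n&amp; &lt; &gt;\n</html>"
start_tags = collect_start_tags(sample)
print(start_tags)
```

Only the html start tag is reported; anything that builds a DOM from these
events has to add the implied body element itself.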

IIRC HTMLParser still doesn't handle CDATA properly (this one has annoyed
a lot of people, but I don't think anybody has fixed it yet).

For invalid HTML, it's true that badly-matched tags tend to work OK with
HTMLParser, but of course that just gives you "bad callbacks" instead of
bad HTML, if you get what I mean -- if you want to build a DOM out of
that, for example, good luck.  I suppose this is really the most important
issue.
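
To make "bad callbacks" concrete, here is a small sketch, assuming Python
3's html.parser (the modern descendant of the HTMLParser module discussed
here).  EventLogger and log_events() are hypothetical names used only for
this illustration.  The parser accepts mis-nested tags without complaint
and simply reports them in source order, so the mismatch is passed on to
whoever consumes the callbacks.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays readable.
        if data.strip():
            self.events.append(("data", data))


def log_events(markup):
    """Parse markup and return the list of (kind, value) events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# <b> is closed before <i>, although <i> was opened inside it.
events = log_events("<b><i>bold italic</b></i>")
for event in events:
    print(event)
```

The end-tag events arrive as b, then i, which does not match the order the
elements were opened in.  A DOM builder driven by these callbacks has to
decide for itself how to repair the nesting.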

Browsers seem to be full of code to parse or ignore the weirdest stuff
that even the underlying parsers (HTMLParser, etc.) choke on: I've seen
things that look like SGML declarations <!...> but didn't even seem to be
valid SGML, let alone HTML (but I don't know SGML).

John