[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike [was: Re: Python version...]

Mon Dec 1 15:18:33 EST 2003

On Sun, 30 Nov 2003 21:13:54 +0000 (GMT)
John J Lee <jjl at pobox.com> wrote:

> On Sun, 30 Nov 2003, Stuart Langridge wrote:
> > John J Lee spoo'd forth:
> [...]
> > > Is this aimed at the standard library?  xml.dom.ext.reader.HtmlLib?
> [...]
> > Um. What I was looking for was something that could parse HTML
> > (including invalid HTML) and give me a DOM tree. I tried Twisted's
> 
> Fine, but what we're talking about here is what should go into Python's
> standard library.
> 
> [...]
> > I think
> > that a DOM parser for HTML is pretty important, even if that parser
> > *actually* just does "convert broken HTML to valid XHTML and then feed
> > it to minidom" or something similar. Are there any others?
> 
> There are lots of XML DOM implementations for Python (only one HTML DOM
> implementation, though: 4DOM -- and that's out of date), including the one
> that's already in the standard library.  Parsing arbitrary HTML is hard,
> though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
> DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
> the wild ;-).  tidylib is the only sane way I know of.  See below.

Hmmm, it sounds to me like implementing or updating the HTML parsing built into Python is worth considering, especially if the lack of it blocks several other possible initiatives.
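
To make the "tolerate broken HTML, then hand over an XML DOM" idea quoted above a bit more concrete, here is a minimal sketch. It assumes a present-day Python 3 standard library (html.parser and xml.dom.minidom). The names TolerantDOMBuilder and parse_html, and the synthetic "root" wrapper element, are made up for illustration. It does not implement real browser-style error recovery, it only closes the nearest matching open element and ignores stray end tags, so tidylib or similar remains the saner choice for HTML from the wild.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# HTML elements that never take a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TolerantDOMBuilder(HTMLParser):
    """Hypothetical helper: build a minidom tree from possibly broken HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        # Synthetic wrapper so that input with several top-level
        # elements (or none) still yields a single document element.
        self.root = self.doc.createElement("root")
        self.doc.appendChild(self.root)
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "checked") get their own name.
            node.setAttribute(name, name if value is None else value)
        self.stack[-1].appendChild(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly
        # closing anything opened inside it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Return a minidom Document for the given HTML text."""
    builder = TolerantDOMBuilder()
    builder.feed(text)
    builder.close()  # flush any buffered trailing text
    return builder.doc


if __name__ == "__main__":
    # Unclosed <b>, unclosed second <p>, bare <br>.
    doc = parse_html("<p>Hello <b>world</p><p>Second<br>line")
    print(doc.documentElement.toxml())
```

The resulting tree answers the usual DOM calls such as getElementsByTagName, which is about all the "feed it to minidom" approach promises.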

HTML may be on the way out, but I think we're stuck with it for the foreseeable future.

-Casey (running away ;^)