[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike [was:
Re: Python version...]
casey at zope.com
Mon Dec 1 15:18:33 EST 2003
On Sun, 30 Nov 2003 21:13:54 +0000 (GMT)
John J Lee <jjl at pobox.com> wrote:
> On Sun, 30 Nov 2003, Stuart Langridge wrote:
> > John J Lee spoo'd forth:
> > > Is this aimed at the standard library? xml.dom.ext.reader.HtmlLib?
> > Um. What I was looking for was something that could parse HTML
> > (including invalid HTML) and give me a DOM tree. I tried Twisted's
> Fine, but what we're talking about here is what should go into Python's
> standard library.
> > I think
> > that a DOM parser for HTML is pretty important, even if that parser
> > *actually* just does "convert broken HTML to valid XHTML and then feed
> > it to minidom" or something similar. Are there any others?
> There are lots of XML DOM implementations for Python (only one HTML DOM
> implementation, though: 4DOM -- and that's out of date), including the one
> that's already in the standard library. Parsing arbitrary HTML is hard,
> though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
> DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
> the wild ;-). tidylib is the only sane way I know of. See below.
Hmmm, it sounds to me like implementing or updating the HTML parsing built into Python is worth considering if it is blocking several other possible initiatives.
HTML may be on the way out, but I think we're stuck with it for the foreseeable future.
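To make the "broken HTML in, minidom tree out" idea from Stuart's message concrete, here is a rough sketch. It is not tidylib and not anything from xml.dom.ext; it leans on the html.parser module found in current Python releases, and the names SoupToDom, parse_html and VOID_TAGS are made up for illustration. It only recovers from unclosed and stray tags; it does not apply HTML's implied-end-tag rules, so a second <li> nests inside the first instead of becoming its sibling.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never take a closing tag in HTML (partial list, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SoupToDom(HTMLParser):
    """Hypothetical helper: build a minidom tree from possibly broken HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        # Always start from a single <html> root so the result is well-formed.
        self.stack = [self.doc.appendChild(self.doc.createElement("html"))]

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            return  # the root already exists
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Parse tag soup and return an xml.dom.minidom Document."""
    parser = SoupToDom()
    parser.feed(text)
    parser.close()  # flush any buffered text; open elements are left as built
    return parser.doc
```

Something like parse_html('<p class=x>unclosed <b>bold').toxml() then gives well-formed XML that minidom can read back, which is roughly the "convert and feed it to minidom" route in one step. Real-world pages need far more recovery logic than this.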
-Casey (running away ;^)