[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike [was: Re: Python version...]

Sun Nov 30 16:13:54 EST 2003

On Sun, 30 Nov 2003, Stuart Langridge wrote:
> John J Lee spoo'd forth:
[...]
> > Is this aimed at the standard library?  xml.dom.ext.reader.HtmlLib?
[...]
> Um. What I was looking for was something that could parse HTML
> (including invalid HTML) and give me a DOM tree. I tried Twisted's

Fine, but what we're talking about here is what should go into Python's
standard library.

[...]
> I think
> that a DOM parser for HTML is pretty important, even if that parser
> *actually* just does "convert broken HTML to valid XHTML and then feed
> it to minidom" or something similar. Are there any others?

There are lots of XML DOM implementations for Python (only one HTML DOM
implementation, though: 4DOM -- and that's out of date), including the one
that's already in the standard library.  Parsing arbitrary HTML is hard,
though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
the wild ;-).  tidylib is the only sane way I know of.  See below.
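
For what it's worth, the "parse leniently, then hand back a DOM" idea Stuart
describes can be sketched roughly as below.  This is only an illustration
and makes several assumptions: it uses the html.parser module from current
Python rather than tidylib, the names DomBuilder, parse_html and VOID and
the wrapping "root" element are made up for the example, and it does none
of the real error recovery that Tidy performs.  It just closes whatever is
still open and ignores stray end tags.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Hypothetical, partial list of HTML elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class DomBuilder(HTMLParser):
    """Build a minidom tree from HTML, tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.doc = minidom.Document()
        root = self.doc.createElement("root")  # assumed wrapper element
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "checked") arrive with value None.
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element.
        # An end tag with no matching open element is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Return a minidom Document for the given (possibly broken) HTML."""
    builder = DomBuilder()
    builder.feed(text)
    builder.close()
    return builder.doc
```

Something like parse_html("<p>unclosed <b>tags").toxml() then gives
well-formed XML that any XML DOM code can consume.  Real-world HTML needs
far more repair than this, which is the argument for tidylib.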

> > Why isn't it a subclass of urllib2.OpenerDirector (or, better, from
[...]
> Because I didn't know about it. This is because "urllib.urlopen" is
> hardwired into my fingers, and then I just overrode it with
> ClientCookie when I needed cookie handling. I'm entirely happy to have
> it work totally differently; this was really a proof-of-concept to get
> the ball rolling rather than a submission for direct inclusion.

Sure (you don't mean proof-of-concept, but I know what you mean).  I am
rolling that ball a bit :-)

[...]
> > I think there has to be some way of (optionally) linking up any browser
> > class to tidylib.
>
> I agree; tidylib is nice. AFAIK, though (and I probably am wrong) the
> only interface to Tidy is mxTidy, and I can never get it to install...

mxTidy is not an interface to tidylib.  mxTidy hacks the old HTMLTidy
source to make it into a shared library, and wraps it.  tidylib is a newer
version that does essentially the same shared-library conversion as
Marc-Andre did.  The difference is that it's actively maintained.  There's
a Python wrapper:

http://utidylib.sf.net/

which depends on ctypes.

Should tidylib be in the standard library?  On one hand, I lean towards
"no", because HTML is (in theory) on the way out.  OTOH, if it's going to
take another thirty years for HTML to completely go away, that may be a
silly attitude to take!  Opinions?  If it were to be in the std. lib., I
guess somebody would need to write a non-ctypes wrapper.

[ctypes itself would obviously be great to have in the standard library,
but that's up to Thomas Heller, and it's still under development.  More
importantly, it only works on Linux, Windows and Mac OS X (and any other
platforms that libffi is ported to).]

[...]
> > No .forward() / .backward() methods?
>
> Didn't think of them until after I sent the message out. They'd be
> pretty trivial to implement, though, although I don't know what you'd
> do about the "This page contains POSTDATA" issue that browsers get.
[...]

You're allowed to do whatever you like, really (RFC 2616 section 13.13).
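
One way to read that section: a history mechanism shows the page as it was
retrieved rather than fetching it again, so going back never has to resend
a POST.  A minimal sketch along those lines follows.  The History class and
its visit/back/forward methods are hypothetical names for illustration, not
part of any existing module, and the class stores whatever (url, body) pair
the caller hands it.

```python
class History:
    """Keep visited pages so back/forward never re-issue a request."""

    def __init__(self):
        self._pages = []  # list of (url, body) pairs, oldest first
        self._pos = -1    # index of the current page, -1 when empty

    def visit(self, url, body):
        # Visiting a new page discards any "forward" entries.
        del self._pages[self._pos + 1:]
        self._pages.append((url, body))
        self._pos += 1

    def back(self):
        if self._pos <= 0:
            raise IndexError("no earlier page")
        self._pos -= 1
        return self._pages[self._pos]

    def forward(self):
        if self._pos + 1 >= len(self._pages):
            raise IndexError("no later page")
        self._pos += 1
        return self._pages[self._pos]
```

A browser class would call visit() after each successful request and hand
back the stored body from back() and forward().  Whether to offer an
explicit reload that does resend the POST is a separate design decision.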

John