[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike

Tue Dec 2 11:58:46 EST 2003

On Tue, 02 Dec 2003 10:07:19 -0600
Simon Willison <cs1spw at bath.ac.uk> wrote:

> Stuart Langridge wrote:
> > I don't see that tidy's ability to tidy HTML per se is useful, but I
> > think that it's very useful in that it can take invalid HTML and
> > convert it to valid XHTML. That way, we can get a DOM tree from invalid
> > HTML, which is very useful...
> 
> Is there any way we could get a DOM tree from invalid HTML using pure 
> Python tools? The HTML tools in the Python standard library at the 
> moment are all pure Python. Could we even use the existing sgmllib 
> module (or an extension of it) to create our own DOM tree from invalid HTML?

According to the docs, tidylib exposes a DOM-like interface for walking the tree of any document it has parsed. My understanding is that this is designed to work for anything from broken HTML up to valid XHTML. If it works as advertised, it could be a good engine to put behind a nice Python API.

See: http://tidy.sourceforge.net/docs/api/group__Tree.html

The API gets a bit verbose in places (separate functions to test for each tag and attribute type). These look like complements to the generic functions, perhaps to avoid putting too much HTML knowledge directly in the user code.

Also, tidylib's memory allocation is hookable, in case we wanted to use Python's malloc/free (not sure whether we need to).
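
On Simon's pure-Python question, here is a rough sketch of what building a DOM tree from invalid HTML might look like without tidylib. It is only an illustration under some assumptions: sgmllib is gone from Python 3, so this uses html.parser.HTMLParser, which plays the same role, and SoupToDOM and parse_soup are hypothetical names made up for this example. It recovers from unclosed and stray tags by closing back to the nearest matching open element; it does not implement HTML's implicit-close rules (a second <p> nests inside the first), which is exactly the kind of knowledge tidy already has.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SoupToDOM(HTMLParser):
    """Hypothetical helper: build a minidom tree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        # Synthetic root so fragments with several top-level nodes still fit.
        root = self.doc.createElement("root")
        self.doc.appendChild(root)
        self.stack = [root]

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "checked") arrive with value None.
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element.
        # Index 0 (the synthetic root) is never popped; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Skip whitespace-only runs to keep the tree small.
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_soup(html):
    """Parse possibly invalid HTML and return an xml.dom.minidom Document."""
    parser = SoupToDOM()
    parser.feed(html)
    parser.close()  # flushes any trailing text still buffered
    return parser.doc
```

Something like parse_soup("<p>one<p>two <b>bold</p><br>tail").toxml() then serializes the result as well-formed XML, and the usual DOM calls such as getElementsByTagName work on the returned document.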

-Casey