[Tutor] Parsing HTML file [using HTML-Tidy]

Fri Dec 12 14:10:45 EST 2003

On Thu, 11 Dec 2003, Chris Heisel wrote:

> Actually, the HTML might very well be XHTML. The files I've examined
> have been valid XHTML, however the files were hand coded by a variety of
> folks so some of them might not be valid, would that cause the dom
> handler to spaz out?

Hi Chris,

Yeah, things would probably break on the invalid files: the DOM modules
require "well-formed" XML, and the parser will give up at the first
unclosed tag.  The program HTML-Tidy is meant for exactly this, though.  It
takes messy HTML and can write it back out as well-formed HTML, XHTML, or
XML.

    http://tidy.sourceforge.net/
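
To see the kind of breakage I mean, here's a small sketch.  I'm assuming
the standard library's xml.dom.minidom as the DOM module (your handler may
be a different one), and is_well_formed() is just a helper name I made up
for illustration:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        # expat refuses anything that is not well-formed XML
        return False
    return True


# Two <p> tags that are never closed: fine for a browser, not for a DOM parser.
print(is_well_formed('<p>hello<p>paragraph 2'))

# The same idea with the tag closed properly.
print(is_well_formed('<p>hello</p>'))
```

The first call should report a failure and the second a success, which is
roughly what separates the hand-coded files that will parse from the ones
that won't.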

M.A. Lemburg has written a Python module that exposes HTML-Tidy nicely:

    http://www.lemburg.com/files/python/mxTidy.html

Here's a quick example of it in action:

###
>>> from mx.Tidy import tidy
>>> results = tidy('<p>hello<p>paragraph 2', output_xhtml=1)
>>> print results[2]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p>hello</p>

<p>paragraph 2</p>
</body>
</html>
###

So one way to approach the problem is to pass all HTML through HTML-Tidy,
force it into a nicer structure, and then do your extraction on that
well-formed data.
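
As a rough sketch of that last extraction step, here's one way it might
look.  I'm again assuming xml.dom.minidom, and I've pasted Tidy's output
from the session above into a string so the example runs without mx.Tidy
installed.  node_text() and paragraphs() are hypothetical helper names, not
part of any library:

```python
from xml.dom import minidom

# The XHTML that HTML-Tidy produced in the session above, hard-coded here.
TIDIED = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p>hello</p>

<p>paragraph 2</p>
</body>
</html>
"""


def node_text(node):
    """Join the text of every text node found beneath the given node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            # Recurse so text inside nested tags like <b> or <a> is kept.
            parts.append(node_text(child))
    return "".join(parts)


def paragraphs(xhtml):
    """Return the text content of each <p> element in the document."""
    doc = minidom.parseString(xhtml)
    return [node_text(p) for p in doc.getElementsByTagName("p")]


print(paragraphs(TIDIED))
```

In a real run you'd hand paragraphs() the third element of whatever tidy()
returns instead of the hard-coded string.  One caveat: real pages that use
named entities like &nbsp; may still trip the parser, so treat this as a
starting point rather than a finished extractor.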

And even if you don't take an XML-parsing approach, it's still probably a
good idea to use HTML-Tidy as a first cleanup step, since it'll make the
HTML more regular and easier to process.  You have to assume that any
hand-written HTML has the possibility of being really wacky; better for
HTML-Tidy to try to figure it out than us.  *grin*