[Tutor] Parsing HTML file [using HTML-Tidy]
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Fri Dec 12 14:10:45 EST 2003
On Thu, 11 Dec 2003, Chris Heisel wrote:
> Actually, the HTML might very well be XHTML. The files I've examined
> have been valid XHTML, however the files were hand coded by a variety of
> folks so some of them might not be valid, would that cause the dom
> handler to spaz out?
Hi Chris,
Yeah, things would probably break without some cleanup first: the DOM
modules require "well-formed" XML, and a single unclosed tag is enough to
make the parser give up. The program HTML-Tidy is meant to take messy
HTML, though, and turn it into well-formed output such as XHTML.
http://tidy.sourceforge.net/
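To make the failure concrete, here is a minimal sketch of what "break"
looks like. It assumes the DOM module in question is the standard
library's xml.dom.minidom and that you're on a current Python 3; the
helper name is_well_formed is made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        # Raised for unclosed or mismatched tags, among other problems.
        return False
    return True


# Typical hand-written HTML: the <p> tags are never closed.
print(is_well_formed("<body><p>hello<p>paragraph 2</body>"))  # False

# The same content with every tag closed parses fine.
print(is_well_formed("<body><p>hello</p><p>paragraph 2</p></body>"))  # True
```

A check like this only tells you that the parser would choke; it does
nothing to repair the markup, which is where HTML-Tidy comes in.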
M.A. Lemburg has written a Python module that exposes HTML-Tidy nicely:
http://www.lemburg.com/files/python/mxTidy.html
Here's a quick example of it in action (tidy() returns a tuple, and the
cleaned-up markup is its third element):
###
>>> from mx.Tidy import tidy
>>> results = tidy('<p>hello<p>paragraph 2', output_xhtml=1)
>>> print results[2]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p>hello</p>
<p>paragraph 2</p>
</body>
</html>
###
So one way to approach the problem is to pass all HTML through HTML-Tidy,
force it into a nicer structure, and then do your extraction on that
well-formed data.
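As a rough sketch of the "do your extraction" half, the snippet below
feeds Tidy-style output to xml.dom.minidom and pulls out the paragraph
text. It assumes Python 3 and the standard library only; the TIDIED
string is just the output shown above pasted in as a stand-in for a real
mx.Tidy call, and the names node_text and extract_paragraphs are
hypothetical helpers, not part of any library.

```python
from xml.dom import minidom

# Stand-in for the cleaned-up markup that HTML-Tidy produced above.
TIDIED = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<p>hello</p>
<p>paragraph 2</p>
</body>
</html>
"""


def node_text(node):
    """Concatenate the text of every text node beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            # Recurse into nested elements such as <b> or <a>.
            parts.append(node_text(child))
    return "".join(parts)


def extract_paragraphs(xhtml):
    """Parse well-formed XHTML and return the text of each <p> element."""
    doc = minidom.parseString(xhtml)
    return [node_text(p) for p in doc.getElementsByTagName("p")]


print(extract_paragraphs(TIDIED))
```

One caveat with this approach: the parser does not fetch the external
DTD, so a document that uses named entities like &nbsp; may still be
rejected even after tidying.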
And even if you don't take an XML-parsing approach, it's still probably a
good idea to use HTML-Tidy as a first cleanup step, since it'll make the
HTML more regular and easier to process. You have to assume that any
hand-written HTML has the possibility of being really wacky, so it's
better for HTML-Tidy to try to figure it out than us. *grin*