HTML parsing abnormal html pages

Sun Mar 18 04:52:58 EST 2001

Tim Roberts wrote:
> I have been searching for an HTML pretty-printer; something where I can
> feed an arbitrary page and get a more structured, indented view.  I wrote a
> simple one myself, based on sgmllib; it does a fair job, but it is easily
> confused by such common offenses as omitted </p> tags.  It sounds like your
> BaseHTMLProcessor might be such a thing.  Is it available yet?
> 
> If not, is anybody aware of a fair HTML cleaner-upper?

You could use the python-xml code to slurp the HTML into a DOM, and then
format it using HtmlLineariser:

>>> from xml.dom.writer import HtmlLineariser
>>> from xml.dom.html_builder import HtmlBuilder
>>> builder = HtmlBuilder()
>>> builder.ignore_mismatched_end_tags = 1   # be less fussy about stray end tags
>>> html_text = open('public_html/index.html').read()
>>> builder.feed(html_text)
>>> pretty_printed = HtmlLineariser().linearise(builder.document)

pretty_printed is now a nicely indented version of html_text.

It's not the fastest thing in the world, but it might help you.
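
If python-xml is not to hand, the same idea can be roughed out with
nothing but the standard library's html.parser module.  This is only a
sketch and not the HtmlBuilder/HtmlLineariser code above: IndentingParser,
pretty_print, VOID and CLOSES_P are names made up for illustration, and the
list of tags that implicitly close an open <p> is an assumed, simplified
subset.  It collapses whitespace, so <pre> and <script> content will not
survive intact.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

# Start tags that implicitly close an open <p> (assumed, simplified subset).
CLOSES_P = {"p", "div", "ul", "ol", "table", "pre", "blockquote",
            "h1", "h2", "h3", "h4", "h5", "h6"}


class IndentingParser(HTMLParser):
    """Hypothetical helper: re-emit HTML one node per line, indented by depth."""

    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.stack = []   # names of the currently open elements
        self.lines = []   # output lines collected so far

    def _emit(self, text):
        self.lines.append(self.indent * len(self.stack) + text)

    def _close(self, tag):
        # Pop back to the nearest open `tag`, closing anything left open inside it.
        while self.stack:
            name = self.stack.pop()
            self._emit("</%s>" % name)
            if name == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            self._close("p")          # supply the omitted </p>
        self._emit(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:         # silently ignore mismatched end tags
            self._close(tag)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self._emit(escape(text, quote=False))

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)

    def close(self):
        super().close()
        while self.stack:             # close whatever the page left open
            self._emit("</%s>" % self.stack.pop())


def pretty_print(html_text):
    parser = IndentingParser()
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.lines)
```

Fed a fragment with an omitted </p> and a stray </b>, it behaves like this:

>>> print(pretty_print("<body><p>one<p>two</b></body>"))
<body>
  <p>
    one
  </p>
  <p>
    two
  </p>
</body>

Keeping a stack of open elements is what makes both repairs cheap: an
omitted </p> is supplied by popping the stack, and an end tag that matches
nothing on the stack is simply dropped.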

-Steve

-- 
Steve Purcell, Pythangelist
Get testing at http://pyunit.sourceforge.net/
Any opinions expressed herein are my own and not necessarily those of Yahoo