HTML parsing abnormal HTML pages

Steve Purcell stephen_purcell at
Sun Mar 18 10:52:58 CET 2001

Tim Roberts wrote:
> I have been searching for an HTML pretty-printer; something where I can
> feed an arbitrary page and get a more structured, indented view.  I wrote a
> simple one myself, based on sgmllib; it does a fair job, but it is easily
> confused by such common offenses as omitted </p> tags.  It sounds like your
> BaseHTMLProcessor might be such a thing.  Is it available yet?
> If not, is anybody aware of a fair HTML cleaner-upper?

You could use the python-xml code to slurp the HTML into a DOM, and then
format it using HtmlLineariser:

>>> from xml.dom.writer import HtmlLineariser
>>> from xml.dom.html_builder import HtmlBuilder
>>> builder = HtmlBuilder()
>>> builder.ignore_mismatched_end_tags = 1   # make less fussy 
>>> html_text = open('public_html/index.html').read()
>>> builder.feed(html_text)
>>> pretty_printed = HtmlLineariser().linearise(builder.document)

pretty_printed is now a nicely indented version of html_text.

It's not the fastest thing in the world, but it might help you.
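
If the python-xml modules above are not installed, a rough equivalent can be
sketched with the standard library's html.parser module. This is not the
HtmlBuilder/HtmlLineariser approach, only a minimal illustration of the same
idea. The names IndentingPrinter and pretty_print are invented for this
example, and the sketch assumes that the only omitted end tag worth
special-casing is </p>; stray end tags are simply ignored.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not be indented under.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndentingPrinter(HTMLParser):
    """Hypothetical helper: re-emits HTML one item per line, indented by depth."""

    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.stack = []   # names of currently open elements
        self.lines = []   # output lines collected so far

    def _emit(self, text):
        self.lines.append(self.indent * len(self.stack) + text)

    def _close(self, tag):
        # Close every open element up to and including `tag`.
        while self.stack:
            open_tag = self.stack.pop()
            self._emit("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Assumption: a new <p> implicitly ends a still-open <p>.
        if tag == "p" and "p" in self.stack:
            self._close("p")
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, open nothing.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Ignore end tags that match nothing open (the "less fussy" behaviour).
        if tag in self.stack:
            self._close(tag)

    def handle_data(self, data):
        if self.stack and self.stack[-1] in ("script", "style"):
            text = data.strip()  # raw text, must not be escaped
        else:
            text = escape(" ".join(data.split()), quote=False)
        if text:
            self._emit(text)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)

    def finish(self):
        # Flush buffered text, then close whatever is still open.
        self.close()
        while self.stack:
            self._emit("</%s>" % self.stack.pop())
        return "\n".join(self.lines)


def pretty_print(html_text):
    printer = IndentingPrinter()
    printer.feed(html_text)
    return printer.finish()
```

Calling pretty_print(open('public_html/index.html').read()) would return the
indented text as a single string. Because text is whitespace-normalised, this
sketch is unsuitable for pages that rely on <pre> formatting.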


Steve Purcell, Pythangelist
Get testing at
Any opinions expressed herein are my own and not necessarily those of Yahoo
