HTML parsing abnormal html pages
Steve Purcell
stephen_purcell at yahoo.com
Sun Mar 18 04:52:58 EST 2001
Tim Roberts wrote:
> I have been searching for an HTML pretty-printer; something where I can
> feed an arbitrary page and get a more structured, indented view. I wrote a
> simple one myself, based on sgmllib; it does a fair job, but it is easily
> confused by such common offenses as omitted </p> tags. It sounds like your
> BaseHTMLProcessor might be such a thing. Is it available yet?
>
> If not, is anybody aware of a fair HTML cleaner-upper?
You could use the python-xml code to slurp the HTML into a DOM, and then
format it using HtmlLineariser:
>>> from xml.dom.writer import HtmlLineariser
>>> from xml.dom.html_builder import HtmlBuilder
>>> builder = HtmlBuilder()
>>> builder.ignore_mismatched_end_tags = 1 # make less fussy
>>> html_text = open('public_html/index.html').read()
>>> builder.feed(html_text)
>>> pretty_printed = HtmlLineariser().linearise(builder.document)
pretty_printed is now a nicely indented version of html_text.
It's not the fastest thing in the world, but it might help you.
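If python-xml isn't installed, roughly the same idea can be sketched with nothing but the standard library's html.parser module. This is only a minimal sketch, not what HtmlBuilder or HtmlLineariser actually do: the names IndentingParser and pretty_print are made up for illustration, the only "repair" rule is that a new <p> closes an open <p>, and comments, doctypes and script/style contents are not treated specially.

```python
import html
from html.parser import HTMLParser

# Elements that never take an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndentingParser(HTMLParser):
    """Hypothetical helper: collects an indented rendering of what it is fed."""

    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.stack = []   # names of the elements currently open
        self.lines = []   # indented output lines

    def _emit(self, text):
        self.lines.append(self.indent * len(self.stack) + text)

    def _close(self, tag):
        # Pop open elements until `tag` is closed, emitting the end tags
        # that the page omitted along the way.
        while self.stack:
            open_tag = self.stack.pop()
            self._emit("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, attrs):
        # Simplifying assumption: a new <p> implicitly closes an open <p>.
        if tag == "p" and "p" in self.stack:
            self._close("p")
        attr_text = "".join(
            " %s" % name if value is None
            else ' %s="%s"' % (name, html.escape(value, quote=True))
            for name, value in attrs)
        self._emit("<%s%s>" % (tag, attr_text))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore end tags that match nothing open (the "less fussy" part).
        if tag in self.stack:
            self._close(tag)

    def handle_data(self, data):
        text = " ".join(data.split())   # collapse runs of whitespace
        if text:
            self._emit(html.escape(text, quote=False))

    def finish(self):
        # Close anything the page left open at the end of input.
        while self.stack:
            self._emit("</%s>" % self.stack.pop())


def pretty_print(html_text):
    parser = IndentingParser()
    parser.feed(html_text)
    parser.close()
    parser.finish()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(pretty_print("<html><body><p>one<p>two</body></html>"))
```

Keeping a stack of open elements is what lets the omitted </p> tags be filled in: whenever an end tag (or a new <p>) arrives, everything above the matching entry gets closed first.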
-Steve
--
Steve Purcell, Pythangelist
Get testing at http://pyunit.sourceforge.net/
Any opinions expressed herein are my own and not necessarily those of Yahoo