HTML parsing abnormal HTML pages
Tim Roberts
timr at probo.com
Mon Mar 19 14:24:39 EST 2001
You wrote:
>
>Tim Roberts wrote:
>> ... If not, is anybody aware of a fair HTML cleaner-upper?
>
>You could use the python-xml code to slurp the HTML into a DOM, and then
>format it using HtmlLineariser:
>
>>>> from xml.dom.writer import HtmlLineariser
>>>> from xml.dom.html_builder import HtmlBuilder
>>>> builder = HtmlBuilder()
>>>> builder.ignore_mismatched_end_tags = 1 # make less fussy
>>>> html_text = open('public_html/index.html').read()
>>>> builder.feed(html_text)
>>>> pretty_printed = HtmlLineariser().linearise(builder.document)
>
>pretty_printed is now a nicely indented version of html_text.
>
>It's not the fastest thing in the world, but it might help you.
Thanks for taking the time to reply. Maybe I'm a bonehead, but I can't find the imports you've mentioned. I downloaded PyXML 0.6.4 (and 0.6.2 just to check), but HtmlLineariser, HtmlBuilder, and html_builder.py do not seem to exist. The documentation refers to them, and one of the test routines (test_htmlb.py) calls them, but they aren't in the xml/dom tree anywhere.
Has this interface been completely replaced? It looks to me like this:

    from xml.dom.ext.reader import HtmlLib
    from xml.dom.ext import PrettyPrint
    pretty_printed = PrettyPrint( HtmlLib.FromStream(...) )

performs the same function. Have I missed a clue somewhere?
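
In case neither module is installed, the general idea (parse leniently, drop end tags that match nothing, re-indent the rest) can be sketched with the standard library alone. This is only an illustration and not the PyXML interface: it assumes Python 3's html.parser, and TolerantIndenter and tidy_html are hypothetical names made up for the sketch. It collapses whitespace, so it is not safe for <pre> content.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}
# Elements whose text the parser hands over raw (no entity decoding).
RAW_TEXT_TAGS = {"script", "style"}


class TolerantIndenter(HTMLParser):
    """Hypothetical helper: re-indent HTML, ignoring stray end tags."""

    def __init__(self):
        super().__init__()      # convert_charrefs=True by default
        self.open_tags = []     # stack of currently open element names
        self.lines = []         # output, one indented line per item

    def _emit(self, text):
        self.lines.append("  " * len(self.open_tags) + text)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, open nothing.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return              # mismatched end tag: silently drop it
        # Close anything left open inside this element, then the element.
        while self.open_tags:
            name = self.open_tags.pop()
            self._emit("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_tags and self.open_tags[-1] in RAW_TEXT_TAGS:
            self._emit(text)    # script/style content arrives undecoded
        else:
            # Character references were decoded, so re-escape on output.
            self._emit(escape(text, quote=False))

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)

    def finish(self):
        self.close()            # flush any buffered text
        while self.open_tags:   # close elements still open at end of input
            self._emit("</%s>" % self.open_tags.pop())
        return "\n".join(self.lines)


def tidy_html(text):
    """Return an indented copy of text (a sketch, not a validator)."""
    parser = TolerantIndenter()
    parser.feed(text)
    return parser.finish()
```

Calling tidy_html(open('public_html/index.html').read()) would then play the role of the pretty_printed value in the quoted example, though the output layout is this sketch's own and not what PrettyPrint produces.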
--
- Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.