Parsing complex web pages safely with htmllib.HTMLParser
Andy Bulka
abulka at netspace.net.au
Fri Jan 25 02:33:21 EST 2002
Preprocessing the HTML page with tidy does the trick. The Python
interface to tidy, mxTidy
(http://www.lemburg.com/files/python/mxTidy.html), was all I needed to
install. It installs into \python21\mx, and you use it like this:
    from mx.Tidy import tidy
    # htmltext holds the raw page source; index 2 is the cleaned HTML
    cleanhtmltext = tidy(htmltext)[2]
The tidy function returns a tuple: positions 0 and 1 hold the error
and warning counts, and position 2 holds the cleaned HTML.
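
If you also want the counts rather than just the cleaned text, the
tuple can be unpacked by position. The following is only a rough
sketch: clean_html and fake_tidy are names I made up for illustration.
fake_tidy is a stand-in that returns a tuple laid out as described
above, so the snippet runs without mxTidy installed. With mxTidy you
would pass the real tidy function in its place.

```python
def clean_html(htmltext, tidy_func):
    """Run tidy_func on htmltext and return (cleaned, nerrors, nwarnings).

    tidy_func is assumed to return a tuple whose positions 0 and 1 are
    the error and warning counts and whose position 2 is the clean HTML.
    """
    result = tidy_func(htmltext)
    # Index by position so any extra trailing items are simply ignored.
    nerrors, nwarnings, cleaned = result[0], result[1], result[2]
    return cleaned, nerrors, nwarnings


def fake_tidy(htmltext):
    # Hypothetical stand-in for mx.Tidy.tidy, for demonstration only.
    # It does not really clean anything; it returns a canned result.
    return (0, 1, "<html><body><p>hello</p></body></html>")


cleaned, nerrors, nwarnings = clean_html("<p>hello", fake_tidy)
print("errors:", nerrors, "warnings:", nwarnings)
print(cleaned)
```

Checking the error count before handing the text to the parser gives
you a chance to skip or log pages that tidy could not repair.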
Thanks for the helpful responses!
Andy Bulka
www.atug.com/andypatterns