Parsing complex web pages safely with htmllib.HTMLParser
joonas at olen.to
Thu Jan 24 12:43:55 CET 2002
abulka at netspace.net.au (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at posting.google.com>...
> The following snippet of code parses a web page on my disk and prints
> the URLs found in it. It works for everything I've tried except the
> page I really want, which lists the weather in my state. Instead I get
> an exception:
> SGMLParseError: unexpected char in declaration: '<'
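> import htmllib
> import formatter
> parser = htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('page.html').read())  # 'page.html': placeholder for the saved page
> parser.close()
> print parser.anchorlist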
> MY QUESTION: Is htmllib.HTMLParser likely to fail here and there on
> web pages, complex or otherwise? Loading the above page into FrontPage
> and saving it out again does nothing to fix the problem, so it's
> probably OK HTML. What do I do about this? Ask my Government Bureau
> of Meteorology to change the way they do their web pages?! Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...
> Or is this just a bug in htmllib.HTMLParser?
Use HTML Tidy to clean up the page and then parse it with HTMLParser.
Tidy project page: http://tidy.sourceforge.net/
Python interface to tidy: http://www.lemburg.com/files/python/mxTidy.html
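
A minimal sketch of the "tidy first, then parse" idea, written for current
Python rather than the htmllib of this thread (htmllib was removed in
Python 3; html.parser.HTMLParser is used here as a stand-in). It assumes the
tidy command-line program may be on the PATH and falls back to the raw markup
if it is not. The names AnchorCollector, tidy_html and extract_links are
made up for illustration, and the mxTidy interface is not used.

```python
import shutil
import subprocess
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags, similar to htmllib's anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def tidy_html(markup):
    """Clean markup with the tidy CLI if installed; otherwise return it as-is."""
    exe = shutil.which("tidy")
    if exe is None:
        return markup  # assumption: degrade gracefully when tidy is missing
    result = subprocess.run(
        [exe, "-q", "--show-warnings", "no", "--force-output", "yes"],
        input=markup,
        capture_output=True,
        text=True,
    )
    # tidy exits non-zero when it reports warnings or errors, so the exit
    # status is not treated as failure here; keep the input if nothing came back.
    return result.stdout or markup


def extract_links(markup):
    """Tidy the markup, then parse it and return the list of link targets."""
    parser = AnchorCollector()
    parser.feed(tidy_html(markup))
    parser.close()
    return parser.anchorlist


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/">one<a href="two.html">two'
    print(extract_links(sample))
```

Running tidy as a separate step keeps the parser code simple: the parser only
ever sees markup that has already been repaired, so there is no need to wrap
every feed() call in exception handling.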