Parsing complex web pages safely with htmllib.HTMLParser

Joonas Paalasmaa joonas at
Thu Jan 24 12:43:55 CET 2002

abulka at (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at>...
> The following snippet of code parses a web page on my disk and prints
> the URLs found in it.  It works for everything I've tried, but not for
> the page I really want, which lists the weather in my state.  Instead
> I get an exception:
>
>   SGMLParseError: unexpected char in declaration: '<'
>
> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist
>
> MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there, on
> complex or other web pages?  Loading the above page into Frontpage
> and saving it out again does nothing to fix the problem, so it's
> probably OK HTML.  What do I do about this?  Ask my Government Bureau
> of Meteorology to change the way they do their web pages?!  Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...
> Or is this just a bug in htmllib.HTMLParser?

Use HTML Tidy to clean up the page and then parse it with HTMLParser.

Tidy project page:
Python interface to tidy:
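
A rough sketch of the tidy-then-parse idea, in case it helps.  Several
things here are my assumptions rather than anything from the original
post: it calls a `tidy` executable on the PATH through subprocess
instead of a Python binding, it passes the input through unchanged if
no such executable is found, and it uses html.parser.HTMLParser (the
Python 3 module) as a stand-in for htmllib.HTMLParser.  The names
AnchorCollector, tidy_html and extract_links are made up for
illustration.

```python
import shutil
import subprocess
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, similar to htmllib's anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def tidy_html(text, exe="tidy"):
    """Clean up markup with the tidy command-line tool, if it is installed.

    Assumption: an HTML Tidy executable named `exe` is on the PATH.
    If it is not, the text is returned unchanged.
    """
    path = shutil.which(exe)
    if path is None:
        return text
    result = subprocess.run(
        # -q: quiet, -utf8: UTF-8 in and out,
        # --force-output yes: emit output even when tidy reports errors.
        [path, "-q", "-utf8", "--force-output", "yes"],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    cleaned = result.stdout.decode("utf-8", errors="replace")
    # Fall back to the original text if tidy produced nothing.
    return cleaned or text


def extract_links(text):
    """Return the list of href values found in the given HTML text."""
    parser = AnchorCollector()
    parser.feed(text)
    parser.close()
    return parser.anchorlist
```

With those in place, something like
extract_links(tidy_html(open('ATROUBLESOMECOMPLEXPAGE.htm').read()))
should give the same kind of list as parser.anchorlist in the quoted
snippet.  Catching the parser's exception around the feed() call is
still a reasonable safety net for pages that tidy cannot repair.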
