Parsing complex web pages safely with htmllib.HTMLParser

Paul Boddie paul at boddie.net
Thu Jan 24 11:34:01 CET 2002


abulka at netspace.net.au (Andy Bulka) wrote in message news:<13dc97b8.0201232152.66d56faa at posting.google.com>...
> The following snippet of code parses a web page on my disk and prints
> the urls found in it.  It works for everything I've tried but not the
> page I really want
>   http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state.  Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'

This may well be caused by the presence of a "script" element.
Currently, the various standard library HTML parsers don't seem to
deal with "script" elements very well, especially when they contain
"<" characters in the enclosed code. What you can do is to preprocess
the page text using a function which introduces "CDATA" notation
within such elements - something like this seems to work (at least in
conjunction with the xml.dom.ext.reader interface to these parsers):

  <script ...><![CDATA[
    ...
  ]]></script>
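A sketch of such a preprocessing function is below; the protect_scripts name and the regular expression are my own invention, and the approach assumes the scripts do not themselves contain a literal "</script>" inside their string constants:

```python
import re

# Illustrative helper (not part of the standard library): wrap the body
# of each "script" element in a CDATA section so that "<" characters in
# the enclosed code no longer confuse the parser.
def protect_scripts(page):
    def wrap(match):
        # group(1) is the opening tag, group(2) the script body,
        # group(3) the closing tag
        return "%s<![CDATA[%s]]>%s" % match.group(1, 2, 3)
    # Case-insensitive, dot-matches-newline, non-greedy body match so
    # that each script element is wrapped separately
    return re.sub(r"(?is)(<script[^>]*>)(.*?)(</script>)", wrap, page)
```

You would then feed the result of protect_scripts(page_text) to the parser instead of the raw page.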

> import htmllib
> import formatter
> parser=htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(open('ATROUBLESOMECOMPLEXPAGE.htm').read())
> parser.close()
> print parser.anchorlist

I tend to use sgmllib.SGMLParser and I've been working on a Web page
which describes it in use. I think that the "Dive Into Python"
(http://www.diveintopython.org) site also covers SGMLParser.
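As an illustration, an SGMLParser subclass that collects anchor targets might look like the following; the AnchorLister name is my own, and a fallback base class is included for Python versions that lack sgmllib:

```python
try:
    from sgmllib import SGMLParser as ParserBase      # Python 2 standard library
    _HAVE_SGMLLIB = True
except ImportError:
    from html.parser import HTMLParser as ParserBase  # later Pythons dropped sgmllib
    _HAVE_SGMLLIB = False

class AnchorLister(ParserBase):
    # Collects the href values of all "a" elements seen during parsing.
    def __init__(self):
        ParserBase.__init__(self)
        self.anchors = []

    def _remember(self, attrs):
        # attrs is a list of (name, value) pairs in both parsers
        for name, value in attrs:
            if name == "href":
                self.anchors.append(value)

    if _HAVE_SGMLLIB:
        # sgmllib dispatches to start_<tagname> methods
        def start_a(self, attrs):
            self._remember(attrs)
    else:
        # html.parser uses a single generic start-tag hook instead
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._remember(attrs)
```

After feed() and close(), the collected URLs are available in the anchors attribute, much like htmllib's anchorlist.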

> MY QUESTION:  Is htmllib.HTMLParser likely to fail here and there, on
> complex or otherwise unusual web pages?  Loading the above page into
> Frontpage and saving it out again does nothing to fix the problem - so
> it's probably OK HTML.  What do I do about this - ask my Government Bureau
> of Meteorology to change the way they do their web pages ?!! Of course
> I can catch the exception, but I REALLY *want* the info on that
> weather page...

You probably won't have much luck asking people to change their pages,
especially if they are dynamic pages, produced by some dodgy
templating language. Another hint: if you still can't make any sense
out of a broken Web page, introduce mxTidy into your "processing
pipeline"...

  http://www.lemburg.com/files/python/mxTidy.html

Of course, what we all really need is for XHTML to come into
widespread use, so that we can consign broken HTML to history.

Paul



More information about the Python-list mailing list