Parsing complex web pages safely with htmllib.HTMLParser

Mon Jan 28 16:28:46 EST 2002

abulka at netspace.net.au (Andy Bulka) writes:

> The following snippet of code parses a web page on my disk and prints
> the urls found in it.  It works for everything I've tried but not the
> page I really want
>   http://www.bom.gov.au/cgi-bin/wrap_fwo.pl?IDV60029.html
> which lists the weather in my state.  Instead I get an exception
> SGMLParseError: unexpected char in declaration: '<'

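For just grepping URLs out of broken HTML, a regular expression
might be more fruitful than a strict parser, since it never has
to make sense of the malformed declarations that trip up
SGMLParser. To get past all that really bad HTML one must craft
the regular expression carefully. Something like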

<\s*a\s[^>]*?href\s*=\s*"?(.+?)[">\s]
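
As a rough sketch of how this might be used from Python with the re
module: the pattern below is a loosened variant of the expression
above, not the exact one. It assumes other attributes may appear
before href and that values may be double-quoted, single-quoted, or
unquoted. The names HREF_RE, extract_urls and sample are made up for
illustration.

```python
import re

# Assumed variant of the pattern above: "[^>]*?" skips any attributes
# between "<a" and "href", and the capture group stops at the first
# quote, ">" or whitespace so unquoted values work too.
HREF_RE = re.compile(
    r'<\s*a\s[^>]*?href\s*=\s*["\']?([^"\'>\s]+)',
    re.IGNORECASE,
)

def extract_urls(html):
    """Return the href values of anchor tags, in document order."""
    return HREF_RE.findall(html)

# Hypothetical input with mixed case, an unquoted value and stray "<<".
sample = (
    '<A class=x HREF="http://example.com/a">a</a> '
    '<a href=b.html>b</a> <<broken'
)

if __name__ == "__main__":
    for url in extract_urls(sample):
        print(url)
```

Since the expression never tries to interpret declarations or stray
"<" characters, the junk that makes the parser raise is simply
skipped. The price is that it can also pick up anchors inside
comments or scripts, so the results are worth a sanity check.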

drt

-- 
teenage mutant ninja hero coders from da c0re - http://c0re.jp/
me - http://koeln.ccc.de/~drt/