HTMLparsing abnormal html pages

Fri Mar 16 19:11:32 EST 2001

In article <B6D81330.4FDA%f8dy at diveintopython.org>,
Mark Pilgrim  <f8dy at diveintopython.org> wrote:
>in article 98pvp1$15t$1 at news.netmar.com, asle at spam.com
>wrote on 3/15/01 3:50 AM:
>>
>> Consider the small program below. Running it shows that the
>> HTMLParser truncates URLs in the HTML page.
>> [...]
>> import htmllib
>> [...]
>> One solution is of course to preprocess the whole HTML page,
>> replacing invalid URLs with valid URLs (using regex??), however I have
>
>Don't use htmllib, use sgmllib.  It does exactly what you want: uses regular
>expressions to pull out the tags and attributes of potentially messy HTML,
>then calls methods on itself based on the tags.  You can subclass it and
>provide methods for each tag.

That doesn't work for truly malformed HTML.  For example, sgmllib will
still produce incorrect results for something like this:

<a href="asdfasdfasdf>malformed link</a>
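
For one specific, known kind of breakage like the line above, the regex
preprocessing idea from the quoted message could be sketched roughly as
follows. This is only an illustration: the function name
close_unterminated_quotes and the pattern are made up for this example and
are not part of htmllib or sgmllib, and the sketch assumes the only defect
is a quoted attribute value whose closing quote is missing before the '>'.

```python
import re

# Hypothetical pattern: an attribute name, '=', an opening quote, then a run
# of characters containing no quote or angle bracket, ending directly at '>'.
# In other words, the value was opened with a quote but never closed.
UNCLOSED_ATTR = re.compile(r'''(\w+)=(["'])([^"'<>]*)>''')


def close_unterminated_quotes(html):
    """Insert the missing closing quote before '>' (illustrative only)."""
    # \2 is the opening quote character, reused as the closing quote.
    return UNCLOSED_ATTR.sub(r'\1=\2\3\2>', html)


if __name__ == "__main__":
    broken = '<a href="asdfasdfasdf>malformed link</a>'
    print(close_unterminated_quotes(broken))
```

A fix-up like this only handles the one pattern it was written for. It
will guess wrong when a quoted value legitimately contains '>', so it is
not a general answer to malformed HTML.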
-- 
                      --- Aahz  <*>  (Copyright 2001 by aahz at pobox.com)

Androgynous poly kinky vanilla queer het Pythonista   http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6

Three sins: BJ, B&J, B&J