HTMLparsing abnormal html pages

Mark Pilgrim f8dy at diveintopython.org
Fri Mar 16 19:00:00 EST 2001


In article 98pvp1$15t$1 at news.netmar.com, asle at spam.com
wrote on 3/15/01 3:50 AM:

> Consider the small program below. Running it will show that the
> HTMLparser is truncating URLs in the HTML page.
> [...]
> import htmllib
> [...]
> One solution is of course to preprocess the whole HTML page,
> replacing invalid URLs with valid URLs (using regex??), however I have

Don't use htmllib; use sgmllib.  It does exactly what you want: it uses
regular expressions to pull the tags and attributes out of potentially messy
HTML, then calls methods on itself based on the tag names (start_a for an
<a> tag, for example).  You can subclass it and provide a method for each
tag you care about.

from sgmllib import SGMLParser

class MessyURLParser(SGMLParser):
    def reset(self):
        # reset() is called by SGMLParser.__init__, so self.urls exists
        # before any data is fed to the parser
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # at this point, attrs is a list of tuples (attrname, attrvalue)
        # attrname is converted to lowercase by SGMLParser
        # so for a tag <a HREF=index.html>, attrs would be
        # [('href', 'index.html')]
        hrefs = [v for k, v in attrs if k == 'href']
        if hrefs:
            self.urls.append(hrefs[0])
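
For anyone reading this on a newer Python: sgmllib and htmllib were both
removed in Python 3.0, so MessyURLParser as written runs only on Python 2.
Below is a minimal sketch of the same idea using html.parser from the
standard library instead. This is a swapped-in module, not sgmllib itself.
The names MessyURLCollector and extract_urls and the sample markup are made
up for illustration. The main difference is that html.parser calls one
handle_starttag method for every tag rather than a separate start_a method,
so the tag name has to be checked by hand.

```python
from html.parser import HTMLParser


class MessyURLCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical name, for illustration)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) tuples, much like the attrs passed to start_a
        if tag != "a":
            return
        # value is None for an attribute written without "=value"
        hrefs = [v for k, v in attrs if k == "href" and v is not None]
        if hrefs:
            self.urls.append(hrefs[0])


def extract_urls(html_text):
    """Feed a whole page to the parser and return the collected URLs."""
    parser = MessyURLCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Sample input (an assumption): unquoted and upper-case attributes,
# a URL containing a space, and an anchor with no href at all.
sample = '<A HREF=index.html>home</A> <a href="/docs/a b.html">docs</a> <a name=top>'
print(extract_urls(sample))
```

Usage is the same as with the sgmllib version: feed() the page text, close()
the parser, then read the urls list.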

This is such a useful technique, I'm writing an entire chapter on it in my
book, "Dive Into Python".  The code example in the book is more complicated
than what you're looking for (it's designed to both consume and produce
HTML, whereas you want to consume HTML and produce a list), and the chapter
is barely started, but all the code is there along with a more complete
explanation of how sgmllib works:
  http://diveintopython.org/dialect_divein.html

Hope this helps.

-M
You're smart; why haven't you learned Python yet?
http://diveintopython.org/





