[Tutor] Help with Parsing HTML files

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Sat, 4 Aug 2001 12:39:21 -0700 (PDT)

On Fri, 3 Aug 2001, Sean 'Shaleh' Perry wrote:

> > What I don't see is how the handle_image function/method looks for
> > images and I need to learn how to use this in order to modify it for
> > my own dark purposes! Please help.

Maybe an example will help.  Here's a small script that, given the URL of
a web page, tries to pull out all the image names.  (Rob, here's another
useless Python script.  *grin*)

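This example will use htmllib to help us "parse" and hunt down IMG tags.
I don't think we need to explicitly rewrite handle_image(), because it
doesn't look for images itself.  The parser calls a method named after
each tag it runs into, and htmllib's stock do_img() is what hands the
tag's attributes over to handle_image().  For HTML elements that have
start and end tags, we'd define "start_nameofsometag()" and
"end_nameofsometag()" methods.  However, since an IMG tag stands alone,
we'll write a do_img() method instead.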

import htmllib
import formatter
import sys
import urllib

class ImagePuller(htmllib.HTMLParser):
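    def __init__(self):
        # Initialize the base parser; NullFormatter discards rendered text.
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.list_of_images = []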

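    def do_img(self, attributes):
        # Called once per IMG tag with a list of (name, value) pairs.
        for name, value in attributes:
            if name == 'src':
                new_image = value
                self.list_of_images.append(new_image)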

    def getImageList(self):
        return self.list_of_images

if __name__ == '__main__':
    url = sys.argv[1]
    url_contents = urllib.urlopen(url).read()
    puller = ImagePuller()
    puller.feed(url_contents)    # run the page through the parser
    puller.close()
    print puller.getImageList()
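
One caveat if you're reading this on a newer interpreter: htmllib was
dropped in Python 3.  A rough sketch of the same idea with the standard
html.parser module might look like the following.  It's a swap-in, not
the script above: html.parser has no do_img() hook, so this version
checks the tag name inside handle_starttag() instead.  The names
ImageCollector and find_images, and the sample HTML, are made up for
illustration.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every IMG tag (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.image_list = []

    def handle_starttag(self, tag, attrs):
        # Every start tag arrives here; tag and attribute names are
        # already lowercased by the parser.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.image_list.append(value)


def find_images(html_text):
    """Return the list of image sources found in html_text."""
    collector = ImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.image_list


if __name__ == '__main__':
    # Made-up sample page, standing in for the downloaded URL contents.
    sample = '<p><img src="a.png" alt="A"> and <IMG SRC="b.gif"></p>'
    print(find_images(sample))
```

To use it on a live page you'd still have to fetch and decode the HTML
first (urllib.request.urlopen() in Python 3) and hand the resulting
string to find_images().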

For more information about this, take a look at: