[Tutor] Help with Parsing HTML files

Sean 'Shaleh' Perry shalehperry@home.com
Fri, 03 Aug 2001 23:55:00 -0700 (PDT)


> 
> What I don't see is how the handle_image function/method looks for images and
> I need to learn how to use this in order to modify it for my own dark 
> purposes! Please help.
> 

OK, here is the comment from sgmllib.py (a tip: try reading the Python modules
themselves, they are often easy to follow):

# SGML parser base class -- find tags and call handler functions.
# Usage: p = SGMLParser(); p.feed(data); ...; p.close().
# The dtd is defined by deriving a class which defines methods
# with special names to handle tags: start_foo and end_foo to handle
# <foo> and </foo>, respectively, or do_foo to handle <foo> by itself.
# (Tags are converted to lower case for this purpose.)  The data
# between tags is passed to the parser by calling self.handle_data()
# with some data as argument (the data may be split up in arbitrary
# chunks).  Entity references are passed by calling
# self.handle_entityref() with the entity reference as argument.

Then a look at htmllib.py shows:

def do_img(self, attrs):
    align = ''
    alt = '(image)'
    ismap = ''
    src = ''
    width = 0
    height = 0
    for attrname, value in attrs:
        if attrname == 'align':
            align = value
        if attrname == 'alt':
            alt = value
        if attrname == 'ismap':
            ismap = value
        if attrname == 'src':
            src = value
        if attrname == 'width':
            try: width = string.atoi(value)
            except: pass
        if attrname == 'height':
            try: height = string.atoi(value)
            except: pass
    self.handle_image(src, alt, ismap, align, width, height)
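
If you want to play with that attribute loop on its own, here is a rough
Python 3 sketch of the same idea. The helper name parse_img_attrs is made up
for illustration (it is not part of htmllib), and int() stands in for
string.atoi, which no longer exists in Python 3:

```python
def parse_img_attrs(attrs):
    """Hypothetical helper: apply do_img-style defaults to (name, value) pairs."""
    # Same defaults as the do_img method quoted above.
    values = {'src': '', 'alt': '(image)', 'ismap': '',
              'align': '', 'width': 0, 'height': 0}
    for name, value in attrs:
        if name in ('width', 'height'):
            try:
                values[name] = int(value)   # int() replaces string.atoi
            except (TypeError, ValueError):
                pass                        # keep the default 0, e.g. for "50%"
        elif name in values:
            values[name] = value
    return values


result = parse_img_attrs([('src', 'myimage.png'), ('width', '120'), ('height', '50%')])
print(result)
```

Attributes that do_img does not know about are simply ignored, just as in the
original loop.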

So, the class ImgFinder shown in your code implements handle_image, overriding
the one in htmllib's HTMLParser.

The code path looks something like this (in a mix of pseudocode and Python):

parse html
found tag
parse options in tag
if class defines start_tag: call start_tag(options)
elif class defines do_tag: call do_tag(options)
else: unknown_tag(options)

When the parser encounters <img src="myimage.png">, it looks for start_img,
then do_img; if neither is found, the unknown-tag handler (unknown_starttag in
sgmllib) is called.  Since htmllib defines do_img, that function is called.
When do_img then calls self.handle_image, the version in ImgFinder is used
instead of the one in the parent class.
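
To see the whole path in one place, here is a minimal sketch of that dispatch.
Note the assumptions: sgmllib and htmllib were removed in Python 3, so this
rebuilds the start_/do_ lookup on top of html.parser instead, and the classes
DispatchingParser and ImgFinder below are illustrative stand-ins, not the
library code or your actual class:

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical stand-in for the sgmllib/htmllib tag dispatch."""

    def handle_starttag(self, tag, attrs):
        # html.parser has already lower-cased the tag and attribute names.
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            method(attrs)

    def unknown_starttag(self, tag, attrs):
        pass  # default: ignore tags that have no handler

    def do_img(self, attrs):
        # Simplified version of htmllib's do_img: only src and alt.
        attrs = dict(attrs)
        self.handle_image(attrs.get('src', ''), attrs.get('alt', '(image)'))

    def handle_image(self, src, alt):
        pass  # the parent class does nothing with images


class ImgFinder(DispatchingParser):
    """Overrides handle_image to collect image sources."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_image(self, src, alt):
        self.images.append(src)


finder = ImgFinder()
finder.feed('<p>Hi</p><IMG SRC="myimage.png" alt="me"><img src="b.gif">')
finder.close()
print(finder.images)
```

The <p> tag has neither start_p nor do_p, so it falls through to
unknown_starttag; both <img> tags reach do_img, which calls the handle_image
defined in ImgFinder.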

Hope that helps.