[Tutor] Help with Parsing HTML files
Sean 'Shaleh' Perry
shalehperry@home.com
Fri, 03 Aug 2001 23:55:00 -0700 (PDT)
>
> What I don't see is how the handle_image function/method looks for images and
> I need to learn how to use this in order to modify it for my own dark
> purposes! Please help.
>
OK, here is the comment from sgmllib.py (a tip: try reading the Python modules
themselves, they are often easy to follow):
# SGML parser base class -- find tags and call handler functions.
# Usage: p = SGMLParser(); p.feed(data); ...; p.close().
# The dtd is defined by deriving a class which defines methods
# with special names to handle tags: start_foo and end_foo to handle
# <foo> and </foo>, respectively, or do_foo to handle <foo> by itself.
# (Tags are converted to lower case for this purpose.) The data
# between tags is passed to the parser by calling self.handle_data()
# with some data as argument (the data may be split up in arbitrary
# chunks). Entity references are passed by calling
# self.handle_entityref() with the entity reference as argument.
Then a look at htmllib.py shows:
    def do_img(self, attrs):
        align = ''
        alt = '(image)'
        ismap = ''
        src = ''
        width = 0
        height = 0
        for attrname, value in attrs:
            if attrname == 'align':
                align = value
            if attrname == 'alt':
                alt = value
            if attrname == 'ismap':
                ismap = value
            if attrname == 'src':
                src = value
            if attrname == 'width':
                try: width = string.atoi(value)
                except: pass
            if attrname == 'height':
                try: height = string.atoi(value)
                except: pass
        self.handle_image(src, alt, ismap, align, width, height)
So, the class ImgFinder shown in your code implements handle_image, overriding
the one in htmllib's HTMLParser.
The code path looks something like this (in a pseudocode and Python mix):
    parse html
    found tag
        parse options in tag
        if class defines start_tag: call start_tag(options)
        elif class defines do_tag: call do_tag(options)
        else: call unknown_starttag(tag, options)
When the parser encounters <img src="myimage.png"> it checks for start_img,
then do_img, and if neither is found it calls unknown_starttag. Since htmllib
defines do_img, that method is called. do_img collects the attributes and then
calls self.handle_image. Because self is an ImgFinder instance, the
handle_image from ImgFinder is used instead of the one from the parent class.
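If it helps to see the whole chain in one place, here is a small toy version
of it. This is only a sketch, not the real library code: MiniSGMLParser,
MiniHTMLParser and handle_tag are names I made up, handle_image is cut down to
two arguments, and this ImgFinder is my guess at what yours does (collecting
the src values). It does not import sgmllib or htmllib at all.

```python
# Toy stand-in for the sgmllib/htmllib dispatch described above.
# MiniSGMLParser, MiniHTMLParser and handle_tag are hypothetical names;
# none of this is the real library code.

class MiniSGMLParser:
    """Looks up a handler method by name for each tag it is given."""

    def handle_tag(self, tag, attrs):
        tag = tag.lower()  # tags are lower-cased before the lookup
        # Try start_<tag> first, then do_<tag>.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                return method(attrs)
        # Neither handler exists, so fall back to the catch-all.
        return self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        pass  # default: ignore tags that have no handler


class MiniHTMLParser(MiniSGMLParser):
    """Plays the role of htmllib's HTMLParser: it defines do_img."""

    def do_img(self, attrs):
        src = ""
        alt = "(image)"
        for name, value in attrs:
            if name == "src":
                src = value
            elif name == "alt":
                alt = value
        # Looked up on self, so a subclass override is the one that runs.
        self.handle_image(src, alt)

    def handle_image(self, src, alt):
        pass  # in this sketch the parent does nothing with the image


class ImgFinder(MiniHTMLParser):
    """Overrides handle_image to remember every image source seen."""

    def __init__(self):
        self.images = []

    def handle_image(self, src, alt):
        self.images.append(src)


finder = ImgFinder()
finder.handle_tag("IMG", [("src", "myimage.png"), ("alt", "a picture")])
finder.handle_tag("blink", [])  # no start_blink or do_blink: falls through
print(finder.images)  # prints ['myimage.png']
```

Notice that ImgFinder never mentions img tags. It only overrides handle_image,
and the inherited do_img does the attribute parsing and then calls whatever
handle_image the instance has.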
Hope that helps.