[Tutor] Help with Parsing HTML files [html/OOP]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Mon, 6 Aug 2001 01:43:14 -0700 (PDT)
On Mon, 6 Aug 2001, Charlie Clark wrote:
> >###
> >import htmllib
> >import formatter
> >import sys
> >import urllib
> >
> >class ImagePuller(htmllib.HTMLParser):
> >    def __init__(self):
> >        htmllib.HTMLParser.__init__(self,
> >                                    formatter.NullFormatter())
> >        self.list_of_images = []
> >
> >    def do_img(self, attributes):
> >        for name, value in attributes:
> >            if name == 'src':
> >                new_image = value
> >                self.list_of_images.append(new_image)
> >
> >    def getImageList(self):
> >        return self.list_of_images
> >
> >if __name__ == '__main__':
> >    url = sys.argv[1]
> >    url_contents = urllib.urlopen(url).read()
> >    puller = ImagePuller()
> >    puller.feed(url_contents)
> >    print puller.getImageList()
> >###
> >
> >
> >For more information about this, take a look at:
> >
> > http://python.org/doc/current/lib/module-sgmllib.html
> I've looked at this but as it doesn't come with examples I'm stumped.
Don't worry too much yet: the documentation for htmllib assumes that you
already know about object-oriented programming, as well as event-driven
programming. If both are new topics, then this might take a little while
to figure out.
> If do_img is a method of ImagePuller, when is it called and how does it look
> for images? It seems to do this via tag analysis where a tag contains "src"
Yes, there's some analysis being done behind the scenes in sgmllib, the
module that htmllib.HTMLParser is built on. The do_img() method actually
gets called during the feed()ing process:
    puller.feed(url_contents)
feed() is something that's defined in sgmllib. Whenever it encounters a
new tag, it dynamically tries to call an appropriately named method. If
it runs into a P tag, for example, it'll try calling "start_p", and if
that doesn't exist, "do_p". Let's take a small look at sgmllib for a
second:
###
def finish_starttag(self, tag, attrs):
    try:
        method = getattr(self, 'start_' + tag)
    except AttributeError:
        try:
            method = getattr(self, 'do_' + tag)
        except AttributeError:
            self.unknown_starttag(tag, attrs)
            return -1
        else:
            self.handle_starttag(tag, method, attrs)
            return 0
###
Here, sgmllib's parser calls methods "permissively": it doesn't check
ahead of time whether a handler exists. It just goes ahead and asks for
start_sometagname, then do_sometagname, and if neither one is there, it
backs away and calls unknown_starttag() instead. No harm done.
By subclassing and defining our own start_sometagname/do_sometagname for
the tags that we're interested in, we allow those permissive method calls
to go through.
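To make that lookup concrete, here's a minimal, self-contained sketch of
the same getattr-based dispatch. It doesn't use sgmllib at all:
TinyDispatcher, SrcCollector and dispatch() are names I made up for
illustration, the return values are simplified, and the "parser" is just
handed (tag, attributes) pairs directly.

```python
class TinyDispatcher:
    """Toy stand-in for the start_/do_ lookup (not sgmllib itself)."""

    def __init__(self):
        self.unknown = []

    def dispatch(self, tag, attrs):
        # Look for start_<tag> first, then fall back to do_<tag>.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            method = getattr(self, 'do_' + tag, None)
        if method is None:
            # No handler defined for this tag: back away quietly.
            self.unknown_starttag(tag, attrs)
            return -1
        method(attrs)
        return 0

    def unknown_starttag(self, tag, attrs):
        # Remember which tags nobody claimed.
        self.unknown.append(tag)


class SrcCollector(TinyDispatcher):
    def __init__(self):
        TinyDispatcher.__init__(self)
        self.images = []

    def do_img(self, attrs):
        # Only reached for "img" tags, because of the method's name.
        for name, value in attrs:
            if name == 'src':
                self.images.append(value)


collector = SrcCollector()
collector.dispatch('img', [('src', 'logo.gif'), ('alt', 'logo')])
collector.dispatch('script', [('src', 'menu.js')])
```

Notice that the script tag's src never reaches do_img(): it's the name
of the method, not the attribute, that decides which tags get handled.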
> for images? It seems to do this via tag analysis where a tag contains
> "src" which won't just be images but any inline element (object,
> script, frame, etc.) and how do I overwrite or modify it to work with
> specific tags together with specific content. The pages I'm pulling in
do_img() only ever gets called for IMG tags, so the src attributes of
other elements won't sneak into the list. To handle another tag, we add
another method named after that tag. For example, to get the parser to
pay attention to bold tags and underlines, we can do something like this:
###
class EmphasisGlancer(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self,
                                    formatter.NullFormatter())
        # Flags that remember which tags we're currently inside of.
        self.in_bold = 0
        self.in_underline = 0
    def start_b(self, attrs):
        print "Hey, I see a bold tag!"
        self.in_bold = 1
    def end_b(self):
        self.in_bold = 0
    def start_u(self, attrs):
        print "Hey, I see some underscored text!"
        self.in_underline = 1
    def end_u(self):
        self.in_underline = 0
    def start_blink(self, attrs):
        print "Hey, this is some heinously blinking text... *grrrr*"
    def handle_data(self, data):
        # Called with the text found between tags.
        if self.in_bold:
            print "BOLD:", data
        elif self.in_underline:
            print "UNDERLINE:", data
###
I have not tested this code yet, but hopefully I haven't made too many
typos. If you play around with it, you might find that sgmllib/htmllib
isn't as bad as you think.
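If the flag trick in handle_data() looks mysterious, here's a stripped
down sketch of just that part. It's an assumption-laden toy: FlagTracker
is a made-up class that doesn't inherit from htmllib, and I call the
handler methods by hand in the order I'd expect a parser to call them
for the text "plain <b>loud</b> plain again".

```python
class FlagTracker:
    """Toy handler set, driven by hand instead of by a real parser."""

    def __init__(self):
        self.in_bold = 0
        self.bold_text = []

    def start_b(self, attrs):
        # Opening tag seen: raise the flag.
        self.in_bold = 1

    def end_b(self):
        # Closing tag seen: lower the flag again.
        self.in_bold = 0

    def handle_data(self, data):
        # Keep the text only while the flag is raised.
        if self.in_bold:
            self.bold_text.append(data)


# Hand-written sequence of events for: plain <b>loud</b> plain again
tracker = FlagTracker()
tracker.handle_data('plain ')
tracker.start_b([])
tracker.handle_data('loud')
tracker.end_b()
tracker.handle_data(' plain again')
```

Only the text that arrives between start_b() and end_b() ends up in
bold_text; everything else is ignored.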
I should stop here, since this message is getting long. I'll try
answering your second question tomorrow. Talk to you later!