[Tutor] Help with Parsing HTML files [html/OOP]

Mon, 6 Aug 2001 01:43:14 -0700 (PDT)

On Mon, 6 Aug 2001, Charlie Clark wrote:

> >###
> >import htmllib
> >import formatter
> >import sys
> >import urllib
> >
> >class ImagePuller(htmllib.HTMLParser):
> >    def __init__(self):
> >        htmllib.HTMLParser.__init__(self,
> >                                    formatter.NullFormatter())
> >        self.list_of_images = []
> >
> >    def do_img(self, attributes):
> >        for name, value in attributes:
> >            if name == 'src':
> >                new_image = value
> >                self.list_of_images.append(new_image)
> >
> >    def getImageList(self):
> >        return self.list_of_images
> >
> >if __name__ == '__main__':
> >    url = sys.argv[1]
> >    url_contents = urllib.urlopen(url).read()
> >    puller = ImagePuller()
> >    puller.feed(url_contents)
> >    print puller.getImageList()
> >###
> >
> >
> >For more information about this, take a look at:
> >
> >    http://python.org/doc/current/lib/module-sgmllib.html
> I've looked at this but as it doesn't come with examples I'm stumped.

Don't worry too much yet: the documentation for htmllib assumes that you
already know about OOP-style programming, as well as event-driven
programming.  If both are new topics, then this might take a little while
to figure out.

> If do_img is a method of ImagePuller, when is it called and how does it look 
> for images?  It seems to do this via tag analysis where a tag contains "src" 

Yes, there's some analysis being done behind the scenes in sgmllib.  The
do_img() method actually gets called during the feed()ing process:

    puller.feed(url_contents)

feed() is something that's defined in sgmllib.  Whenever it encounters a
new start tag, it dynamically tries to call an appropriately named method.
If it runs into a P tag, for example, it'll try calling "start_p" and,
failing that, "do_p".  Let's take a small look at sgmllib for a second:

###
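    def finish_starttag(self, tag, attrs):
        try:
            # First choice: a start_<tag> method, for tags with an end tag
            method = getattr(self, 'start_' + tag)
        except AttributeError:
            try:
                # Second choice: a do_<tag> method, for standalone tags
                method = getattr(self, 'do_' + tag)
            except AttributeError:
                # No handler is defined for this tag at all
                self.unknown_starttag(tag, attrs)
                return -1
            else:
                self.handle_starttag(tag, method, attrs)
                return 0
        else:
            self.stack.append(tag)
            self.handle_starttag(tag, method, attrs)
            return 1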
###

Here, sgmllib's parser tries to call methods by "permission": it just goes
ahead and tries to look up start_sometagname(), then do_sometagname(), and
if neither exists, it backs away by calling unknown_starttag() instead.
No harm done.

By subclassing and defining our own start_sometagname/do_sometagname for
the tags that we're interested in, we allow those permissive method calls
to go through.
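
To see the lookup in isolation, here is a minimal sketch in present-day
Python 3 (where sgmllib is no longer in the standard library).
TinyDispatcher, ImageDispatcher, and dispatch() are hypothetical names made
up for illustration; this is not sgmllib's actual code, just the same
getattr() idea boiled down:

```python
class TinyDispatcher:
    """Toy stand-in for the parser's tag dispatch (hypothetical, not sgmllib)."""

    def __init__(self):
        self.log = []

    def dispatch(self, tag, attrs):
        # Look for start_<tag> first, then do_<tag>, in that order.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                method(attrs)
                return
        # Neither method exists, so fall back quietly.
        self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("ignored " + tag)


class ImageDispatcher(TinyDispatcher):
    # Defining do_img is all it takes for "img" tags to reach us.
    def do_img(self, attrs):
        for name, value in attrs:
            if name == "src":
                self.log.append("image " + value)


d = ImageDispatcher()
d.dispatch("img", [("src", "logo.png")])
d.dispatch("p", [])
print(d.log)  # ['image logo.png', 'ignored p']
```

The subclass never registers anything; it only has to define a method with
the right name, and the base class finds it at run time.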

> for images? It seems to do this via tag analysis where a tag contains
> "src"  which won't just be images but any inline element (object,
> script, frame, etc.) and how do I overwrite or modify it to work with
> specific tags together with specific content. The pages I'm pulling in

For each specific tag, we can add another method whose name includes that
tag's name.  For example, to get the parser to pay attention to bold and
underline tags, we can do something like this:

###
class EmphasisGlancer(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self,
                                    formatter.NullFormatter())
        self.in_bold = 0
        self.in_underline = 0

    def start_b(self, attrs):
        print "Hey, I see a bold tag!"
        self.in_bold = 1

    def end_b(self):
        self.in_bold = 0

    def start_u(self, attrs):
        print "Hey, I see some underlined text!"
        self.in_underline = 1

    def end_u(self):
        self.in_underline = 0

    def start_blink(self, attrs):
        print "Hey, this is some heinously blinking text... *grrrr*"

    def handle_data(self, data):
        if self.in_bold:
            print "BOLD:", data
        elif self.in_underline:
            print "UNDERLINE:", data
###

I have not tested this code yet, but hopefully I haven't made too many
typos.  If you play around with it, you might find that sgmllib/htmllib
isn't as bad as you think.
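
For anyone trying this on Python 3: htmllib and sgmllib were removed from
the standard library there, so the class above won't import.  A rough
equivalent, assuming html.parser.HTMLParser as a stand-in, might look like
the sketch below.  Note that html.parser does not do the start_b/end_b name
lookup; every tag goes through a single handle_starttag()/handle_endtag()
pair, so the tag name has to be checked by hand.  EmphasisCollector and its
"found" list are names made up for this sketch, and it collects results in
a list rather than printing them as it goes:

```python
from html.parser import HTMLParser


class EmphasisCollector(HTMLParser):
    """Hypothetical Python 3 counterpart to the emphasis-spotting parser."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.in_underline = False
        self.found = []  # (kind, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser hands us the tag name already lowercased.
        if tag == "b":
            self.in_bold = True
        elif tag == "u":
            self.in_underline = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "u":
            self.in_underline = False

    def handle_data(self, data):
        # Only record text that sits inside a <b> or <u> element.
        if self.in_bold:
            self.found.append(("BOLD", data))
        elif self.in_underline:
            self.found.append(("UNDERLINE", data))


collector = EmphasisCollector()
collector.feed("<p>plain <b>loud</b> and <u>lined</u> text</p>")
collector.close()
print(collector.found)  # [('BOLD', 'loud'), ('UNDERLINE', 'lined')]
```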

I should stop here, since the message is too long.  I'll try answering
your second question tomorrow.  Talk to you later!