[Tutor] Help with Parsing HTML files

Mon, 06 Aug 2001 10:04:03 +0200

>Maybe an example will help --- Here's a small example that, given a web
>site, tries to pull out all the image names.  (Rob, here's another useless
>python script.  *grin*)
>
>This example will use htmllib to help us "parse" and hunt down IMG tags.  
>I don't think we need to explicitly rewrite handle_image().  For HTML
>elements that have start and end tags, let's define
>"start_nameofsometag()" and "end_nameofsometag()" methods.  However, since
>an IMG tag stands alone, we'll write a do_img() method instead.
>
>
>
>###
>import htmllib
>import formatter
>import sys
>import urllib
>
>class ImagePuller(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.list_of_images = []
>
>    def do_img(self, attributes):
>        for name, value in attributes:
>            if name == 'src':
>                new_image = value
>                self.list_of_images.append(new_image)
>
>    def getImageList(self):
>        return self.list_of_images
>
>if __name__ == '__main__':
>    url = sys.argv[1]
>    url_contents = urllib.urlopen(url).read()
>    puller = ImagePuller()
>    puller.feed(url_contents)
>    print puller.getImageList()
>###
>
>
>For more information about this, take a look at:
>
>    http://python.org/doc/current/lib/module-sgmllib.html
I've looked at this but as it doesn't come with examples I'm stumped.

If do_img is a method of ImagePuller, when is it called, and how does it look 
for images? It seems to do this via tag analysis, matching any tag that 
contains "src". That would catch not just images but any element with a src 
attribute (script, frame, etc.). How do I override or modify it to work with 
specific tags together with specific content? The pages I'm pulling in are 
"contaminated": HTML and content are horribly mixed, so I can't depend on 
looking at specific tags.
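
A minimal sketch of the dispatch in question, assuming sgmllib-style parsers 
pick the handler from the tag name, so that do_img would only ever run for IMG 
tags, whatever other tags carry a src. It is written for Python 3's 
html.parser, because htmllib and sgmllib are not available there. The 
TagDispatcher class and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TagDispatcher(HTMLParser):
    """Hypothetical stand-in that mimics name-based dispatch to do_<tag>()."""

    def __init__(self):
        super().__init__()
        self.list_of_images = []

    def handle_starttag(self, tag, attrs):
        # Look for a method named after the tag, e.g. do_img for <img>.
        handler = getattr(self, 'do_' + tag, None)
        if handler is not None:
            handler(attrs)

    def do_img(self, attributes):
        # Only reached for <img> tags, so a src on <script> never gets here.
        for name, value in attributes:
            if name == 'src':
                self.list_of_images.append(value)


# Made-up sample page: three tags carry a src, only one is an image.
page = ('<script src="menu.js"></script>'
        '<img src="logo.gif" alt="logo">'
        '<frame src="nav.html">')

puller = TagDispatcher()
puller.feed(page)
puller.close()
print(puller.list_of_images)
```

To react to another tag, the same pattern would need one more method named 
after that tag (do_frame, say) rather than a change to do_img.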

My current script looks for a specific table beginning
<table border = "0" cellpadding = "0"
and then <b>Runde</b>

but the next one I have to write looks for a table cell and has a different 
separator.

I can parse the files with htmllib and sgmllib. I think sgmllib is probably 
more useful, as I can check each tag and start "buffering" (I think that's the 
appropriate term) content if necessary, provided I look at the results of 
sgmllib line by line. Maybe once I've mastered this I can put the definition 
into a method?
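
The "buffering" idea could look roughly like the sketch below. It assumes 
Python 3's html.parser in place of sgmllib. The CellBuffer class, its 
attribute names and the sample markup are hypothetical, not taken from the 
real pages.

```python
from html.parser import HTMLParser


class CellBuffer(HTMLParser):
    """Collect the text of every <td> cell; everything else is ignored."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while we are between <td> and </td>
        self.buffer = []       # text pieces of the current cell
        self.cells = []        # finished cells, in document order

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; start a fresh buffer on <td>.
        if tag == 'td':
            self.in_cell = True
            self.buffer = []

    def handle_data(self, data):
        # Called for the text between tags; keep it only inside a cell.
        if self.in_cell:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # On </td>, join the buffered pieces and store the finished cell.
        if tag == 'td' and self.in_cell:
            self.cells.append(''.join(self.buffer).strip())
            self.in_cell = False


# Made-up sample markup, loosely modelled on the table described above.
sample = ('<table border="0"><tr>'
          '<td><b>Runde</b> 1</td><td>2:1</td>'
          '</tr></table>')

parser = CellBuffer()
parser.feed(sample)
parser.close()
print(parser.cells)
```

Markup inside a cell (the <b> here) is skipped and only its text lands in the 
buffer, which may or may not be what the real pages need.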

On a different note, but to do with the fact that I have trouble understanding 
classes: there is a syntactic inconsistency between methods and ordinary 
functions, isn't there?

In a function the number of arguments is the same in its definition as when 
it's called.

def does_little(something):
    print something

does_little('hi')
prints 'hi'

In a class I'd always have to have the magical 'self' in there as well, 
wouldn't I?

class NewObject:
    def does_little(self, something):
        print something

x = NewObject()
x.does_little('hi')
prints 'hi'

I don't know why but I just have trouble with this inconsistency. Is it 
really necessary?
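
For comparison, a small sketch of the two call forms, assuming Python 3 print 
syntax. A method call on an instance passes that instance as the first 
argument, so the definition and the call line up once self is counted.

```python
class NewObject:
    def does_little(self, something):
        print(something)


x = NewObject()

# Usual form: x is handed to the method as 'self' automatically.
x.does_little('hi')

# Same call with 'self' written out explicitly.
NewObject.does_little(x, 'hi')
```

Both calls print hi, so the two-argument definition matches a two-argument 
call, with one of the arguments written in front of the dot.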