[Tutor] Help with Parsing HTML files
Charlie Clark <email@example.com>
Mon, 06 Aug 2001 10:04:03 +0200
>Maybe an example will help --- Here's a small example that, given a web
>site, tries to pull out all the image names. (Rob, here's another useless
>python script. *grin*)
>This example will use htmllib to help us "parse" and hunt down IMG tags.
>I don't think we need to explicitly rewrite handle_image(). For HTML
>elements that have start and end tags, let's define
>"start_nameofsometag()" and "end_nameofsometag()" methods. However, since
>an IMG tag stands alone, we'll write a do_img() method instead.
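>import formatter
>import htmllib
>import sys
>import urllib
>
>class ImagePuller(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
>        self.list_of_images = []
>
>    def do_img(self, attributes):
>        # attributes is a list of (name, value) pairs from one IMG tag
>        for name, value in attributes:
>            if name == 'src':
>                new_image = value
>                self.list_of_images.append(new_image)
>
>    def getImageList(self):
>        return self.list_of_images
>
>if __name__ == '__main__':
>    url = sys.argv[1]
>    url_contents = urllib.urlopen(url).read()
>    puller = ImagePuller()
>    puller.feed(url_contents)
>    print puller.getImageList()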
>For more information about this, take a look at:
I've looked at this but as it doesn't come with examples I'm stumped.
If do_img is a method of ImagePuller, when is it called, and how does it look
for images? It seems to do this via tag analysis, picking out any tag that
contains "src". That won't just be images but any inline element (object,
script, frame, etc.). How do I override or modify it to work with specific
tags together with specific content? The pages I'm pulling in are
"contaminated" - HTML and content are horribly mixed so that I can't depend on
looking at specific
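On the "when is it called" question: in sgmllib-style parsers, do_img is not called by your own code. feed() scans the text and, for each start tag it meets, looks up a method named after that tag, so do_img should only ever run for IMG tags and never sees the src of a script or frame. What follows is a minimal sketch of that dispatch, not htmllib itself. It assumes Python 3, where htmllib is no longer shipped, so it swaps in html.parser; TagDispatcher and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TagDispatcher(HTMLParser):
    """Hypothetical helper: mimics sgmllib-style do_<tag> dispatch."""

    def handle_starttag(self, tag, attrs):
        # feed() calls this once per start tag; the tag name arrives lowercased.
        method = getattr(self, 'do_' + tag, None)
        if method is not None:
            method(attrs)


class ImagePuller(TagDispatcher):
    def __init__(self):
        super().__init__()
        self.list_of_images = []

    def do_img(self, attributes):
        # Only reached for <img> tags, so src on <script> or <frame> is ignored.
        for name, value in attributes:
            if name == 'src':
                self.list_of_images.append(value)

    def getImageList(self):
        return self.list_of_images


# Made-up sample page: two images and one script that also carries a src.
sample = '<p><img src="a.gif"><script src="x.js"></script><IMG SRC="b.png"></p>'
puller = ImagePuller()
puller.feed(sample)
puller.close()
print(puller.getImageList())
```

To react to another tag, add a method named after it (do_frame, for instance) and filter on the attribute values inside that method.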
My current script looks for a specific table beginning
<table border = "0" cellpadding = "0"
and then <b>Runde</b>
but the next one I have to write looks for a table cell and has a different
I can parse the files with htmllib and sgmllib. I think sgmllib is probably
more useful, as I can check each tag and start "buffering" (I think that's the
appropriate term) content if necessary, provided I look at the results of
sgmllib line by line. Maybe once I've mastered this I can put the definition
into a method?
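The buffering idea could look roughly like this: switch a buffer on when a cell opens inside the table you care about, collect the text, and switch it off when the cell closes. This is only a sketch under assumptions. It uses Python 3's html.parser in place of sgmllib, and CellBuffer, wanted_attrs and the sample markup are hypothetical names and data, not anything from the real pages.

```python
from html.parser import HTMLParser


class CellBuffer(HTMLParser):
    """Hypothetical sketch: buffer the text of <td> cells in matching tables."""

    def __init__(self, wanted_attrs):
        super().__init__()
        self.wanted_attrs = wanted_attrs  # attributes the <table> must carry
        self.table_depth = 0              # > 0 while inside a matching table
        self.buffer = None                # list of text pieces while in a <td>
        self.cells = []                   # finished cell texts

    def _matches(self, attrs):
        found = dict(attrs)
        return all(found.get(k) == v for k, v in self.wanted_attrs.items())

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            # Count nested tables too once we are inside a matching one.
            if self.table_depth or self._matches(attrs):
                self.table_depth += 1
        elif tag == 'td' and self.table_depth:
            self.buffer = []              # start buffering

    def handle_endtag(self, tag):
        if tag == 'table' and self.table_depth:
            self.table_depth -= 1
        elif tag == 'td' and self.buffer is not None:
            self.cells.append(''.join(self.buffer).strip())
            self.buffer = None            # stop buffering

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)


# Made-up sample: the first table does not match, the second one does.
sample = '''
<table border="1"><tr><td>skip me</td></tr></table>
<table border = "0" cellpadding = "0">
<tr><td><b>Runde</b></td><td>1</td></tr>
</table>
'''
parser = CellBuffer({'border': '0', 'cellpadding': '0'})
parser.feed(sample)
parser.close()
print(parser.cells)
```

Because the state lives on the parser object, the same class can be reused for a different table by passing different attributes, and a check such as 'Runde' in parser.cells tells you whether the right table was found.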
On a different note, but to do with the fact that I have trouble understanding
classes: there is a lexical inconsistency between class methods and standard
functions, isn't there?
In a function the number of arguments is the same in its definition as when
it is called, whereas in a class I'd always have to have the magical 'self' in
there as well:
def does_little(self, something):
x = NewObject('hi')
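One way to see that the argument counts do line up: calling a method on an instance is shorthand for calling the function on the class with the instance passed explicitly as the first argument. A small sketch, in Python 3 syntax; the __init__ and the greeting attribute are assumptions added so the fragment runs, since the snippet above does not show how NewObject is defined.

```python
class NewObject:
    def __init__(self, greeting):
        # Assumed constructor: stores the argument given to NewObject(...)
        self.greeting = greeting

    def does_little(self, something):
        # self is the instance the method was called on
        return self.greeting + ' ' + something


x = NewObject('hi')

# The usual call: one argument written, x is passed as self automatically.
via_instance = x.does_little('there')

# The same call spelled out as a plain function: two arguments, as defined.
via_class = NewObject.does_little(x, 'there')

print(via_instance)
print(via_class)
```

Both calls run the same function with the same two arguments; the instance form just fills in self for you.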
I don't know why but I just have trouble with this inconsistency. Is it