[Tutor] Help with Parsing HTML files
Charlie Clark <charlie@begeistert.org>
Mon, 06 Aug 2001 10:04:03 +0200
>Maybe an example will help --- Here's a small example that, given a web
>site, tries to pull out all the image names. (Rob, here's another useless
>python script. *grin*)
>
>This example will use htmllib to help us "parse" and hunt down IMG tags.
>I don't think we need to explicitly rewrite handle_image(). For HTML
>elements that have start and end tags, let's define
>"start_nameofsometag()" and "end_nameofsometag()" methods. However, since
>an IMG tag stands alone, we'll write a do_img() method instead.
>
>
>
>###
>import htmllib
>import formatter
>import sys
>import urllib
>
>class ImagePuller(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.list_of_images = []
>
>    def do_img(self, attributes):
>        for name, value in attributes:
>            if name == 'src':
>                new_image = value
>                self.list_of_images.append(new_image)
>
>    def getImageList(self):
>        return self.list_of_images
>
>if __name__ == '__main__':
>    url = sys.argv[1]
>    url_contents = urllib.urlopen(url).read()
>    puller = ImagePuller()
>    puller.feed(url_contents)
>    print puller.getImageList()
>###
>
>
>For more information about this, take a look at:
>
> http://python.org/doc/current/lib/module-sgmllib.html
I've looked at this, but as it doesn't come with examples I'm stumped.
If do_img is a method of ImagePuller, when is it called, and how does it look
for images? It seems to do this via tag analysis, picking up any tag that
contains "src", which won't just be images but any inline element (object,
script, frame, etc.). And how do I override or modify it to work with specific
tags together with specific content? The pages I'm pulling in are
"contaminated" - HTML and content are horribly mixed, so I can't depend on
looking at specific tags.
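One way to picture when do_img gets called: the parser reads the tag name and
looks up a method with a matching name, so the match is on the tag (img), not
on the "src" attribute. The sketch below only imitates that lookup; it is not
sgmllib's own code. Assumptions: it is written for Python 3 and uses
html.parser (htmllib and sgmllib do not ship with Python 3), and
DispatchingParser is a hypothetical helper name.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical helper: imitates sgmllib-style method lookup by tag name."""

    def handle_starttag(self, tag, attrs):
        # <img ...> looks for start_img or do_img, <b> for start_b or do_b, etc.
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is not None:
            method(attrs)

    def handle_endtag(self, tag):
        method = getattr(self, 'end_' + tag, None)
        if method is not None:
            method()


class ImagePuller(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.list_of_images = []

    def do_img(self, attributes):
        # Only reached for <img> tags; attributes is a list of (name, value).
        for name, value in attributes:
            if name == 'src':
                self.list_of_images.append(value)


puller = ImagePuller()
puller.feed('<p><img src="a.gif"><script src="x.js"></script>'
            '<img alt="no source"></p>')
puller.close()
print(puller.list_of_images)
```

The script tag also carries a "src", but because there is no do_script method
it is ignored. Adding a do_script or start_table method would be the way to
react to other specific tags.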
My current script looks for a specific table beginning
<table border = "0" cellpadding = "0"
and then <b>Runde</b>
but the next one I have to write looks for a table cell and has a different
separator.
I can parse the files with htmllib and sgmllib. I think sgmllib is probably
more useful, as I can check each tag and start "buffering" (I think that's the
appropriate term) content if necessary, provided I look at the results of
sgmllib line by line. Maybe once I've mastered this I can put the definition
into a method?
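A minimal sketch of that "buffering" idea, under stated assumptions: it is
written for Python 3 with html.parser rather than sgmllib, the class name
RundeBuffer and the sample page are made up for illustration, and it assumes
the interesting table is the one with border "0" and cellpadding "0" and that
the content wanted is whatever follows <b>Runde</b> inside it.

```python
from html.parser import HTMLParser


class RundeBuffer(HTMLParser):
    """Hypothetical: collects text that follows <b>Runde</b> in one table."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0    # > 0 while inside the table we care about
        self.in_bold = False
        self.buffering = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            if self.table_depth:
                self.table_depth += 1      # nested table inside our table
            elif attrs.get('border') == '0' and attrs.get('cellpadding') == '0':
                self.table_depth = 1
        elif tag == 'b' and self.table_depth:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'table' and self.table_depth:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.buffering = False     # left the table: stop collecting
        elif tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_bold and text == 'Runde':
            self.buffering = True          # marker found: keep what follows
        elif self.buffering and text:
            self.chunks.append(text)


page = '''
<table border="1"><tr><td>navigation</td></tr></table>
<table border = "0" cellpadding = "0">
<tr><td><b>Runde</b></td><td>1</td></tr>
<tr><td>Team A</td><td>Team B</td></tr>
</table>
<p>footer</p>
'''
parser = RundeBuffer()
parser.feed(page)
parser.close()
print(parser.chunks)
```

Since the state lives in the parser object, a script that needs a different
start marker or separator could use a second small class of the same shape
rather than a line-by-line scan.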
On a different note, but related to the fact that I have trouble understanding
classes: there is a lexical inconsistency between class methods and standard
functions, isn't there?
In a function, the number of arguments is the same in its definition as when
it's called.
def does_little(something):
    print something

does_little('hi')
prints 'hi'
In a class I'd always have to have the magical 'self' in there as well,
wouldn't I?
class NewObject:
    def does_little(self, something):
        print something

x = NewObject()
x.does_little('hi')
prints 'hi'
I don't know why but I just have trouble with this inconsistency. Is it
really necessary?
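For what it's worth, the two argument counts do line up once the instance is
counted: calling a method on an object passes that object as the first
argument. A small sketch, written for Python 3 (an assumption; the snippets
above use the Python 2 print statement) and returning the value instead of
printing it so the two calls can be compared:

```python
class NewObject:
    def does_little(self, something):
        return something


x = NewObject()

# Calling the method through the instance...
a = x.does_little('hi')

# ...is shorthand for calling the function stored on the class and
# passing the instance explicitly as 'self'.
b = NewObject.does_little(x, 'hi')

print(a, b)
```

Both calls hand over two arguments; the first form just takes 'self' from the
object to the left of the dot.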