Help with parsing web page

RiGGa rigga at hasnomail.com
Tue Jun 15 15:15:51 CEST 2004


Miki Tebeka wrote:

> Hello RiGGa,
> 
>> Anyone? I have found out I can use sgmllib, but I find the documentation
>> is not that clear. If anyone knows of a tutorial or howto, it would be
>> appreciated.
> I'm not an expert but this is how I work:
> 
> You make a subclass of HTMLParser and override the callback functions.
> Usually I use only start_<TAG>, end_<TAG> and handle_data.
> Since you don't know *when* each callback function is called you need to
> keep an internal state. It can be a simple variable or a stack if you
> want to deal with nested tags.
> 
> A short example:
> #!/usr/bin/env python
> 
> from htmllib import HTMLParser
> from formatter import NullFormatter
> 
> class TitleParser(HTMLParser):
>     def __init__(self):
>         HTMLParser.__init__(self, NullFormatter())
>         self.state = ""
>         self.data = ""
>     
>     def start_title(self, attrs):
>         self.state = "title"
>         self.data = ""
> 
>     def end_title(self):
>         print "Title:", self.data.strip()
>         self.state = ""  # stop collecting text once </title> is seen
> 
>     def handle_data(self, data):
>         if self.state:
>             self.data += data
> 
> if __name__ == "__main__":
>     from sys import argv
> 
>     parser = TitleParser()
>     parser.feed(open(argv[1]).read())
> 
> HTH.
> --
> -------------------------------------------------------------------------
> Miki Tebeka <miki.tebeka at zoran.com>
> The only difference between children and adults is the price of the toys.
Thanks for taking the time to help, it's appreciated. I am new to Python, so
I am a little confused by what you have posted; however, I will go through
it again and see if it makes more sense.

Many thanks

Rigga
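
For anyone reading this thread later: htmllib was removed in Python 3, where
the standard-library replacement is html.parser. Below is a rough sketch of the
same idea (subclass the parser, override the callbacks, keep internal state)
using a stack so that nested tags are handled, as suggested above. The names
extract_title and open_tags are made up for illustration, and this is only one
possible way to do it, not a definitive implementation.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside <title>, tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open (hypothetical name)
        self.title = ""      # text collected while <title> is the innermost tag

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; remember it on the stack.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, discarding any tags
        # that were left unclosed (e.g. <meta> or <br>).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Only keep text whose innermost open tag is <title>.
        if self.open_tags and self.open_tags[-1] == "title":
            self.title += data


def extract_title(html):
    """Hypothetical helper: feed a whole document and return its title text."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()
```

Usage would be something like extract_title(open("page.html").read()), where
page.html is whatever file you saved the page to. A single state variable is
enough for just the title; the stack only starts to matter once you care about
which tag a piece of text is nested inside.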


