Help with parsing web page
RiGGa
rigga at hasnomail.com
Sat Jun 19 04:01:03 EDT 2004
RiGGa wrote:
> Miki Tebeka wrote:
>
>> Hello RiGGa,
>>
>>> Anyone? I have found out I can use sgmllib, but I find the documentation
>>> is not that clear. If anyone knows of a tutorial or howto, it would be
>>> appreciated.
>> I'm not an expert but this is how I work:
>>
>> You make a subclass of HTMLParser and override the callback functions.
>> Usually I use only start_<TAG>, end_<TAG> and handle_data.
>> Since you don't know *when* each callback function is called you need to
>> keep an internal state. It can be a simple variable or a stack if you
>> want to deal with nested tags.
>>
>> A short example:
>> #!/usr/bin/env python
>>
>> from htmllib import HTMLParser
>> from formatter import NullFormatter
>>
>> class TitleParser(HTMLParser):
>>     def __init__(self):
>>         HTMLParser.__init__(self, NullFormatter())
>>         self.state = ""
>>         self.data = ""
>>
>>     def start_title(self, attrs):
>>         self.state = "title"
>>         self.data = ""
>>
>>     def end_title(self):
>>         print "Title:", self.data.strip()
>>         self.state = ""  # stop collecting text once </title> is seen
>>
>>     def handle_data(self, data):
>>         if self.state:
>>             self.data += data
>>
>> if __name__ == "__main__":
>>     from sys import argv
>>
>>     parser = TitleParser()
>>     parser.feed(open(argv[1]).read())
>>
>> HTH.
>> --
>> -------------------------------------------------------------------------
>> Miki Tebeka <miki.tebeka at zoran.com>
>> The only difference between children and adults is the price of the toys.
> Thanks for taking the time to help, it's appreciated. I am new to Python, so
> I'm a little confused by what you have posted; however, I will go through it
> again and see if it makes more sense.
>
> Many thanks
>
> Rigga
Said I would be back :)
How do I get the current position (offset) in the file while parsing?
I have tried getpos() and variations thereof and keep getting syntax
errors...
Thanks
R
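
For what it's worth, here is a rough sketch of one way the position question could be approached. It rests on a few assumptions: it uses Python 3's html.parser.HTMLParser rather than the htmllib class from the quoted example, and the names in_title, title_pos and parse_title are made up for illustration. getpos() is a method of the parser object, so inside a callback it is called as self.getpos() with the parentheses, and it returns a (line number, column offset) pair rather than a byte offset into the file. Whether a missing "self." or missing parentheses is what caused the syntax errors is only a guess, since the failing code was not posted.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the <title> text and remember where the tag started."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # internal state: are we inside <title>?
        self.title = ""
        self.title_pos = None   # (line, column) of the <title> start tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.title = ""
            # getpos() is called on the parser instance itself.
            self.title_pos = self.getpos()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only accumulate text while inside <title>.
        if self.in_title:
            self.title += data


def parse_title(html_text):
    """Hypothetical helper: return (title text, position of <title>)."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.title_pos


if __name__ == "__main__":
    sample = "<html>\n<head><title> Hello </title></head>\n</html>"
    print(parse_title(sample))
```

Lines are counted from 1 and columns from 0. If an absolute offset into the file is needed, it would have to be computed separately, for example from the lengths of the preceding lines.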