Help with parsing web page
Miki Tebeka
miki.tebeka at zoran.com
Tue Jun 15 06:18:17 EDT 2004
Hello RiGGa,
> Anyone?, I have found out I can use sgmllib but find the documentation is
> not that clear, if anyone knows of a tutorial or howto it would be
> appreciated.
I'm not an expert but this is how I work:
You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAG>, end_<TAG> and handle_data.
Since you don't know *when* each callback will be called, you need to
keep some internal state. It can be a simple variable, or a stack if you
want to deal with nested tags.
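To make the stack idea concrete, here is a rough sketch. It assumes Python 3's html.parser module rather than htmllib, so the callbacks are handle_starttag/handle_endtag instead of start_<TAG>/end_<TAG>. The names NestedTextParser, stack and found are made up for illustration.

```python
from html.parser import HTMLParser


class NestedTextParser(HTMLParser):
    """Records each piece of text together with the tags it is nested in."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # (tag path, text) pairs seen so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag. Inner tags that were never
        # closed (e.g. <br>) are discarded along the way.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


parser = NestedTextParser()
parser.feed("<ul><li>one <b>two</b></li><li>three</li></ul>")
parser.close()
for path, text in parser.found:
    print(path, "->", text)
```

The stack tells handle_data where it is in the document, which a single state variable cannot do once tags nest.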
A short example:
#!/usr/bin/env python
from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self, NullFormatter())
        self.state = ""   # name of the tag we are inside, "" if none
        self.data = ""    # text collected while inside that tag

    def start_title(self, attrs):
        self.state = "title"
        self.data = ""

    def end_title(self):
        print "Title:", self.data.strip()
        self.state = ""   # stop collecting once the title is closed

    def handle_data(self, data):
        if self.state:
            self.data += data

if __name__ == "__main__":
    from sys import argv
    parser = TitleParser()
    parser.feed(open(argv[1]).read())
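
For anyone on Python 3, where htmllib is no longer available, the same approach can be sketched with html.parser. This is an assumption-laden port, not the code above: the per-tag start_title/end_title methods become checks inside handle_starttag/handle_endtag, and extract_title is a hypothetical helper added for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # the "internal state" flag
        self.title = None       # set once </title> has been seen
        self._chunks = []       # text pieces collected inside <title>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.title = "".join(self._chunks).strip()

    def handle_data(self, data):
        if self.in_title:
            self._chunks.append(data)


def extract_title(html):
    """Hypothetical helper: return the page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = "<html><head><title> Hello, world </title></head><body><p>text</p></body></html>"
print("Title:", extract_title(page))
```

The flag is cleared in handle_endtag so that text after the title is not collected.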
HTH.
--
-------------------------------------------------------------------------
Miki Tebeka <miki.tebeka at zoran.com>
The only difference between children and adults is the price of the toys.