understanding htmllib
Fredrik Lundh
fredrik at pythonware.com
Wed Oct 4 03:06:06 EDT 2006
David Bear wrote:
> I'm trying to understand how to use the HTMLParser in htmllib but I'm not
> seeing enough examples.
>
> I just want to grab the contents of everything enclosed in a '<body>' tag,
> i.e. items from where <body> begins to where </body> ends. I start by doing
>
> class HTMLBody(HTMLParser):
>     def __init__(self):
>         self.contents = []
>
>     def handle_starttag()..
>
> Now I'm stuck. I can't see that there is a method on handle_starttag that
> would return everything to the end tag. And I haven't seen anything on how
> to define my own handle_unknowntag..
htmllib is designed to be used together with a formatting object. if
you just want to work with tags, use sgmllib instead. some variation of
the SGMLFilter example on this page might be what you need:
http://effbot.org/librarybook/sgmllib.htm
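for readers on python 3, where htmllib and sgmllib are no longer in the standard library, the same event-stream idea can be sketched with html.parser instead. this is a swapped-in module, not the SGMLFilter example from the page above, and the names BodyExtractor and body_contents are made up for illustration. it copies start tags, end tags, text and entity references through while the parser is inside <body>; comments and declarations are dropped, and end tag names come out lowercased:

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect the markup found between <body> and </body> (illustrative name)."""

    def __init__(self):
        # convert_charrefs=False keeps entity references out of handle_data,
        # so they reach handle_entityref/handle_charref and can be copied as-is
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as written in the source
            self.contents.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # called for XHTML-style empty tags such as <br/>
        if self.in_body:
            self.contents.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.contents.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.contents.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.contents.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.contents.append("&#%s;" % name)


def body_contents(html):
    """Return everything between <body> and </body> as one string."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.contents)
```

calling body_contents(open("page.html").read()) would then give the body markup as a string. the in_body flag is the whole trick: the handlers see every event in the document, and the flag decides which ones get recorded.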
if you want a DOM-like structure instead of an event stream, use
http://www.crummy.com/software/BeautifulSoup/
usage:
>>> import BeautifulSoup as BS
>>> soup = BS.BeautifulSoup(open("page.html"))
>>> str(soup.body)
'<body>\n<h1>Body Title</h1>\n<p>Paragraph</p>\n</body>'
>>> soup.body.renderContents()
'\n<h1>Body Title</h1>\n<p>Paragraph</p>\n'
</F>