URL listers
Peter Otten
__peter__ at web.de
Mon Nov 17 04:28:16 EST 2003
P. Daniell wrote:
> I have the following HTML document
>
> <html>
> <body>
> <a href="http://www.yahoo.com">I don't give a hoot</a>
> </body>
> </html>
>
> I want my HTMLParser subclass (code below) to output
>
> http://www.yahoo.com I don't give a hoot
>
> Instead it outputs
>
> http://www.yahoo.com I don
> http://www.yahoo.com '
> http://www.yahoo.com t give a hoot
>
>
> Would anyone care to give me some guidance on how to fix this?
handle_data() can be called multiple times inside <tag>...</tag>, so you
must collect the chunks (see the text attribute below) and only print them
in the anchor_end() method:
# Python 2 only: htmllib was removed in Python 3.
import formatter
import htmllib

class URLLister(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.in_a = 0
        self.tempurl = ''
        self.text = []    # chunks of link text seen since <a ...>

    def anchor_bgn(self, href, name, type):
        self.in_a = 1
        self.tempurl = href

    def anchor_end(self):
        # The link is complete: join the chunks and print one line.
        print self.tempurl, "".join(self.text)
        del self.text[:]
        self.in_a = 0

    def handle_data(self, data):
        if self.in_a:
            self.text.append(data)

By the way, there is another HTMLParser in the HTMLParser module,
which I think is superior.
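
For comparison, here is a rough sketch of the same chunk-collecting idea on top of that other parser. It assumes Python 3, where the HTMLParser module is named html.parser. The links list and the href attribute are names made up for this illustration, not part of the parser's API.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.href = ''
        self.text = []    # chunks of text seen inside the current <a>
        self.links = []   # hypothetical result list: (href, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            # attrs is a list of (name, value) pairs
            self.href = dict(attrs).get('href') or ''
            self.text = []

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_a:
            # Only now is the link text complete, so join the chunks here.
            self.links.append((self.href, ''.join(self.text)))
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.text.append(data)


lister = URLLister()
lister.feed('<html><body>'
            '<a href="http://www.yahoo.com">I don\'t give a hoot</a>'
            '</body></html>')
lister.close()
for href, text in lister.links:
    print(href, text)
```

Storing the pairs in a list instead of printing them inside handle_endtag() keeps the parser reusable; the loop at the end does the printing.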
Peter