URL listers

Peter Otten __peter__ at web.de
Mon Nov 17 04:28:16 EST 2003


P. Daniell wrote:

> I have the following HTML document
> 
> <html>
> <body>
> <a href="http://www.yahoo.com">I don't give a hoot</a>
> </body>
> </html>
> 
> I want my HTMLParser subclass (code below) to output
> 
> http://www.yahoo.com I don't give a hoot
> 
> Instead it outputs
> 
> http://www.yahoo.com I don
> http://www.yahoo.com  '
> http://www.yahoo.com t give a hoot
> 
> 
> Would anyone care to give me some guidance on how to fix this?

handle_data() can be called multiple times inside <tag>...</tag>, so you
must collect the chunks (see the text attribute below) and only print them
in the anchor_end() method:

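import formatter
import htmllib

class URLLister(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.in_a = 0       # true while we are inside <a>...</a>
        self.tempurl = ''   # href of the anchor currently open
        self.text = []      # chunks of link text seen so far

    def anchor_bgn(self, href, name, type):
        self.in_a = 1
        self.tempurl = href

    def anchor_end(self):
        # the anchor is complete: join the chunks and print one line
        print self.tempurl, "".join(self.text)
        del self.text[:]
        self.in_a = 0

    def handle_data(self, data):
        # may be called several times per anchor, so only collect here
        if self.in_a:
            self.text.append(data)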


By the way, there is another HTMLParser in the HTMLParser module,
which I think is superior.
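
For comparison, here is a rough sketch of the same chunk-collecting idea on
top of that other parser. It assumes Python 3, where the HTMLParser module is
named html.parser. The links attribute is my own addition for illustration
(results are stored instead of printed right away), so treat this as one
possible layout rather than the definitive way to do it:

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.tempurl = ''
        self.text = []    # chunks of link text for the open anchor
        self.links = []   # hypothetical attribute: finished (url, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_a = True
            # attrs is a list of (name, value) pairs; value may be None
            self.tempurl = dict(attrs).get('href') or ''
            self.text = []

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_a:
            # the anchor is complete: join the chunks into one string
            self.links.append((self.tempurl, ''.join(self.text)))
            self.in_a = False

    def handle_data(self, data):
        # may be called more than once per anchor, so only collect here
        if self.in_a:
            self.text.append(data)


doc = """<html>
<body>
<a href="http://www.yahoo.com">I don't give a hoot</a>
</body>
</html>"""

lister = URLLister()
lister.feed(doc)
lister.close()
for url, text in lister.links:
    print(url, text)
```

Run against the document from the question, this should print the URL and the
complete link text on a single line, even when the input is fed to the parser
in several pieces.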

Peter



