[Tutor] problems with HTMLParser

Sean 'Shaleh' Perry shaleh@valinux.com
Wed, 24 Jan 2001 15:48:13 -0800 (PST)


On 24-Jan-2001 Sean 'Shaleh' Perry wrote:
>> 
>> About the line justification; I think that one of the regular formatters
>> will actually do line justification for you, but I haven't looked too
>> closely into them yet.  Take a look at DumbWriter:
>> 
>>     http://python.org/doc/current/lib/writer-interface.html
>> 
>> and see if it's suitable for your program.
>> 
>> 
> 
> I was not clear on the last part.
> 
> <a href="http://example.com">Go to example.com</a>
> 
> How do I get the text "Go to example.com"?
> 

/me often forgets how easy Python source is to read (-:

Reading sgmllib/htmllib, I find that I need to define a handle_data() method:
the parser calls it with the text found between tags.
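
For readers on Python 3, where sgmllib and htmllib no longer exist, the same
idea can be sketched with html.parser from the standard library. This is only
an illustration under that assumption, not the code from this thread: the
names AnchorText and anchor_links are made up for the example, and it uses
html.parser's handle_starttag()/handle_endtag() hooks in place of the
start_a()/end_a() methods that sgmllib dispatches to.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.href = None   # href of the anchor currently open, if any
        self.chunks = []   # text pieces seen inside that anchor
        self.links = []    # finished (href, text) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look href up by name.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.href = href
                self.chunks = []

    def handle_data(self, data):
        # Can be called more than once per anchor, so only accumulate here.
        if self.href is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        # The anchor is complete: store the pair exactly once.
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.chunks)))
            self.href = None


def anchor_links(html):
    """Hypothetical helper: parse a string and return its (href, text) pairs."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return parser.links


print(anchor_links('<a href="http://example.com">Go to example.com</a>'))
# prints [('http://example.com', 'Go to example.com')]
```

The text is joined in handle_endtag() rather than stored in handle_data()
because an anchor such as <a href="x">A <b>B</b> C</a> delivers its text in
several pieces.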

So, my TOC reader is now working, yay (-:  Need to clean up the output and some
other sundries, but below is a good idea of what is happening.

Suggestions on code improvements and Python idioms are welcome.

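from htmllib import HTMLParser
from formatter import NullFormatter

class myHTML(HTMLParser):
    """Collect (href, link text) pairs for anchors found inside <li> items."""

    def __init__(self, formatter, verbose = 0):
        HTMLParser.__init__(self, formatter, verbose)
        self.found_anchor = 0   # true while inside an <a> we want
        self.want_anchor = 0    # true while inside an <li>
        self.tmp_url = None     # href of the anchor being read
        self.tmp_text = []      # text chunks of the anchor being read
        self.urls = []          # finished (href, text) pairs

    def start_ul(self, attributes):
        pass

    def end_ul(self):
        self.want_anchor = 0

    def start_li(self, attributes):
        self.want_anchor = 1

    def end_li(self):
        self.want_anchor = 0

    def start_a(self, attributes):
        if not self.want_anchor:
            return
        # attributes is a list of (name, value) pairs; href need not be first
        for name, value in attributes:
            if name == 'href':
                self.tmp_url = value
                self.tmp_text = []
                self.found_anchor = 1
                break

    def end_a(self):
        if self.found_anchor:
            # handle_data() may fire several times per anchor, so record
            # the pair only once, when the anchor closes
            self.urls.append((self.tmp_url, ''.join(self.tmp_text)))
            self.found_anchor = 0

    def handle_data(self, data):
        if self.found_anchor:
            self.tmp_text.append(data)

if __name__ == '__main__':
    # 'path' rather than 'file', which would shadow the builtin
    path = '/usr/share/doc/debian-policy/policy.html/index.html'
    fp = open(path)
    data = fp.read()
    fp.close()
    parser = myHTML(NullFormatter())
    parser.feed(data)
    parser.close()
    for url, text in parser.urls:
        print("%s => %s" % (url, text))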