HTMLParser question

Benjamin Niemann b.niemann at
Thu Aug 19 17:51:08 CEST 2004

Rajarshi Guha wrote:
> Hi,
>   I have some HTML that essentially consists of a series of <div>'s
> and each <div> having one of two classes (tnt-question or tnt-answer).
> I'm using HTMLParser to handle the tags as:
> import HTMLParser
> import string
> class MyHTMLParser(HTMLParser.HTMLParser):
>     def handle_starttag(self, tag, attrs):
>         if len(attrs) == 1:
>             cls,whichcls = attrs[0]
>             if whichcls == 'tnt-question':
>                 print self.get_starttag_text(), self.getpos()
>     def handle_endtag(self, tag):
>         pass
>     def handle_data(self, data):
>         print data
> if __name__ == '__main__':
>     htmldata = string.join(open('tt.html','r').readlines())
>     parser = MyHTMLParser()
>     parser.feed( htmldata )
> However what I would like is that when the parser reaches some HTML like
> this:
>         <div class="tnt-question">
>             How do I add a user to a MySQL system?
>         </div>
> I should get back the data between the open and close tags. However the
> above code prints the text contained between all tags, not just the <div>
> tags with the class='tnt-question'.
> Is there a way to call handle_data() when a specific tag is being handled?
> Placing a call to handle_data() in handle_starttag seems to be the way -
> but I'm not sure how to actually do it - what data should I pass to the
> call?
Set a flag when the parser calls handle_starttag() and the tag matches
your criteria, and unset it when the corresponding end tag is found.
You'll probably have to count the nesting depth, so that for
<div class="printme">Yo <div>man</div>!</div>
the flag is unset on the second </div>. Then, in handle_data(), print
the data only when the flag is set.
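
A minimal sketch of that approach, written for Python 3, where the module is html.parser (the quoted code uses the Python 2 name, HTMLParser). The class name QuestionParser and the attributes depth, chunks and questions are made up for illustration. The depth counter doubles as the flag: zero means "not inside a matching <div>". The text is collected into a list rather than printed, and the sketch assumes the class attribute is exactly "tnt-question".

```python
from html.parser import HTMLParser


class QuestionParser(HTMLParser):
    """Collect the text inside <div class="tnt-question"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # 0 means the flag is unset
        self.chunks = []      # text pieces of the question being read
        self.questions = []   # finished questions

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested <div> inside a matching one: only count it.
            self.depth += 1
        elif dict(attrs).get("class") == "tnt-question":
            # The tag matches: set the flag and start collecting.
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The corresponding </div>: the flag is unset again.
                self.questions.append("".join(self.chunks).strip())

    def handle_data(self, data):
        # Keep data only while the flag is set.
        if self.depth > 0:
            self.chunks.append(data)


parser = QuestionParser()
parser.feed('<div class="tnt-question">Yo <div>man</div>!</div>'
            '<div class="tnt-answer">ignored</div>')
print(parser.questions)  # ['Yo man!']
```

Counting depth instead of using a plain boolean is what keeps the inner </div> from switching collection off too early.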
