HTML data extraction?

djw dwelch91.nospam at
Mon Dec 22 21:00:08 CET 2003

I don't know if there is anything at a higher level (I guess a Google 
session would tell you that), but doing what you describe with the 
HTMLParser module is very straightforward. All you have to do is keep 
some state flags in the derived HTMLParser class that record whether you 
are currently inside the tags you are looking for, and let those flags 
control which data gets collected.

Starting with the example in the docs, with some (untested) additions:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

     def __init__( self ):
         HTMLParser.__init__( self )
         self.in_bold_tag = False
         self.in_list_tag = False
         self.data_in_bold_list = ''

     def handle_starttag(self, tag, attrs):
         print "Encountered the beginning of a %s tag" % tag
         if tag == 'b': self.in_bold_tag = True
         if tag == 'li' : self.in_list_tag = True

     def handle_endtag(self, tag):
         print "Encountered the end of a %s tag" % tag
         if tag == 'b': self.in_bold_tag = False
         if tag == 'li' : self.in_list_tag = False

     def handle_data( self, data ):
         if self.in_bold_tag and self.in_list_tag:
             self.data_in_bold_list = ''.join( [ self.data_in_bold_list, data ] )

This is just an outline, but you get the idea...
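
For anyone running this on Python 3, here is a rough sketch of the same state-flag idea, extended to the <a href> case from the question. It assumes the Python 3 module name (html.parser rather than HTMLParser). The class name BoldInListParser, the attributes bold_in_list and links, and the sample HTML are made up for illustration. It collects every link that has an href, not only a particular value.

```python
# Python 3 sketch (assumption: html.parser, the Python 3 name of the module).
from html.parser import HTMLParser


class BoldInListParser(HTMLParser):
    """Collect text inside <b> within <li>, plus (href, text) for <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_bold_tag = False
        self.in_list_tag = False
        self.current_href = None  # href of the <a> we are inside, if any
        self.link_text = []       # text chunks of the current link
        self.bold_in_list = []    # text found inside <li> ... <b>here</b>
        self.links = []           # finished (href, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold_tag = True
        elif tag == 'li':
            self.in_list_tag = True
        elif tag == 'a':
            # attrs is a list of (name, value) pairs
            self.current_href = dict(attrs).get('href')
            self.link_text = []

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold_tag = False
        elif tag == 'li':
            self.in_list_tag = False
        elif tag == 'a':
            if self.current_href is not None:
                self.links.append((self.current_href, ''.join(self.link_text)))
            self.current_href = None

    def handle_data(self, data):
        if self.in_bold_tag and self.in_list_tag:
            self.bold_in_list.append(data)
        if self.current_href is not None:
            self.link_text.append(data)


# Hypothetical sample input, only to show how the parser is driven.
sample = ('<ul><li><b>first</b> item</li><li>plain</li></ul>'
          '<p><b>outside</b> <a href="http://example.com/">a link</a></p>')

parser = BoldInListParser()
parser.feed(sample)
parser.close()
print(parser.bold_in_list)
print(parser.links)
```

Simple boolean flags like these do not cope with nested <li> elements or unclosed tags; a depth counter per tag would be the next step if the pages you scrape need it.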


Dave Kuhlman wrote:
> I recently read an article by Jon Udell about extracting data from
> Web pages as a poor person's Web services.  So, I have a question:
> Is there any Python support for finding and extracting information
> from HTML documents.
> I'd like something that would do things like the following:
> - return the data which is inside a <b> tag which is inside a
>   <li> tag.
> - return the data which is inside a <a> tag that has attribute
>   href="".
> - Etc.
> It would be a sort of structured grep for HTML.
> I've found the HTMLParser and htmllib modules in the Python
> standard library, but I'm wondering if there is anything at a
> higher level.
> Web searches did not turn up anything interesting.
> Thanks for help.
> Dave
