Looking for code which allows easy extraction of text from HTML

Joe Francia usenet at soraia.com
Wed Mar 5 18:55:59 CET 2003


Use the SGMLParser class in sgmllib; I find it slightly easier to use
than HTMLParser.  Define a start_<tagname> method (and, if you need it,
an end_<tagname> method) for each <tagname> you want to handle;
handle_data(self, data) is called for every run of text between tags.
The following example prints the text of each anchor on the Google
start page:

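from sgmllib import SGMLParser
import urllib

class ParseMe(SGMLParser):
    """Print the text of every <a> element, wrapped in '<' and '>'."""

    def __init__(self):
        SGMLParser.__init__(self)
        self.in_a = 0  # true while we are inside <a> ... </a>

    def start_a(self, attr):
        # Called for each opening <a> tag; attr is a list of
        # (name, value) pairs.
        self.in_a = 1
        print '<',

    def end_a(self):
        # Called for each closing </a> tag.
        self.in_a = 0
        print '>'

    def handle_data(self, data):
        # Called for all text between tags; only anchor text is printed.
        if self.in_a:
            print data,

if __name__ == '__main__':
    ht = ParseMe()
    ht.feed(urllib.urlopen('http://www.google.com/').read())
    ht.close()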
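
On Python 3, where sgmllib is no longer available, roughly the same
idea can be sketched with html.parser.HTMLParser.  Treat this as a
sketch under a few assumptions rather than a drop-in replacement: the
names AnchorText and anchor_texts are made up for illustration, the
text is collected into a list instead of being printed, and the input
is a literal string rather than a fetched page.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text inside each <a> element (hypothetical class name)."""

    def __init__(self):
        super().__init__()
        self.in_a = False    # True while inside <a> ... </a>
        self.anchors = []    # one string per completed anchor
        self._parts = []     # text fragments of the current anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser has one handler for all tags; tag names arrive lowercased.
        if tag == "a":
            self.in_a = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            self.in_a = False
            self.anchors.append("".join(self._parts).strip())

    def handle_data(self, data):
        # Called for text between tags; keep only text inside anchors.
        if self.in_a:
            self._parts.append(data)


def anchor_texts(html):
    """Return the text of every anchor in an HTML string (hypothetical helper)."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return parser.anchors


if __name__ == "__main__":
    sample = '<p><a href="/a">First <b>link</b></a> and <a href="/b">second</a></p>'
    print(anchor_texts(sample))
```

The main difference from the sgmllib version is that there are no
per-tag start_a/end_a methods, so the tag name has to be checked inside
handle_starttag and handle_endtag.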


Grzegorz Adam Hankiewicz wrote:
> Hello.
> 
> I need to parse a few HTML pages which contain information. These
> pages were generated from a database and thus have a common HTML code
> structure. Is there a package which extracts text given a condition?
> I would need a re-like module for HTML code. I have thought of
> transforming the HTML to XML with HTMLParser and using minidom
> to extract the text with a few recursive text node extraction
> functions. Is there a better way?
> 




