[Tutor] parsing html.

Alan Gauld alan.gauld at btinternet.com
Wed Jan 16 09:40:09 CET 2008


"Shriphani Palakodety" <shriphanip at gmail.com> wrote in

> I have a html document here which goes like this:
>
> <A name=4></a><b>Table of Contents</b>
> .........
> <A name=5></a><b>Preface</b>
>
> Can someone tell me how I can get the string between the <b> tag for
> an a tag for a given value of the name attribute.

Here's an example using the standard library HTML parser
(from an unfinished topic in the tutorial). You could also
use BeautifulSoup, and I recommend that if your needs get
any more complex.

----------------------------------------------
In practice we usually want to extract more specific data from a page,
such as the content of a particular row in a table. For that we need
to override the handle_starttag() and handle_endtag() methods. As an
example, let's extract the text of the second H1 level header:

html = '''
<html><head><title>Test page</title></head>
<body>
<center>
<h1>Here is the first heading</h1>
</center>
<p>A short paragraph
<h1>A second heading</h1>
<p>A paragraph containing a
<a href="www.google.com">hyperlink to google</a>
</body></html>
'''

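# HTMLParser moved to html.parser in Python 3; fall back to the Python 2 name.
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser

class H1Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.h1_count = 0        # number of <h1> start tags seen so far
        self.isHeading = False   # True while inside an <h1> element

    def handle_starttag(self, tag, attributes=None):
        if tag == 'h1':
            self.h1_count += 1
            self.isHeading = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.isHeading = False

    def handle_data(self, data):
        # Only report text found inside the second <h1>.
        if self.isHeading and self.h1_count == 2:
            print("Second Header contained: " + data)

parser = H1Parser()
parser.feed(html)
parser.close()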
----------------------------------------------

Hopefully you can see how to alter that pattern to suit your scenario.

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
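
For the original question (the <b> text that follows an <a> tag with a
given name attribute), the same pattern might be adapted roughly as
below. This is only a minimal sketch, not part of the tutorial: it
assumes Python 3 (where the module is html.parser), assumes the <b>
comes right after the matching anchor as in the quoted snippet, and
the names BoldAfterAnchorParser and bold_text_for are made up for
illustration.

```python
from html.parser import HTMLParser  # Python 3 name; Python 2 used HTMLParser


class BoldAfterAnchorParser(HTMLParser):
    """Collect the <b> text that follows <a name=...> for one target name."""

    def __init__(self, target_name):
        HTMLParser.__init__(self)
        self.target_name = target_name
        self.anchor_seen = False   # True once the most recent <a> matched
        self.in_bold = False       # True while inside the <b> that follows it
        self.chunks = []           # pieces of text collected from that <b>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <A name=4> is reported as 'a'.
        if tag == 'a':
            self.anchor_seen = dict(attrs).get('name') == self.target_name
        elif tag == 'b' and self.anchor_seen:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b' and self.in_bold:
            self.in_bold = False
            self.anchor_seen = False  # only take the first <b> after the anchor

    def handle_data(self, data):
        if self.in_bold:
            self.chunks.append(data)


def bold_text_for(html, name):
    """Return the bold text after the anchor called `name`, or '' if absent."""
    parser = BoldAfterAnchorParser(name)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


sample = '''
<A name=4></a><b>Table of Contents</b>
<A name=5></a><b>Preface</b>
'''

print(bold_text_for(sample, '5'))
```

Attribute values always arrive as strings, so the name is passed as
'5' rather than 5. Any other <a> tag that appears between the anchor
and the <b> resets the match, which suits the snippet above but may
need loosening for messier pages.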



