SGML parsing tags and keeping track

hapaboy2059 at gmail.com hapaboy2059 at gmail.com
Tue May 2 00:12:03 EDT 2006


Hello,

I need help using sgmllib's SGMLParser to parse an HTML file and keep
track of the number of times each tag is used.

At the end of the program I need to print the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.

I need help getting past the first steps.  I already have this basic
program that returns hyperlinks, but I can't work out how to handle
arbitrary tags and keep a count of each one to print later.

I'm very frustrated, and any help is appreciated!



--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks


parser = HtmParser()

inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs

content = urllib.urlopen(inptAdrs)

bufff = content.read()
print 'Statistics for ', inptAdrs

print 'There are', len(bufff), 'characters in the web page'

parser.feed(bufff)
parser.close()  # flush any buffered data before reading the results

print parser.get_hyperlinks()


---------------------------------------------------------------------------------
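
For reference, a minimal sketch of one way the tag counting and linked
text described above could be done. It is not sgmllib code: it assumes
Python 3, where sgmllib no longer exists, and swaps in
html.parser.HTMLParser, whose handle_starttag is called for every tag.
In sgmllib the comparable catch-all hook is unknown_starttag, which is
only called for tags that have no start_xxx or do_xxx method of their
own. The class name TagCounter, the _link_text helper and the sample
HTML are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag and collect the text inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()    # tag name -> times seen
        self.hyperlinks = []           # href values, in document order
        self.descriptions = []         # linked text, one entry per <a>
        self.inside_a_element = False
        self._link_text = []           # hypothetical helper: text pieces of the current <a>

    def handle_starttag(self, tag, attrs):
        # Called for every start tag; tag names arrive lowercased.
        self.tag_counts[tag] += 1
        if tag == "a":
            self.inside_a_element = True
            self._link_text = []
            for name, value in attrs:
                if name == "href":
                    self.hyperlinks.append(value)

    def handle_endtag(self, tag):
        # Closing </a> ends the current link description.
        if tag == "a" and self.inside_a_element:
            self.descriptions.append("".join(self._link_text).strip())
            self.inside_a_element = False

    def handle_data(self, data):
        # Text only matters here while we are inside a link.
        if self.inside_a_element:
            self._link_text.append(data)


# Made-up sample input so the sketch runs without network access.
sample = ('<html><body><p>One</p>'
          '<p>Two <a href="http://example.com/">a link</a></p>'
          '</body></html>')

parser = TagCounter()
parser.feed(sample)
parser.close()

for tag, count in sorted(parser.tag_counts.items()):
    print(tag, count)
print(parser.descriptions)
```

Keeping the counts in a single Counter means no per-tag method is
needed, so any tag that appears in the page gets counted.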

Any help is much appreciated.



