HTMLLib.py use

Fredrik Lundh fredrik at pythonware.com
Wed May 5 03:15:08 EDT 1999


Matthew Cepl <cepl at fpm.cz> wrote:
> OK, now it's much better, there is no error message. But still, there is no
> output from the script. I would like to get just the description from the
> DESCRIPTION meta tag of a given HTML page.

you're close.  but your code cleared the description
attribute for each new meta tag, so any meta tag that
came after the description would wipe it out again.
it also assumed that the meta attributes were given
in a predefined order, which is not always the case.

see below for a fixed version.
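
to make the two problems concrete, here's a minimal sketch of just the
attribute handling, with no parser involved.  the helper name
meta_description and the sample attribute lists are made up for
illustration; the only assumption is that the parser hands over the
attributes as a list of (key, value) pairs.

```python
def meta_description(attributes, current=""):
    """Return the description from one meta tag, or `current` if the
    tag is not a description tag."""
    name = content = ""
    # scan every pair, so the order of name/content doesn't matter
    for key, value in attributes:
        if key == "name":
            name = value
        elif key == "content":
            content = value
    if name.lower() == "description":
        return content
    # some other meta tag: keep whatever we already had
    return current

description = ""
for attrs in [
    [("content", "a test page"), ("name", "Description")],  # content first
    [("name", "keywords"), ("content", "python, html")],    # not a description
]:
    description = meta_description(attrs, description)

print(description)
```

the second meta tag leaves the description alone, and the first one
is picked up even though content comes before name.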

> BTW, when I need the content of the TITLE element, should it be done
> via start_title() or how?

htmllib does that for you; just look in the title attribute.

</F>

from htmllib import HTMLParser
import formatter
import string, sys

class WPage(HTMLParser):

    def __init__(self, verbose=0):
        self.description = ""
        HTMLParser.__init__(self, formatter.NullFormatter(), verbose)

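    def do_meta(self, attributes):
        # attributes is a list of (key, value) pairs, in whatever
        # order they appeared in the tag
        name = content = ""
        for key, value in attributes:
            if key == "name":
                name = value
            elif key == "content":
                content = value
        # only touch the description for the matching meta tag
        if string.lower(name) == "description":
            self.description = content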

    def close(self):
        HTMLParser.close(self)

def test(args = None):

    filename = 'test.htm'
    try:
        f = open(filename, 'r')
    except IOError, msg:
        print filename, ":", msg
        sys.exit(1)
    data = f.read()
    f.close()
    x = WPage()
    x.feed(data)
    x.close() # flush any buffered data before reading the results
    print "description =", x.description
    print "title =", x.title

if __name__ == '__main__':
    test()
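
htmllib was removed in python 3, so the script above only runs on
python 1.5/2.x.  as a rough sketch (not a drop-in replacement), the
same idea can be written against html.parser from the python 3
standard library.  that parser doesn't collect the title for you, so
this version gathers it by hand; the class name MetaPage and the
sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class MetaPage(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.description = ""
        self.title = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # dict() makes the attribute order irrelevant
            d = dict(attrs)
            name = d.get("name") or ""
            if name.lower() == "description":
                self.description = d.get("content") or ""
        elif tag == "title":
            self._in_title = True
            self._title_parts = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            # join the collected text and collapse runs of whitespace
            self.title = " ".join("".join(self._title_parts).split())

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

page = MetaPage()
page.feed('<html><head><title>Test  page</title>'
          '<meta content="a short summary" name="DESCRIPTION">'
          '<meta name="keywords" content="python, html">'
          '</head><body><p>hello</p></body></html>')
page.close()
print("description =", page.description)
print("title =", page.title)
```

as in the htmllib version, the description is only overwritten when
the meta tag's name matches, so the later keywords tag leaves it alone.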





