Trouble with htmllib.HTMLParser

Sun Nov 12 05:42:03 EST 2000

Jeremy Fincher wrote:
> I've used HTML parsing libraries in other languages (read: Perl) and
> I've always simply inherited from an HTML Parsing class, and overridden
> the functions that interest me.  I'm not having as easy a time in
> Python; one thing I've had particular trouble with in reading the
> documentation for htmllib.HTMLParser is finding out how CDATA (i.e., the
> stuff between the start and end tags) is passed to my class.
>
> Do I have to use a formatter with HTMLParser?  I'm not planning on
> actually outputting anything; it's mostly to enter information into a
> database.

if that's the case, use sgmllib.SGMLParser instead; it needs no
formatter, and hands the text between tags to your handle_data method.
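
To make the callback model concrete, here is a minimal sketch of the order in which a parser of this kind calls your methods. It assumes a current Python 3, where sgmllib is no longer in the standard library, so html.parser.HTMLParser stands in for it (one handle_starttag/handle_endtag pair instead of per-tag start_xxx/end_xxx methods). The EventLogger class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records every callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        # text between tags arrives here; long runs may come in pieces
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

p = EventLogger()
p.feed("<b>hello</b>")
p.close()
print(p.events)
# → [('start', 'b'), ('data', 'hello'), ('end', 'b')]
```

No formatter appears anywhere: the subclass decides what to do with each piece of text, which could just as well be a database insert.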

> Are there any resources/example code other than the Library Reference?
> I haven't been able to find any.

here's one:

# sgmllib-example-1.py
# from (the eff-bot guide to) The Python Standard Library
# http://www.pythonware.com/people/fredrik/librarybook.htm

import sgmllib
import string

class FoundTitle(Exception):
    pass

class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

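    def handle_data(self, data):
        # called with each run of text; only keep it while inside <title>
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        # <title> seen: start collecting text
        self.data = []

    def end_title(self):
        # </title> seen: join the collected pieces
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!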

def extract(file):
    # extract title from an HTML/SGML stream
    p = ExtractTitle()
    try:
        while 1:
            # read small chunks
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

#
# try it out

print "html", "=>", extract(open("samples/sample.htm"))
print "sgml", "=>", extract(open("samples/sample.sgm"))

## html => A Title.
## sgml => Quotations
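
For readers on Python 3, where sgmllib and the print statement are gone, the same technique can be sketched with html.parser.HTMLParser. This is an adaptation under that assumption, not the book's code: the per-tag start_title/end_title methods become tag checks inside handle_starttag/handle_endtag, and the in-memory sample document replaces the sample files, which are not reproduced here.

```python
import io
from html.parser import HTMLParser

class FoundTitle(Exception):
    pass

class ExtractTitle(HTMLParser):

    def __init__(self):
        super().__init__()
        self.title = self.data = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.data = []          # start collecting text

    def handle_data(self, data):
        if self.data is not None:   # only collect while inside <title>
            self.data.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.data is not None:
            self.title = "".join(self.data)
            raise FoundTitle        # abort parsing!

def extract(file):
    # extract title from an HTML stream, reading small chunks
    p = ExtractTitle()
    try:
        while True:
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

# hypothetical sample document, held in memory
sample = io.StringIO(
    "<html><head><title>A Title.</title></head><body>text</body></html>"
)
print("html", "=>", extract(sample))
# → html => A Title.
```

Raising an exception from the end-tag handler keeps the early-exit trick of the original: parsing stops as soon as the title is complete, so the rest of the stream is never read.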

</F>
