Trouble with htmllib.HTMLParser
Fredrik Lundh
fredrik at effbot.org
Sun Nov 12 05:42:03 EST 2000
Jeremy Fincher wrote:
> I've used HTML parsing libraries in other languages (read: Perl) and
> I've always simply inherited from an HTML Parsing class, and overridden
> the functions that interest me. I'm not having as easy a time in
> python; one thing I've had particular trouble with in reading the
> documentation for htmllib.HTMLParser is finding out how CDATA (ie, the
> stuff between the start and end tags) is passed to my class.
>
> Do I have to use a formatter with HTMLParser? I'm not planning on
> actually outputting anything; it's mostly to enter information into a
> database.

if that's the case, use sgmllib.SGMLParser instead.

> Are there any resources/example code other than the Library Reference?
> I haven't been able to find any.

here's one:

# sgmllib-example-1.py
# from (the eff-bot guide to) The Python Standard Library
# http://www.pythonware.com/people/fredrik/librarybook.htm

import sgmllib
import string

class FoundTitle(Exception):
    pass

class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle # abort parsing!

def extract(file):
    # extract title from an HTML/SGML stream
    p = ExtractTitle()
    try:
        while 1:
            # read small chunks
            s = file.read(512)
            if not s:
                break
            p.feed(s)
        p.close()
    except FoundTitle:
        return p.title
    return None

#
# try it out

print "html", "=>", extract(open("samples/sample.htm"))
print "sgml", "=>", extract(open("samples/sample.sgm"))

## html => A Title.
## sgml => Quotations
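
for readers on Python 3, where sgmllib is no longer part of the standard library, roughly the same idea can be sketched with html.parser.HTMLParser. this is only an illustration, not part of the original example: the names TitleExtractor and extract_title are made up here, and instead of per-tag start_title/end_title methods the parser calls the generic handle_starttag/handle_endtag hooks. the text between the tags still arrives through handle_data, possibly in several pieces, so it is collected in a list and joined at the end tag.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found between <title> and </title>.

    Hypothetical helper class, written for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.title = None
        self._chunks = None  # a list while inside <title>, otherwise None

    def handle_starttag(self, tag, attrs):
        # tag names are passed in lower case
        if tag == "title" and self.title is None:
            self._chunks = []

    def handle_data(self, data):
        # text may arrive in several calls, so accumulate it
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._chunks is not None:
            self.title = "".join(self._chunks)
            self._chunks = None


def extract_title(text):
    # return the first title in an HTML string, or None if there is none
    parser = TitleExtractor()
    parser.feed(text)
    parser.close()
    return parser.title
```

unlike the sgmllib version, this sketch does not raise an exception to abort parsing early; it simply ignores everything after the first title.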
</F>
<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->