(htmllib) How to capture text that includes tags?
jennyw
jennyw at dangerousideas.com
Wed Nov 5 01:34:10 EST 2003
I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(for the product description, the tag is font).
To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:
<font size="1">
This rectangle measures 7" x 3".
</font>
comes back with the quotation marks dropped:
This rectangle measures 7 x 3.
I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?
This is the relevant portion of the class I'm using so far:
import htmllib
import re

class myHTMLParser(htmllib.HTMLParser):
    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)
        self

    def start_font(self, attrs):
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if text:
            if re.search("\\.\\s*$", text):
                print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when data was nothing.
    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.
        If the nofill flag is false, whitespace is collapsed to single
        spaces. A call to this method without a preceding call to the
        save_bgn() method will raise a TypeError exception.
        """
        data = self.savedata
        self.savedata = None
        if data:
            if not self.nofill:
                data = ' '.join(data.split())
        return data
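For comparison, here is a rough sketch of one way to keep the markup between
the font tags. It is not the htmllib approach above: it assumes Python 3's
html.parser module (htmllib was removed in Python 3), and the class name
FontCapturingParser and its depth, buffer and descriptions attributes are
made up for illustration. The idea is to copy every event seen inside a font
element back into a buffer in its source form, instead of relying on
save_bgn() and save_end().

```python
from html.parser import HTMLParser


class FontCapturingParser(HTMLParser):
    """Collects the raw contents of each <font> element, markup included.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        # convert_charrefs=False reports entities such as &quot; as separate
        # events, so they can be copied into the buffer verbatim.
        super().__init__(convert_charrefs=False)
        self.depth = 0          # nesting depth of open <font> tags
        self.buffer = []        # raw pieces of the current description
        self.descriptions = []  # finished descriptions, in document order

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested markup: keep the tag exactly as it appeared in the source.
            self.buffer.append(self.get_starttag_text())
        if tag == "font":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.depth:
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost </font>: collapse whitespace and store the result.
                text = " ".join("".join(self.buffer).split())
                self.buffer = []
                if text:
                    self.descriptions.append(text)
                return
        if self.depth:
            self.buffer.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.buffer.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.buffer.append("&#%s;" % name)


parser = FontCapturingParser()
parser.feed('<font size="1">\nThis rectangle measures 7" x 3".\n</font>')
parser.close()
for description in parser.descriptions:
    print("Probably a product description: " + description)
```

The same regular expression check used in end_font() above could be applied
to each entry in descriptions before printing it.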
Thanks!
Jen