(htmllib) How to capture text that includes tags?
jennyw at dangerousideas.com
Wed Nov 5 07:34:10 CET 2003
I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(in the case of the product description, the tag is font).
To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:
    This rectangle measures 7" x 3".
drops the quotation marks, resulting in:
    This rectangle measures 7 x 3.
I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?
This is the relevant portion of the class I'm using so far:
    def start_font(self, attrs):
        # Start buffering character data when a font tag opens.
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if re.search("\\.\\s*$", text):
            print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when data was nothing.
    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.

        If the nofill flag is false, whitespace is collapsed to single
        spaces. A call to this method without a preceding call to the
        save_bgn() method will raise a TypeError exception.
        """
        data = self.savedata
        self.savedata = None
        if not data:
            # Nothing was buffered, so there is nothing to collapse.
            return ''
        if not self.nofill:
            data = ' '.join(data.split())
        return data
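
One direction that might work is to stop relying on save_bgn()/save_end() and instead re-emit every tag and entity reference seen while inside a font element. The sketch below is only an illustration under assumptions: it swaps in the html.parser module from Python 3 rather than htmllib, and the FontCapture class, its depth/parts/captured attributes, and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class FontCapture(HTMLParser):
    """Hypothetical helper: collects the contents of each <font>
    element with inner tags and entity references left intact."""

    def __init__(self):
        # convert_charrefs=False makes references such as &quot; arrive
        # as separate events, so they can be copied through unchanged.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # nesting level of open <font> tags
        self.parts = []     # pieces of the element being captured
        self.captured = []  # one string per finished outer <font> element

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.depth += 1
            if self.depth == 1:
                self.parts = []
                return  # leave the outer <font ...> tag itself out
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, as written.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.captured.append("".join(self.parts))
                return
        if self.depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#%s;" % name)


# Sample markup invented for this example.
sample = ('<a name="Rectangle"></a><font size="2">This <b>rectangle</b> '
          'measures 7&quot; x 3&quot;.</font>')
parser = FontCapture()
parser.feed(sample)
parser.close()
print(parser.captured)
```

Each entry in parser.captured should then hold one description with its inner tags and &quot; references as they appeared in the source, so the trailing-period check can still be applied to it afterwards.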