(htmllib) How to capture text that includes tags?

jennyw jennyw at dangerousideas.com
Wed Nov 5 07:34:10 CET 2003


I'm trying to parse a product catalog written in HTML.  Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(in the case of the product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only 
includes text that isn't marked up.  For example, this product 
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

loses its quotation marks, resulting in:

	This rectangle measures 7 x 3.

I've been looking through Google Groups but haven't found a way to get 
the markup in between the tags. Any suggestions?

This is the relevant portion of the class I'm using so far:

import htmllib
import re

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        # f is the formatter object that htmllib.HTMLParser requires
        htmllib.HTMLParser.__init__(self, f)

    def start_font(self, attrs):
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if text:
            # Heuristic: a description ends with a period
            if re.search(r"\.\s*$", text):
                print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when no data had been saved.

    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.

        If the nofill flag is false, whitespace is collapsed to single
        spaces.  Unlike the base class version, a call to this method
        without a preceding call to the save_bgn() method returns None
        instead of raising an exception.

        """
        data = self.savedata
        self.savedata = None
        if data:
            if not self.nofill:
                data = ' '.join(data.split())
        return data
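
For comparison, here is a rough sketch of a different approach that keeps
the markup. It does not use htmllib's save_bgn()/save_end() at all: it
swaps in the standard HTMLParser module (html.parser in Python 3) and
re-emits tags and entity references as raw text while inside a font
element. The class name FontCapture and its depth, pieces, and captured
attributes are made up for illustration, and it has not been run against
the real catalog.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.2 and later


class FontCapture(HTMLParser):
    """Collect the raw contents of each <font> element, markup included.

    Hypothetical helper, not part of htmllib or the original class.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3 decodes entities by default; turn that off so that
        # handle_entityref()/handle_charref() are called. Harmless on Python 2.
        self.convert_charrefs = False
        self.depth = 0       # nesting level of open <font> tags
        self.pieces = []     # raw fragments of the element being captured
        self.captured = []   # finished descriptions

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Inside a font element: keep the tag exactly as written
            self.pieces.append(self.get_starttag_text())
        if tag == "font":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written
        if self.depth:
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.captured.append("".join(self.pieces).strip())
                self.pieces = []
                return
        if self.depth:
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)

    def handle_entityref(self, name):
        # Put the entity reference back instead of decoding it
        if self.depth:
            self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.pieces.append("&#%s;" % name)
```

To try it, create a FontCapture, pass the page text to feed(), call
close(), and read the captured list. The same period-at-the-end check from
end_font() could then be applied to each captured string.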

Thanks!

Jen




