(htmllib) How to capture text that includes tags?

Peter Otten __peter__ at web.de
Wed Nov 5 05:23:36 EST 2003


jennyw wrote:

> I'm trying to parse a product catalog written in HTML.  Some of the
> information I need is in tag attributes (like the product name, which
> is in an anchor). Some (like the product description) is between tags
> (in the case of the product description, the tag is font).
> 
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end() only
> includes text that isn't marked up.  For example, this product
> description:
> 
> <font size="1">
> This rectangle measures 7" x 3".
> </font>
> 
> Drops the quotation marks, resulting in:
> 
> This rectangle measures 7 x 3.
> 
> I've been looking through Google Groups but haven't found a way to get
> the markup in between the tags. Any suggestions?
> 
> This is the relevant portion of the class I'm using so far:
> 
> class myHTMLParser(htmllib.HTMLParser):
> 
>     def __init__(self,f):
>         htmllib.HTMLParser.__init__(self, f)
>         self
> 
>     def start_font(self, attrs):
>         self.save_bgn()
> 
>     def end_font(self):
>         text = self.save_end()
>         if text:
>             if re.search("\\.\\s*$", text):
>                 print "Probably a product description: " + text
> 
> # I needed to override save_end because it was having trouble
> # when data was nothing.
> 
>     def save_end(self):
>         """Ends buffering character data and returns all data saved since
>         the preceding call to the save_bgn() method.
> 
>         If the nofill flag is false, whitespace is collapsed to single
>         spaces.  A call to this method without a preceding call to the
>         save_bgn() method will raise a TypeError exception.
> 
>         """
>         data = self.savedata
>         self.savedata = None
>         if data:
>             if not self.nofill:
>                 data = ' '.join(data.split())
>         return data
> 
> Thanks!
> 
> Jen

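I've found the parser in the HTMLParser module to be a lot easier to use.
Below is a rough equivalent of your posted code. Entity references are
looked up in htmlentitydefs.entitydefs, and unknown ones are replaced
by "?". In the general case you will want to keep a stack of open tags
instead of the simple infont flag, which cannot cope with nested font tags.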

import HTMLParser, htmlentitydefs

class CatalogParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.infont = False
        self.text = []

    def handle_starttag(self, tag, atts):
        if tag == "font":
            assert not self.infont
            self.infont = True

    def handle_entityref(self, name):
        if self.infont:
            self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "font":
            assert self.infont
            self.infont = False
            if self.text:
                print "".join(self.text)
            # start afresh so the next <font> does not repeat this text
            self.text = []

data = """
<html>
<body>
    <h1>"Ignore me"</h1>
    <font size="1">
    This &wuerg; rectangle measures 7" x 3".
    </font>
</body>
</html>
"""
p = CatalogParser()
p.feed(data)
p.close()
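
For the general case, here is a minimal sketch of the stack-of-tags idea.
It is written for Python 3's html.parser (the renamed HTMLParser module),
which is an assumption on my part since the code above targets Python 2.
The class name StackCatalogParser and the attributes open_tags, buffers
and descriptions are made up for illustration. Entities such as &quot;
are decoded by the parser itself via convert_charrefs=True.

```python
from html.parser import HTMLParser


class StackCatalogParser(HTMLParser):
    """Collect the text of every <font> element, allowing nested tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entity and
        # character references before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.open_tags = []     # names of the currently open tags
        self.buffers = []       # one list of text chunks per open <font>
        self.descriptions = []  # finished <font> texts, in closing order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "font":
            self.buffers.append([])

    def handle_data(self, data):
        # Text belongs to every <font> that is open right now.
        for chunks in self.buffers:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag without a matching start tag
        # Pop until the matching start tag is found; this also discards
        # tags that were never closed explicitly, such as <br>.
        while self.open_tags:
            closed = self.open_tags.pop()
            if closed == "font":
                # Collapse runs of whitespace, as save_end() does.
                text = " ".join("".join(self.buffers.pop()).split())
                if text:
                    self.descriptions.append(text)
            if closed == tag:
                break


sample = """
<html>
<body>
    <h1>&quot;Ignore me&quot;</h1>
    <font size="1">
    This rectangle measures 7&quot; x 3&quot;.
    </font>
</body>
</html>
"""
parser = StackCatalogParser()
parser.feed(sample)
parser.close()
print(parser.descriptions)
```

With nested font tags the inner text is reported first, followed by the
outer text, which includes the inner text. Whether that is the behaviour
you want depends on how the catalog is marked up.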

Peter




