(htmllib) How to capture text that includes tags?
Peter Otten
__peter__ at web.de
Wed Nov 5 05:23:36 EST 2003
jennyw wrote:
> I'm trying to parse a product catalog written in HTML. Some of the
> information I need are attributes of tags (like the product name, which
> is in an anchor). Some (like product description) are between tags
> (in the case of product description, the tag is font).
>
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end() only
> includes text that isn't marked up. For example, this product
> description:
>
> <font size="1">
> This rectangle measures 7" x 3".
> </font>
>
> Drops the quotation marks, resulting in:
>
> This rectangle measures 7 x 3.
>
> I've been looking through Google Groups but haven't found a way to get
> the markup in between the tags. Any suggestions?
>
> This is the relevant portion of the class I'm using so far:
>
> class myHTMLParser(htmllib.HTMLParser):
>
>     def __init__(self,f):
>         htmllib.HTMLParser.__init__(self, f)
>         self
>
>     def start_font(self, attrs):
>         self.save_bgn()
>
>     def end_font(self):
>         text = self.save_end()
>         if text:
>             if re.search("\\.\\s*$", text):
>                 print "Probably a product description: " + text
>
>     # I needed to override save_end because it was having trouble
>     # when data was nothing.
>
>     def save_end(self):
>         """Ends buffering character data and returns all data saved since
>         the preceding call to the save_bgn() method.
>
>         If the nofill flag is false, whitespace is collapsed to single
>         spaces. A call to this method without a preceding call to the
>         save_bgn() method will raise a TypeError exception.
>
>         """
>         data = self.savedata
>         self.savedata = None
>         if data:
>             if not self.nofill:
>                 data = ' '.join(data.split())
>         return data
>
> Thanks!
>
> Jen
I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag.
import HTMLParser, htmlentitydefs

class CatalogParser(HTMLParser.HTMLParser):
    # Maps entity names such as "quot" to their replacement text.
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.infont = False
        self.text = []

    def handle_starttag(self, tag, atts):
        if tag == "font":
            assert not self.infont
            self.infont = True

    def handle_entityref(self, name):
        # Unknown entities are replaced with "?".
        if self.infont:
            self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "font":
            assert self.infont
            self.infont = False
            if self.text:
                print "".join(self.text)
            # Start with an empty buffer for the next font element.
            self.text = []

data = """
<html>
<body>
<h1>"Ignore me"</h1>
<font size="1">
This &wuerg; rectangle measures 7" x 3".
</font>
</body>
</html>
"""

p = CatalogParser()
p.feed(data)
p.close()
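
The stack of tags mentioned above could look roughly like the sketch below. It is not tested against a real catalog, and it makes a few assumptions: it is written for Python 3, where the HTMLParser module lives in html.parser and entities are decoded before handle_data() is called; the class name StackCatalogParser and the attributes stack, buffers and descriptions are made up for this example; and text inside a nested font element is counted for the outer element as well.

```python
from html.parser import HTMLParser


class StackCatalogParser(HTMLParser):  # hypothetical name, for illustration only
    """Collect the text inside each <font> element using a stack of open tags."""

    def __init__(self):
        # The default convert_charrefs=True delivers entities already decoded.
        super().__init__()
        self.stack = []         # names of currently open tags, innermost last
        self.buffers = []       # one list of text chunks per open <font>
        self.descriptions = []  # finished <font> texts, in closing order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "font":
            self.buffers.append([])

    def handle_data(self, data):
        # Text belongs to every <font> that is currently open.
        for buf in self.buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, ignore it
        # Pop up to and including the matching start tag. This also
        # discards unclosed tags such as <br> that sit above it.
        while True:
            popped = self.stack.pop()
            if popped == "font":
                # Collapse runs of whitespace, as save_end() does.
                text = " ".join("".join(self.buffers.pop()).split())
                if text:
                    self.descriptions.append(text)
            if popped == tag:
                break


p = StackCatalogParser()
p.feed('<font size="1">This rectangle measures 7&quot; x 3&quot;.</font>')
p.close()
print(p.descriptions)
```

Unlike the infont flag, this does not trip an assertion on nested or unclosed tags; whether an outer font element should really include the text of an inner one depends on how the catalog is marked up.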
Peter