(htmllib) How to capture text that includes tags?

Mathias Waack M.Waack at gmx.de
Wed Nov 5 03:21:11 EST 2003


jennyw wrote:

> I'm trying to parse a product catalog written in HTML.  Some of the
> information I need are attributes of tags (like the product name,
> which is in an anchor). Some (like product description) are between
> tags (in the case of product description, the tag is font).
> 
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end()
> only includes text that isn't marked up.  For example, this product
> description:
> 
> <font size="1">
> This rectangle measures 7" x 3".
> </font>
> 
> Drops the quotation marks, resulting in:
> 
> This rectangle measures 7 x 3.

And what's the problem? HTML code produced by broken software like
FrontPage often contains unnecessary quotes - why do you want to
preserve this crap?

If you want to escape special characters you can use
xml.sax.saxutils.escape() or just write your own function (escape()
is only a two-liner).
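
A minimal sketch of both options, in case it helps. The sample string and
the name escape_by_hand are made up for illustration. Note that escape()
only handles &, < and > by default; quotation marks are escaped only if
you pass them in the optional entities dictionary.

```python
from xml.sax.saxutils import escape


def escape_by_hand(text):
    """Hand-written equivalent of escape() with no extra entities."""
    # Replace "&" first so the entities added afterwards are not escaped again.
    text = text.replace("&", "&amp;")
    return text.replace("<", "&lt;").replace(">", "&gt;")


# Hypothetical product description, not taken from the real catalog.
description = 'This rectangle measures 7" x 3" & is <b>flat</b>.'

# Default behaviour: quotation marks are left untouched.
escaped = escape(description)

# Passing an entities dict also turns '"' into '&quot;'.
escaped_quotes = escape(description, {'"': "&quot;"})

print(escaped)
print(escaped_quotes)
```

If the escaped text is going into an attribute value, quoteattr() from the
same module may be a better fit, since it also adds the surrounding quotes.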

Mathias



