(htmllib) How to capture text that includes tags?
Mathias Waack
M.Waack at gmx.de
Wed Nov 5 03:21:11 EST 2003
jennyw wrote:
> I'm trying to parse a product catalog written in HTML. Some of the
> information I need are attributes of tags (like the product name,
> which is in an anchor). Some (like product description) are between
> tags (in the case of product description, the tag is font).
>
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end()
> only includes text that isn't marked up. For example, this product
> description:
>
> <font size="1">
> This rectangle measures 7" x 3".
> </font>
>
> Drops the quotation marks, resulting in:
>
> This rectangle measures 7 x 3.
And what's the problem? HTML code produced by broken software like
Frontpage often contains unnecessary quotes - why do you want to
preserve this crap?
If you want to escape special characters, you can use
xml.sax.saxutils.escape() or just write your own function (escape is
only a two-liner).
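A minimal sketch of both options, for illustration only. The sample
string and the helper name my_escape are made up for this example. By
default, xml.sax.saxutils.escape() handles only &, < and >; double
quotes are escaped only if you pass them in the optional entities
dictionary.

```python
from xml.sax.saxutils import escape


def my_escape(text):
    """Hypothetical hand-written equivalent of the default escape().

    '&' must be replaced first, otherwise the '&' in '&lt;' and '&gt;'
    would itself be escaped a second time.
    """
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text.replace(">", "&gt;")


# Made-up sample text containing inch marks and markup characters.
description = 'This rectangle measures 7" x 3" & is <new>.'

# Default behaviour: quotes are left untouched.
escaped_default = escape(description)

# Extra entities are supplied as a mapping of character -> replacement.
escaped_quotes = escape(description, {'"': "&quot;"})

print(escaped_default)
print(escaped_quotes)
```

Replacing & before the other characters is the only subtle point in
the hand-written version.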
Mathias