[Python-Dev] sgmllib Comments
"Martin v. Löwis"
martin at v.loewis.de
Mon Jun 12 07:06:45 CEST 2006
Sam Ruby wrote:
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).
I think (but am not sure) you are referring to patch #1462498 here,
which fixes bugs 1452246 and 1087808.
> * it will unescape &amp;
> * it won't unescape &copy;
That must be because you have amp in your entitydefs, but not copy.
> * it will unescape &#38;
> * it won't unescape &#x26;
That's because it doesn't recognize hex character references.
That's systematic, though: it doesn't just ignore them in attribute
values, but also in content.
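
To illustrate what recognizing both forms involves, here is a minimal standalone sketch in Python 3. It is not sgmllib's code; the names CHARREF and unescape_charrefs are hypothetical, and it assumes Python 3, where chr() covers the whole Unicode range.

```python
import re

# Matches decimal (&#38;) and hexadecimal (&#x26;) character references.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def unescape_charrefs(text):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Out-of-range reference: leave the original text untouched.
            return match.group(0)
    return CHARREF.sub(replace, text)
```

With this sketch, the decimal and hex forms are treated alike, in attribute values and content both, since it operates on plain strings.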
> * it will unescape &#146;
> * it won't unescape &#8217;
That's because the value is larger than 255, so chr() fails.
> There are a number of issues here. While not unescaping anything is
> suboptimal, at least the recipient is aware of exactly which characters
> have been unescaped (i.e., none of them). The proposed solution makes
> it impossible for the recipient to know which characters are unescaped,
> and which are original. (Note: feeds often contain such abominations as
> &amp;copy; which the new code will treat indistinguishably from &copy;)
The recipient should then add &copy; to entitydefs; sgmllib will
unescape copy, so the recipient can know not to unescape that.
Alternatively, the recipient could provide an empty entitydefs.
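
A rough sketch of how the contents of entitydefs decide what gets unescaped, written as standalone Python 3 rather than against sgmllib itself. The function unescape_entities and the pattern ENTITYREF are hypothetical names used only for illustration.

```python
import re

# Matches named entity references such as &amp; or &copy;.
ENTITYREF = re.compile(r"&([a-zA-Z][a-zA-Z0-9]*);")

def unescape_entities(text, entitydefs):
    """Replace only the entities listed in entitydefs; leave the rest as-is."""
    def replace(match):
        name = match.group(1)
        # Unknown entities are returned unchanged.
        return entitydefs.get(name, match.group(0))
    # re.sub makes a single pass, so replacement text is not rescanned.
    return ENTITYREF.sub(replace, text)

text = "&amp;copy; &copy;"

# Only amp defined: both inputs come out looking the same.
only_amp = unescape_entities(text, {"amp": "&"})

# amp and copy defined: the two inputs stay distinguishable.
both = unescape_entities(text, {"amp": "&", "copy": "\u00a9"})

# Empty entitydefs: nothing is unescaped at all.
untouched = unescape_entities(text, {})
```

Here only_amp shows the ambiguity described in the quoted text, while both and untouched correspond to the two ways out suggested above.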
> Additionally, there is a unicode issue here - one that is shared by
> handle_charref, but at least that method is overrideable. If unescaping
> remains, do it for hex character references and for values greater than
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.
Alternatively, a callback function could be provided for character
references. Unfortunately, the existing callback is unsuitable,
as it is supposed to do the full processing; this callback should
return the replacement text. Generally assuming Unicode would be
wrong, though.
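
As a sketch of what such a callback could look like (standalone Python 3, not an existing sgmllib interface): the caller supplies a function that maps a character number to replacement text, or returns None to leave the reference alone. The names unescape_with_callback, convert_charref and latin1_only are hypothetical.

```python
import re

# Decimal character references only, to keep the sketch short.
CHARREF = re.compile(r"&#([0-9]+);")

def unescape_with_callback(text, convert_charref):
    """Ask convert_charref for the replacement text of each reference."""
    def replace(match):
        replacement = convert_charref(int(match.group(1)))
        # None means "do not unescape"; keep the original reference.
        return match.group(0) if replacement is None else replacement
    return CHARREF.sub(replace, text)

def latin1_only(codepoint):
    """Example policy: convert only values that fit in 8 bits."""
    return chr(codepoint) if codepoint < 256 else None
```

The policy stays with the caller: passing latin1_only leaves &#8217; untouched, while a caller that wants Unicode can pass a function that converts every value.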
Would you like to contribute a patch?
Regards,
Martin