[Python-Dev] sgmllib Comments

Mon Jun 12 07:06:45 CEST 2006

Sam Ruby wrote:
> Planet is a feed aggregator written in Python.  It depends heavily on 
> SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib, 
> and I've submitted a test case and a patch[1] (use or discard the patch, 
> it is the test that I care about).

I think (but am not sure) you are referring to patch #1462498 here,
which fixes bugs 1452246 and 1087808.

>   * it will unescape  &amp;
>   * it won't unescape &copy;

That must be because you have amp in your entitydefs, but not copy.

>   * it will unescape  &#38;
>   * it won't unescape &#x26;

That's because it doesn't recognize hex character references.
That's systematic, though: it doesn't just ignore them in attribute
values, but also in content.
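
To make the missing piece concrete, here is a minimal sketch of
recognizing both forms. It is standalone Python 3 code using only re,
not a patch against sgmllib (which no longer exists in Python 3), and
the names CHARREF and charref_to_codepoint are made up for
illustration.

```python
import re

# Hypothetical pattern: group 1 captures hex digits (&#x26;),
# group 2 captures decimal digits (&#38;).
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def charref_to_codepoint(ref):
    """Return the code point named by a character reference, or None
    if ref is not a numeric character reference at all."""
    match = CHARREF.fullmatch(ref)
    if match is None:
        return None
    hex_digits, dec_digits = match.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)
    return int(dec_digits, 10)
```

With this, &#38; and &#x26; both map to the same code point, while a
named reference such as &copy; is left for the entity machinery.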

>   * it will unescape  &#146;
>   * it won't unescape &#8217;

That's because the value is larger than 255, so chr() fails.

> There are a number of issues here.  While not unescaping anything is 
> suboptimal, at least the recipient is aware of exactly which characters 
> have been unescaped (i.e., none of them).  The proposed solution makes 
> it impossible for the recipient to know which characters are unescaped, 
> and which are original.  (Note: feeds often contain such abominations as 
> &amp;copy; which the new code will treat indistinguishably from &copy;)

The recipient should then add copy to entitydefs; sgmllib will then
unescape &copy; itself, so the recipient knows not to unescape it again.

Alternatively, the recipient could provide an empty entitydefs.
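
A rough sketch of how the mapping controls the outcome. This is
standalone Python 3 code that only imitates the idea of an entitydefs
dictionary; ENTITYREF and unescape_entities are hypothetical names,
not sgmllib's implementation.

```python
import re

# Hypothetical pattern for named entity references such as &amp;
ENTITYREF = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*);')

def unescape_entities(text, entitydefs):
    """Replace each &name; whose name is in entitydefs; leave all
    other references exactly as they appeared."""
    def replace(match):
        name = match.group(1)
        return entitydefs.get(name, match.group(0))
    # A single pass: replacement text is not scanned again.
    return ENTITYREF.sub(replace, text)

only_amp = {'amp': '&'}
amp_and_copy = {'amp': '&', 'copy': '\xa9'}

# With only amp defined, both inputs collapse to the same output.
print(unescape_entities('&amp;copy;', only_amp))      # &copy;
print(unescape_entities('&copy;', only_amp))          # &copy;

# With copy defined as well, the two inputs stay distinguishable.
print(unescape_entities('&amp;copy;', amp_and_copy))  # &copy;
print(unescape_entities('&copy;', amp_and_copy))      # the (c) sign

# An empty mapping unescapes nothing.
print(unescape_entities('&amp;copy;', {}))            # &amp;copy;
```

The point is that the recipient decides which names are known, and so
can tell afterwards which references have already been handled.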

> Additionally, there is a unicode issue here - one that is shared by 
> handle_charref, but at least that method is overrideable.  If unescaping 
> remains, do it for hex character references and for values greater than 
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.

Alternatively, a callback function could be provided for character
references. Unfortunately, the existing callback, handle_charref, is
unsuitable, as it is supposed to do the full processing; the new
callback should instead return the replacement text. Generally
assuming Unicode would be wrong, though.
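
One possible shape for such a callback, as a sketch only. It is
standalone Python 3 code; replace_charrefs, ascii_only and any_unicode
are hypothetical names, and none of this is an existing sgmllib
interface. The callback receives the code point and returns the
replacement text, or None to leave the reference untouched.

```python
import re

# Hypothetical pattern: group 1 is hex digits, group 2 is decimal digits.
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def replace_charrefs(text, convert):
    """Replace each numeric character reference in text with
    convert(codepoint); a None result keeps the reference as is."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            codepoint = int(hex_digits, 16)
        else:
            codepoint = int(dec_digits, 10)
        replacement = convert(codepoint)
        return match.group(0) if replacement is None else replacement
    return CHARREF.sub(replace, text)

def ascii_only(codepoint):
    # Conservative policy: only resolve references to ASCII characters.
    return chr(codepoint) if codepoint < 128 else None

def any_unicode(codepoint):
    # Permissive policy: resolve anything that is a valid code point.
    return chr(codepoint) if codepoint <= 0x10FFFF else None
```

Leaving the policy to the caller avoids building an assumption about
Unicode into the parser: an application that wants Unicode passes
something like any_unicode, and one that does not passes a stricter
function.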

Would you like to contribute a patch?

Regards,
Martin