[Python-Dev] sgmllib Comments

Sam Ruby rubys at intertwingly.net
Sun Jun 11 22:26:29 CEST 2006


Planet is a feed aggregator written in Python.  It depends heavily on 
sgmllib.  A recent bug report turned out to stem from a deficiency in 
sgmllib, and I've submitted a test case and a patch[1] (use or discard 
the patch; it is the test that I care about).

While looking around, a few things surfaced.  For starters, it would 
seem that the version of sgmllib in SVN HEAD will selectively unescape 
certain character references that might appear in an attribute.  I say 
selectively, as:

  * it will unescape  &amp;
  * it won't unescape &copy;
  * it will unescape  &#38;
  * it won't unescape &#x26;
  * it will unescape  &#146;
  * it won't unescape &#8217;

There are a number of issues here.  While not unescaping anything is 
suboptimal, at least the recipient is aware of exactly which characters 
have been unescaped (i.e., none of them).  The proposed solution makes 
it impossible for the recipient to know which characters are unescaped, 
and which are original.  (Note: feeds often contain such abominations as 
&amp;copy; which the new code will treat indistinguishably from &copy;)
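
To make the ambiguity concrete, here is a minimal sketch.  The function 
unescape_amp_only is a hypothetical stand-in for any partial unescaper 
that handles &amp; but not &copy;; it is not the code in sgmllib.

```python
def unescape_amp_only(value):
    # Hypothetical partial unescaper: rewrites &amp; and nothing else.
    return value.replace("&amp;", "&")


# Two different attribute values as they might appear in a feed.
original = "&copy; 2006"            # a real copyright entity
double_escaped = "&amp;copy; 2006"  # the literal text "&copy;", escaped

# After partial unescaping the recipient sees the same string for both,
# so it can no longer tell which one still needs to be unescaped.
assert unescape_amp_only(original) == unescape_amp_only(double_escaped)
```

A recipient that then unescapes &copy; itself gets the second value 
wrong; one that does not gets the first value wrong.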

Additionally, there is a Unicode issue here - one that is shared by 
handle_charref, but at least that method is overridable.  If unescaping 
remains, do it for hex character references and for values greater than 
8 bits, i.e., use unichr instead of chr if the value is greater than 127.
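
A rough sketch of what that could look like follows.  The name 
unescape_charrefs and the regular expression are my own assumptions for 
illustration, not part of sgmllib, and the fallback for unichr is only 
there so the snippet also runs on a Python without that builtin.

```python
import re

try:
    unichr  # available as a builtin on Python 2
except NameError:
    unichr = chr  # assumption: on Python 3, chr covers all of Unicode

# Matches decimal (&#38;) and hexadecimal (&#x26;) character references.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def unescape_charrefs(value):
    """Replace numeric character references in value (hypothetical helper)."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            # chr for ASCII, unichr for everything above 127.
            return chr(code) if code <= 127 else unichr(code)
        except (ValueError, OverflowError):
            # Out-of-range reference: leave the original text alone.
            return match.group(0)
    return _CHARREF.sub(replace, value)
```

With this, decimal and hex references are treated alike, and a value 
such as 8217 no longer falls through just because it is too large for 
chr.  Named entities are deliberately left untouched here.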

- Sam Ruby

[1] http://tinyurl.com/j4a6n

