[Python-Dev] sgmllib Comments

Mon Jun 12 02:39:37 CEST 2006

On Sunday 11 June 2006 16:26, Sam Ruby wrote:
 > Planet is a feed aggregator written in Python.  It depends heavily on
 > SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib,
 > and I've submitted a test case and a patch[1] (use or discard the patch,
 > it is the test that I care about).

And it's a nice aggregator to use, indeed!

 > While looking around, a few things surfaced.  For starters, it would
 > seem that the version of sgmllib in SVN HEAD will selectively unescape
 > certain character references that might appear in an attribute.  I say
 > selectively, as:
 >
 >   * it will unescape  &amp;
 >   * it won't unescape &copy;
 >   * it will unescape  &#38;
 >   * it won't unescape &#x26;
 >   * it will unescape  &#146;
 >   * it won't unescape &#8217;

And just why would you use sgmllib to handle RSS or ATOM feeds?  Neither is 
defined in terms of SGML.  The sgmllib documentation also notes that it isn't 
really a fully general SGML parser (it isn't), but that it exists primarily 
as a foundation for htmllib.

 > There are a number of issues here.  While not unescaping anything is
 > suboptimal, at least the recipient is aware of exactly which characters
 > have been unescaped (i.e., none of them).  The proposed solution makes
 > it impossible for the recipient to know which characters are unescaped,
 > and which are original.  (Note: feeds often contain such abominations as
 > &amp;copy; which the new code will treat indistinguishably from &copy;)
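
The ambiguity described in the quoted text can be shown with a small sketch.
The `partial_unescape` helper and its `PARTIAL` table are hypothetical and
are not sgmllib's code; they only mimic the idea of an unescaper that handles
some references and passes the rest through.

```python
import re

# Hypothetical partial unescaper (an assumption, not sgmllib's actual logic):
# only the references listed here are replaced, everything else is left alone.
PARTIAL = {"&amp;": "&", "&#38;": "&", "&#146;": "\x92"}

def partial_unescape(value):
    # One left-to-right pass, so text produced by a replacement is not rescanned.
    return re.sub(r"&#?\w+;", lambda m: PARTIAL.get(m.group(0), m.group(0)), value)

# A double-escaped reference and an untouched one come out identical.
double_escaped = partial_unescape("&amp;copy;")
left_alone = partial_unescape("&copy;")
```

Both inputs end up as the same string, so the recipient can no longer tell
whether the "&" was original or the product of unescaping.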

My suspicion is that the "right" thing to do at the sgmllib level is to 
categorize the markup and call a method depending on what kind of reference 
it is, and let that method handle it.  For SGML, that means we 
have things like &name; (entity references), &#123; (character references), 
and that's it.  &#x123; isn't legal SGML under any circumstance; 
the "&#x<number>;" syntax was introduced with XML.
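
As a rough illustration of "categorize and dispatch", here is a minimal
sketch using Python 3's html.parser (the renamed HTMLParser module; sgmllib
itself is not in Python 3).  The class name `ReferenceClassifier` and the
helper `classify` are made up for this example.  It assumes
`convert_charrefs=False`, so that references in text are reported through
`handle_entityref` and `handle_charref` instead of being unescaped.  It covers
references in character data only, not in attribute values.

```python
from html.parser import HTMLParser

class ReferenceClassifier(HTMLParser):
    """Hypothetical helper: records each reference by kind, unescapes nothing."""

    def __init__(self):
        # convert_charrefs=False keeps references out of handle_data().
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_entityref(self, name):
        # Called for &name; with the bare name.
        self.seen.append(("entity", name))

    def handle_charref(self, name):
        # Called for &#NNN; and &#xHHH; with the text after "&#".
        if name[0] in "xX":
            self.seen.append(("charref-hex", int(name[1:], 16)))
        else:
            self.seen.append(("charref-decimal", int(name)))

def classify(text):
    parser = ReferenceClassifier()
    parser.feed(text)
    parser.close()
    return parser.seen

kinds = classify("a &amp; b &copy; c &#38; d &#x26; e")
```

Each reference is reported with its kind, and the decision about whether and
how to unescape it stays with the caller.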

 > Additionally, there is a unicode issue here - one that is shared by
 > handle_charref, but at least that method is overrideable.  If unescaping
 > remains, do it for hex character references and for values greater than
 > 8-bits, i.e., use unichr instead of chr if the value is greater than 127.

For SGML, it's worse than that, since the document character set is defined in 
the SGML declaration, which is a far hairier beast than an XML 
declaration.  :-)
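
Setting the SGML declaration aside, the narrower point about decimal versus
hex references and values above 127 can be sketched as below.  This is
Python 3, where `chr` covers the full Unicode range (in Python 2 that job
fell to `unichr`).  The function name `decode_charref` is hypothetical, and
the sketch simply assumes the reference number is a Unicode code point, which
is exactly the assumption SGML does not let you make in general.

```python
def decode_charref(name):
    """Turn the text between '&#' and ';' into a one-character string.

    Assumes the number is a Unicode code point (an assumption of this sketch).
    """
    if name[:1] in ("x", "X"):
        code = int(name[1:], 16)   # XML-style hex form, e.g. "x26"
    else:
        code = int(name, 10)       # decimal form, e.g. "38"
    return chr(code)

ampersand_decimal = decode_charref("38")
ampersand_hex = decode_charref("x26")
right_quote = decode_charref("8217")
```

Note that `decode_charref("146")` yields the control character U+0092 rather
than a curly quote; mapping such values to Windows-1252 characters is a
separate policy decision.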

It really sounds like sgmllib is the wrong foundation for this.  While the 
module has some questionable behaviors, none of them are significant in its 
intended context (support for htmllib).  Now, I understand that 
RSS has historical issues, with HTML-as-practiced getting embedded as payload 
data with various flavors of escaping applied, and I'm not an expert in the 
details of that.  Have you looked at HTMLParser as an alternative to sgmllib?  
It has better support for XHTML constructs.

  -Fred

-- 
Fred L. Drake, Jr.   <fdrake at acm.org>