[Python-Dev] sgmllib Comments

Mon Jun 12 06:01:23 CEST 2006

Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
>  > Planet is a feed aggregator written in Python.  It depends heavily on
>  > SGMLLib.  A recent bug report turned out to be a deficiency in sgmllib,
>  > and I've submitted a test case and a patch[1] (use or discard the patch,
>  > it is the test that I care about).
> 
> And it's a nice aggregator to use, indeed!
> 
>  > While looking around, a few things surfaced.  For starters, it would
>  > seem that the version of sgmllib in SVN HEAD will selectively unescape
>  > certain character references that might appear in an attribute.  I say
>  > selectively, as:
>  >
>  >   * it will unescape  &amp;
>  >   * it won't unescape &copy;
>  >   * it will unescape  &#38;
>  >   * it won't unescape &#x26;
>  >   * it will unescape  &#146;
>  >   * it won't unescape &#8217;
> 
> And just why would you use sgmllib to handle RSS or ATOM feeds?  Neither is 
> defined in terms of SGML.  The sgmllib documentation also notes that it isn't 
> really a fully general SGML parser (it isn't), but that it exists primarily 
> as a foundation for htmllib.

The feed itself is read first with SAX (then with a fallback using 
sgmllib if the feed is not well formed, but that's beside the point). 
The embedded HTML portions are then processed with subclasses of 
sgmllib.

>  > There are a number of issues here.  While not unescaping anything is
>  > suboptimal, at least the recipient is aware of exactly which characters
>  > have been unescaped (i.e., none of them).  The proposed solution makes
>  > it impossible for the recipient to know which characters are unescaped,
>  > and which are original.  (Note: feeds often contain such abominations as
>  > &amp;copy; which the new code will treat indistinguishably from &copy;)
> 
> My suspicion is that the "right" thing to do at the sgmllib level is to 
> categorize the markup and call a method depending on what the entity 
> reference is, and let that handle whatever it is.  For SGML, that means we 
> have things like &name; (entity references), &#123; (character references), 
> and that's it.  &#x123; isn't legal SGML under any circumstance; 
> the "&#x<number>;" syntax was introduced with XML.

... but it effectively is valid HTML.  And, as you point out below, 
sgmllib's raison d’être is to support htmllib.
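
To make the three syntactic forms concrete, here is a minimal sketch that
sorts each reference in a string into "entity reference", "decimal character
reference", or "hex character reference".  It is not sgmllib code: the names
REFERENCE and classify_references are hypothetical, the regular expression is
a simplification of my own, and it is written for a current Python rather
than the 2006 interpreter.

```python
import re

# Hypothetical helper, not part of sgmllib: one pattern with a named group
# per syntactic category discussed in this thread.
REFERENCE = re.compile(
    r"&(?:#[xX](?P<hex>[0-9a-fA-F]+)"   # &#x26;  hex form, introduced with XML
    r"|#(?P<dec>[0-9]+)"                # &#38;   decimal character reference
    r"|(?P<name>[A-Za-z][A-Za-z0-9]*)"  # &amp;   entity reference
    r");"
)

def classify_references(text):
    """Return a list of (kind, value) pairs, one per reference in text."""
    found = []
    for match in REFERENCE.finditer(text):
        kind = match.lastgroup          # "hex", "dec", or "name"
        found.append((kind, match.group(kind)))
    return found
```

A parser built along these lines could hand each kind to its own overridable
method and leave the decision about unescaping to the subclass.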

>  > Additionally, there is a unicode issue here - one that is shared by
>  > handle_charref, but at least that method is overrideable.  If unescaping
>  > remains, do it for hex character references and for values greater than
>  > 8 bits, i.e., use unichr instead of chr if the value is greater than 127.
> 
> For SGML, it's worse than that, since the document character set is defined in 
> the SGML declaration, which is a far hairier beast than an XML 
> declaration.  :-)

understood
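
For what it is worth, the behaviour I am asking for can be sketched as
follows: decode decimal and hex character references uniformly, whatever the
size of the code point, and leave named entities alone.  This is only an
illustration, not the patch.  NUMERIC_REF and decode_numeric_refs are
hypothetical names, and the sketch assumes a Python whose chr() covers all of
Unicode (in the Python 2 under discussion that role falls to unichr).

```python
import re

# Hypothetical helper, not part of sgmllib.
# Group 1 captures hex digits, group 2 captures decimal digits.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_numeric_refs(text):
    """Replace numeric character references; leave named entities as-is."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: keep the original text untouched.
            return match.group(0)
    return NUMERIC_REF.sub(replace, text)
```

Note that this maps &#146; to the code point 146 as written; treating it as
the Windows-1252 right single quote would be a separate, deliberate choice.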

> It really sounds like sgmllib is the wrong foundation for this.  While the 
> module has some questionable behaviors, none of them are significant in 
> its intended context (support for htmllib).  Now, I understand that 
> RSS has historical issues, with HTML-as-practiced getting embedded as payload 
> data with various flavors of escaping applied, and I'm not an expert in the 
> details of that.  Have you looked at HTMLParser as an alternate to sgmllib?  
> It has better support for XHTML constructs.

HTMLParser is less forgiving, and generally less suitable for consuming 
HTML as practiced.

- Sam Ruby