[Python-Dev] sgmllib Comments
Sam Ruby
rubys at intertwingly.net
Mon Jun 12 06:01:23 CEST 2006
Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> > Planet is a feed aggregator written in Python. It depends heavily on
> > SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> > and I've submitted a test case and a patch[1] (use or discard the patch,
> > it is the test that I care about).
>
> And it's a nice aggregator to use, indeed!
>
> > While looking around, a few things surfaced. For starters, it would
> > seem that the version of sgmllib in SVN HEAD will selectively unescape
> > certain character references that might appear in an attribute. I say
> > selectively, as:
> >
> > * it will unescape &amp;
> > * it won't unescape &copy;
> > * it will unescape &#38;
> > * it won't unescape &#x26;
> > * it will unescape &#146;
> > * it won't unescape &#8217;
>
> And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is
> defined in terms of SGML. The sgmllib documentation also notes that it isn't
> really a fully general SGML parser (it isn't), but that it exists primarily
> as a foundation for htmllib.
The feed itself is read first with SAX (then with a fallback using
sgmllib if the feed is not well formed, but that's beside the point).
The embedded HTML portions are then processed with subclasses of
sgmllib.
> > There are a number of issues here. While not unescaping anything is
> > suboptimal, at least the recipient is aware of exactly which characters
> > have been unescaped (i.e., none of them). The proposed solution makes
> > it impossible for the recipient to know which characters are unescaped,
> > and which are original. (Note: feeds often contain such abominations as
> > &amp;copy; which the new code will treat indistinguishably from &copy;)
>
> My suspicion is that the "right" thing to do at the sgmllib level is to
> categorize the markup and call a method depending on what the entity
> reference is, and let that handle whatever it is. For SGML, that means we
> have things like &name; (entity references), &#123; (character references),
> and that's it. &#x123; isn't legal SGML under any circumstance;
> the "&#x<number>;" syntax was introduced with XML.
... but it effectively is valid HTML. And as you point out below
sgmllib's raison d’être is to support htmllib.
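
A minimal sketch of the categorization suggested above, written against
current Python 3 rather than the sgmllib in SVN. A regular expression
classifies each reference as a named entity, a decimal character
reference, or a hex character reference, and each kind goes to its own
handler. The class name ReferenceDispatcher and the events list are
hypothetical; only the handler names echo the existing handle_charref and
handle_entityref hooks.

```python
import re

# One pattern, three alternatives: &#xHEX; | &#DEC; | &name;
REFERENCE = re.compile(
    r"&(?:#[xX](?P<hex>[0-9a-fA-F]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[a-zA-Z][-.a-zA-Z0-9]*));"
)


class ReferenceDispatcher:
    """Hypothetical helper: records which kind of reference was seen."""

    def __init__(self):
        self.events = []

    def handle_entityref(self, name):
        # Named entity such as &copy;. The caller decides what it means.
        self.events.append(("entity", name))

    def handle_charref(self, codepoint):
        # Decimal and hex references both arrive here as an integer.
        self.events.append(("char", codepoint))

    def feed(self, text):
        for match in REFERENCE.finditer(text):
            if match.group("hex") is not None:
                self.handle_charref(int(match.group("hex"), 16))
            elif match.group("dec") is not None:
                self.handle_charref(int(match.group("dec"), 10))
            else:
                self.handle_entityref(match.group("name"))


dispatcher = ReferenceDispatcher()
dispatcher.feed("&copy; &#123; &#x123;")
print(dispatcher.events)
```

Because nothing is unescaped in place, a subclass can override either
handler and still know exactly what appeared in the source.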
> > Additionally, there is a unicode issue here - one that is shared by
> > handle_charref, but at least that method is overrideable. If unescaping
> > remains, do it for hex character references and for values greater than
> > 8 bits, i.e., use unichr instead of chr if the value is greater than 127.
>
> For SGML, it's worse than that, since the document character set is defined in
> the SGML declaration, which is a far hairier beast than an XML
> declaration. :-)
Understood.
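
To make the ambiguity concrete, here is a small sketch in current
Python 3. The function selective_unescape is a hypothetical stand-in that
mimics the behavior described earlier in this thread (a few named entities
plus decimal references in the 8-bit range); it is not the code in SVN.
For contrast, html.unescape from the Python 3 standard library decodes
decimal, hex, and wide references uniformly.

```python
import html
import re

# Hypothetical stand-in for the selective behaviour described earlier:
# a handful of named entities plus decimal references up to 255.
KNOWN = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
SELECTIVE = re.compile(r"&(?:([a-zA-Z][-.a-zA-Z0-9]*)|#([0-9]+));")


def selective_unescape(value):
    def replace(match):
        name, number = match.groups()
        if name is not None:
            # Unknown names such as "copy" are left exactly as written.
            return KNOWN.get(name, match.group(0))
        codepoint = int(number)
        # Python 3's chr covers all of Unicode; the cutoff is deliberate here.
        return chr(codepoint) if codepoint <= 255 else match.group(0)

    return SELECTIVE.sub(replace, value)


# Two different inputs collapse to the same output, so the recipient
# cannot tell which one the feed actually contained.
print(selective_unescape("&amp;copy;") == selective_unescape("&copy;"))

# Uniform decoding: decimal, hex, and values beyond 8 bits.
print(html.unescape("&#38; &#x26; &#8217;") == "& & \u2019")
```

Either extreme (decode nothing, or decode everything) leaves the
recipient knowing what they were handed; the selective middle ground does
not.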
> It really sounds like sgmllib is the wrong foundation for this. While the
> module has some questionable behaviors, none of them are significant in
> its intended context (support for htmllib). Now, I understand that
> RSS has historical issues, with HTML-as-practiced getting embedded as payload
> data with various flavors of escaping applied, and I'm not an expert in the
> details of that. Have you looked at HTMLParser as an alternate to sgmllib?
> It has better support for XHTML constructs.
HTMLParser is less forgiving, and generally less suitable for consuming
HTML as practiced.
- Sam Ruby