[Python-Dev] sgmllib Comments

Sam Ruby rubys at intertwingly.net
Mon Jun 12 07:11:15 CEST 2006

Fred L. Drake, Jr. wrote:
> On Monday 12 June 2006 00:05, Sam Ruby wrote:
>  > Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.
> Cool.
>  > I was investigating a bug in sgmllib which affected the feed parser (and
>  > therefore Planet), and noticed that there were changes in the SVN head
>  > of Python which broke three feed parser unit tests.
>  >
>  > It is my belief that these changes will break other existing users of
>  > sgmllib.
> This is good to know; thanks for pointing it out.
> If you can summarize the specific changes to sgmllib that cause problems for 
> the feed parser, and identify the tests there that rely on the old behavior, 
> I'll be glad to look at the problems.  I expect to have some time in the next 
> few evenings, so I should be able to look at these soon.
> Is the SourceForge CVS the definitive development source for the feed parser?

Yes: but if you check out the CVS HEAD, you won't see any failures as 
I've committed changes that mitigate the problems I've found.

However, if you get the latest release instead, you will see that feeds 
that contain &lt; &amp; or &gt; in attribute values will get these 
converted to <, &, and > characters instead.  In some cases, this can 
cause problems, particularly if the output is reparsed by sgmllib.

Additionally, entity references in the range of &#129; to &#255; will 
cause the released Feed Parser to die with a UnicodeDecodeError.

My workarounds are to re-escape < and > characters, and to escape bare 
ampersands - beyond that I can't really tell for sure which ampersands 
need to be re-escaped, and which ones I should leave as is.
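
A minimal sketch of that re-escaping, written in present-day Python 
rather than copied from the Feed Parser.  The names BARE_AMP and 
reescape_attr are made up for illustration, and the pattern for what 
counts as a "bare" ampersand is an assumption:

```python
import re

# An ampersand is treated as "bare" when it does not begin something that
# looks like a numeric character reference or a named entity reference.
# (Hypothetical pattern, chosen for illustration only.)
BARE_AMP = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")


def reescape_attr(value):
    """Re-escape an attribute value that has already been unescaped."""
    # Escape bare ampersands first, so the '&' in the '&lt;' and '&gt;'
    # produced below is never touched.
    value = BARE_AMP.sub("&amp;", value)
    return value.replace("<", "&lt;").replace(">", "&gt;")


print(reescape_attr("a < b & c &amp; d &#160; >"))
```

Note that an ampersand which already looks like the start of a reference 
is left alone, since there is no way to tell after the fact whether it 
was written that way in the feed or is the product of unescaping.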

And I first try decoding attributes in the original declared encoding 
and then fall back to iso-8859-1.  If a single attribute value contains 
both non-ASCII utf-8 characters and a numeric character reference above 
&#128; then this will produce incorrect results.
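
The fallback, and the case where it goes wrong, can be sketched roughly 
as follows.  This is present-day Python, not the Feed Parser's code; 
decode_attr is a hypothetical name, and the sample bytes assume that the 
parser has already replaced a reference such as &#233; with the single 
byte 0xE9:

```python
def decode_attr(raw, declared_encoding):
    """Decode attribute bytes, falling back to iso-8859-1 on failure."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # iso-8859-1 maps every byte to a character, so this never fails,
        # but it misreads any multi-byte sequences in the value.
        return raw.decode("iso-8859-1")


# Pure utf-8 value: decoded with the declared encoding.
print(repr(decode_attr(b"caf\xc3\xa9", "utf-8")))

# Value holding only an expanded character reference: fallback is correct.
print(repr(decode_attr(b"caf\xe9", "utf-8")))

# Mixed value: the utf-8 decode fails on the lone 0xE9, and the fallback
# then turns the two utf-8 bytes of the first "e-acute" into two characters.
print(repr(decode_attr(b"caf\xc3\xa9 \xe9", "utf-8")))
```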

I also have committed a workaround to the incorrect parsing of 
attributes with quoted markup that I originally reported.

- Sam Ruby
