[Python-Dev] sgmllib Comments

Mon Jun 12 08:18:50 CEST 2006

Sam Ruby wrote:
> If we can agree on the behavior, I would be glad to write up a patch.
> 
> It seems to me that the simplest way to proceed would be for the code
> that attempts to resolve character references (both named and numeric)
> in attributes to be isolated in a single method.  Subclasses that desire
> different behavior (including the existing Python 2.4 and prior
> behaviour) could simply override this method.

In SGML, this is problematic: The named things are not character
references, they are entity references, and it isn't necessarily
the case that they expand to a character. For example, &author;
might expand to "Martin v. Löwis", and &logo; might refer to a
bitmap image which is unparsed.

That said, providing an overridable replacement function sounds
like the right approach. To keep with tradition, I would still
distinguish between character references and entity references,
i.e. providing two overridable functions instead. Returning
None could mean that no replacement is available.
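
To make the two-function idea concrete, here is a minimal sketch. It does
not use sgmllib itself, and the names (AttributeResolver, convert_charref,
convert_entityref, resolve) are placeholders I made up for illustration,
not a proposed API. It only handles decimal character references, and it
leaves a reference untouched whenever the hook returns None.

```python
import re

# Matches a decimal character reference (&#65;) or an entity reference (&name;).
_REF = re.compile(r"&(?:#([0-9]+)|([A-Za-z][-.A-Za-z0-9]*));")


class AttributeResolver:
    """Hypothetical helper; all names here are illustrative assumptions."""

    # Entity names mapped to replacement text (example data only).
    entitydefs = {"author": "Martin v. L\u00f6wis"}

    def convert_charref(self, number):
        """Return the replacement for a character reference, or None."""
        return None  # this sketch offers no default replacement

    def convert_entityref(self, name):
        """Return the replacement for an entity reference, or None."""
        return self.entitydefs.get(name)

    def resolve(self, value):
        """Replace references in an attribute value using the two hooks."""
        def replace(match):
            number, name = match.groups()
            if number is not None:
                result = self.convert_charref(number)
            else:
                result = self.convert_entityref(name)
            # None means "no replacement available": keep the original text.
            return match.group(0) if result is None else result
        return _REF.sub(replace, value)


class AsciiCharrefs(AttributeResolver):
    """Subclass that overrides only the character-reference hook."""

    def convert_charref(self, number):
        n = int(number)
        return chr(n) if n < 128 else None
```

A subclass that wants different behaviour for one kind of reference
overrides just that hook, and an unparsed entity such as &logo; simply
stays in the text because its hook returns None.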

As for default implementations, I think they should do what
currently happens: entity references are replaced according to
entitydefs, and character references are replaced by the corresponding
byte if their value is smaller than 256.
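
As a rough sketch of those defaults (written as standalone functions with
made-up names, and with a small entity table that I am assuming for the
example rather than copying from the module):

```python
import re

# Assumed example table: entity name -> replacement bytes.
entitydefs = {"lt": b"<", "gt": b">", "amp": b"&", "quot": b'"', "apos": b"'"}


def default_convert_entityref(name, table=entitydefs):
    """Look the entity up in the table; None if it is not defined."""
    return table.get(name)


def default_convert_charref(number):
    """Return a single byte for values below 256, otherwise None."""
    if re.fullmatch(r"[0-9]+", number) is None:
        return None  # not a decimal character number
    n = int(number)
    if n >= 256:
        return None  # no replacement available for larger values
    return bytes([n])
```

Here default_convert_charref("65") gives b"A", while "256" and anything
larger gives None, so the reference would be left alone.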

Contrary to what others said, it appears that SGML *does*
support hexadecimal character references, provided that
the SGML declaration contains the HCRO definition (which,
for HTML and XML, is defined as HCRO "&#38;#x"). So it seems
safe to process hex character references by default (although
it isn't safe to assume Unicode, IMO).
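
A sketch of what parsing both forms might look like, assuming the HCRO is
the literal "&#x" (lowercase x only, which is my simplification). The
function name charref_code is hypothetical. It deliberately returns the
character number as an integer and leaves the question of which character
set that number refers to up to the caller.

```python
import re

# Decimal form: &#65;   Hexadecimal form (HCRO "&#x"): &#x41;
_CHARREF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def charref_code(ref):
    """Return the character number of a reference, or None if it is not one."""
    match = _CHARREF.fullmatch(ref)
    if match is None:
        return None
    hex_digits, dec_digits = match.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)
    return int(dec_digits, 10)
```

With this, "&#x41;" and "&#65;" both yield 65, and an entity reference
such as "&amp;" yields None because it is not a character reference.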

Regards,
Martin