[Python-Dev] sgmllib Comments

Sam Ruby rubys at intertwingly.net
Mon Jun 12 12:49:50 CEST 2006


Martin v. Löwis wrote:
> Sam Ruby wrote:
>> If we can agree on the behavior, I would be glad to write up a patch.
>>
>> It seems to me that the simplest way to proceed would be for the code
>> that attempts to resolve character references (both named and numeric)
>> in attributes to be isolated in a single method.  Subclasses that desire
>> different behavior (including the existing Python 2.4 and prior
>> behaviour) could simply override this method.
> 
> In SGML, this is problematic: The named things are not character
> references, they are entity references, and it isn't necessarily
> the case that they expand to a character. For example, &author;
> might expand to "Martin v. Löwis", and &logo; might refer to a
> bitmap image which is unparsed.
> 
> That said, providing an overridable replacement function sounds
> like the right approach. To keep with tradition, I would still
> distinguish between character references and entity references,
> i.e. providing two overridable functions instead. Returning
> None could mean that no replacement is available.
> 
> As for default implementations, I think they should do what
> currently happens: entity references are replaced according to
> entitydefs, character references are replaced by bytes if
> their values are smaller than 256.
> 
> Contrary to what others said, it appears that SGML *does*
> support hexadecimal character references, provided that
> the SGML declaration contains the HCRO definition (which,
> for HTML and XML, is defined as HCRO "&#x"). So it seems
> safe to process hex character references by default (although
> it isn't safe to assume Unicode, IMO).

I don't see why expanding to multiple characters is a problem.

Just so that we have a tracking number and real code to anchor this 
discussion, I've opened the following and attached a patch:

http://python.org/sf/1504676

This implementation does handle multiple character expansions.  It does 
default to exactly what the current code does.  It does *not* currently 
handle hexadecimal character references.

It also passes all the current sgmllib tests, though I did not 
include any additional tests in this initial patch.
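
For readers who want the shape of the idea without opening the tracker, 
here is a minimal, self-contained sketch of the "overridable hooks" 
approach discussed above. It is not the code from the patch: the class 
name AttributeReferenceResolver and the method names convert_charref, 
convert_entityref and resolve are assumptions chosen for illustration, 
and the sketch is written for a current Python 3 rather than 2.4. 
Returning None from a hook means "no replacement available", and 
hexadecimal references are deliberately left untouched.

```python
import re


class AttributeReferenceResolver:
    """Hypothetical stand-in for the parser; not the actual sgmllib code."""

    # Illustrative subset of named entities; a real parser would define more.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    # Matches decimal character references (&#65;) and entity references
    # (&name;). Hexadecimal references (&#x41;) intentionally do not match.
    _ref = re.compile(r"&(?:#([0-9]+)|([a-zA-Z][-.a-zA-Z0-9]*));?")

    def convert_charref(self, digits):
        """Return the text for a decimal character reference, or None."""
        n = int(digits)
        if n > 255:
            return None  # default: only values below 256 are replaced
        return chr(n)

    def convert_entityref(self, name):
        """Return the text for a named entity reference, or None."""
        return self.entitydefs.get(name)

    def resolve(self, value):
        """Expand the references in an attribute value via the two hooks."""
        def replace(match):
            digits, name = match.group(1), match.group(2)
            if digits is not None:
                replacement = self.convert_charref(digits)
            else:
                replacement = self.convert_entityref(name)
            # None means "leave the reference exactly as written".
            return match.group(0) if replacement is None else replacement
        return self._ref.sub(replace, value)


class AuthorResolver(AttributeReferenceResolver):
    """Subclass whose entity expands to more than one character."""

    def convert_entityref(self, name):
        if name == "author":
            return "Martin v. Löwis"
        return super().convert_entityref(name)


default = AttributeReferenceResolver()
custom = AuthorResolver()
```

With these definitions, default.resolve("a &lt; b &#65;") expands both 
references, while "&#300;", "&#x41;" and "&author;" pass through the 
default resolver unchanged. Only AuthorResolver expands "&author;", 
which shows that a multi-character expansion needs nothing special 
from the hook interface.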

- Sam Ruby

