[Python-Dev] sgmllib Comments
Sam Ruby
rubys at intertwingly.net
Mon Jun 12 12:49:50 CEST 2006
Martin v. Löwis wrote:
> Sam Ruby wrote:
>> If we can agree on the behavior, I would be glad to write up a patch.
>>
>> It seems to me that the simplest way to proceed would be for the code
>> that attempts to resolve character references (both named and numeric)
>> in attributes to be isolated in a single method. Subclasses that desire
>> different behavior (including the existing Python 2.4 and prior
>> behaviour) could simply override this method.
>
> In SGML, this is problematic: The named things are not character
> references, they are entity references, and it isn't necessarily
> the case that they expand to a character. For example, &author;
> might expand to "Martin v. Löwis", and &logo; might refer to a
> bitmap image which is unparsed.
>
> That said, providing an overridable replacement function sounds
> like the right approach. To keep with tradition, I would still
> distinguish between character references and entity references,
> i.e. providing two overridable functions instead. Returning
> None could mean that no replacement is available.
>
> As for default implementations, I think they should do what
> currently happens: entity references are replaced according to
> entitydefs, character references are replaced by bytes if
> they are smaller than 256.
>
> Contrary to what others said, it appears that SGML *does*
> support hexadecimal character references, provided that
> the SGML declaration contains the HCRO definition (which,
> for HTML and XML, is defined as HCRO "&#x"). So it seems
> safe to process hex character references by default (although
> it isn't safe to assume Unicode, IMO).
I don't see why expanding to multiple characters is a problem.

Just so that we have a tracking number and real code to anchor this
discussion, I've opened the following and attached a patch:

http://python.org/sf/1504676

This implementation does handle multiple character expansions. It does
default to exactly what the current code does. It does *not* currently
handle hexadecimal character references.

It also does pass all the current sgmllib tests, though I did not
include any additional tests in this initial patch.
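
For readers who want something concrete to look at, here is a minimal,
self-contained sketch of the two-hook design discussed above. It is not
the attached patch and does not use sgmllib itself; the class names, the
method names (convert_charref, convert_entityref, resolve) and the
regular expression are all assumptions made for illustration. A hook
returns None when no replacement is available, in which case the
reference is left in the attribute value unchanged. The defaults follow
the behaviour described in this thread: entity references are looked up
in entitydefs, and decimal character references are replaced only if
they are smaller than 256 (this sketch returns a one-character str
where the original code would produce a byte).

```python
import re

# Assumed pattern: &name; entity references, &#NNN; and &#xHH; character
# references. The real parser's tokenizing rules may differ.
_REFERENCE = re.compile(
    r"&(?:#([xX][0-9a-fA-F]+|[0-9]+)|([a-zA-Z][-.a-zA-Z0-9]*));"
)


class AttributeResolver:
    """Hypothetical stand-in for the parser class; not sgmllib's API."""

    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def convert_charref(self, name):
        """Return replacement text for &#name;, or None if unavailable."""
        try:
            code = int(name)
        except ValueError:
            return None  # e.g. "x41": hexadecimal is not handled by default
        if not 0 <= code <= 255:
            return None
        return chr(code)

    def convert_entityref(self, name):
        """Return replacement text for &name;, or None if unavailable."""
        return self.entitydefs.get(name)

    def resolve(self, value):
        """Expand references in an attribute value via the two hooks."""
        def replace(match):
            charref, entityref = match.groups()
            if charref is not None:
                result = self.convert_charref(charref)
            else:
                result = self.convert_entityref(entityref)
            # None means "no replacement": keep the original text.
            return match.group(0) if result is None else result
        return _REFERENCE.sub(replace, value)


class HexAwareResolver(AttributeResolver):
    """Subclass that opts in to hex references and a multi-character entity."""

    entitydefs = dict(AttributeResolver.entitydefs, author="Martin v. Löwis")

    def convert_charref(self, name):
        if name[:1] in ("x", "X"):
            code = int(name[1:], 16)
            # Assumption: code points are treated as Unicode here, which
            # the thread notes is not safe to assume in general.
            return chr(code) if code <= 0x10FFFF else None
        return super().convert_charref(name)
```

With the default class, resolve("a &lt; b &#65; &#x41;") gives
"a < b A &#x41;", while HexAwareResolver().resolve("&#x41; by &author;")
gives "A by Martin v. Löwis". Keeping the two hooks separate means a
subclass can change one behaviour without touching the other.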
- Sam Ruby