[Python-3000] PEP: Supporting Non-ASCII Identifiers

"Martin v. Löwis" martin at v.loewis.de
Tue Jun 5 21:21:59 CEST 2007


>> > Unicode does say pretty clearly that (at least) canonical equivalents
>> > must be treated the same.
> 
>> Chapter and verse, please?
> 
> I am pretty sure this list is not exhaustive, but it may be helpful:
> 
> The Identifiers Annex http://www.unicode.org/reports/tr31/

Ah, that's in the context of identifiers, not in the context of text
in general.

> """
> UAX31-C2.    An implementation claiming conformance to Level 1 of this
> specification shall describe which of the following it observes:
> 
> R1 Default Identifiers
> R2 Alternative Identifiers
> R3 Pattern_White_Space and Pattern_Syntax Characters
> R4 Normalized Identifiers
> R5 Case-Insensitive Identifiers
> """
> 
> I interpret this as "If we normalize the Identifiers, then we must
> observe R4."  R4 lets us exclude individual characters from
> normalization, but it says that two IDs with the same Normalization
> Form are equivalent, unless they include specifically excluded
> characters.

Correct, and that's indeed what PEP 3131 does.
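
To illustrate the R4 behaviour, here is a minimal sketch. It assumes an
interpreter that implements PEP 3131 as proposed, with identifiers
converted to NFKC while parsing (CPython 3 does this). The variable
names are made up for the example.

```python
composed = "\u00e9"        # e-acute as a single code point
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

# As plain strings, the two spellings compare unequal.
strings_equal = composed == decomposed

# As identifiers in source code, the parser normalizes both spellings,
# so the second assignment rebinds the same name.
namespace = {}
exec(composed + " = 1\n" + decomposed + " = 2\n", namespace)

# Drop entries such as __builtins__ that exec() adds on its own.
names = [n for n in namespace if not n.startswith("__")]
```

Under that assumption, names holds a single entry, the composed
spelling, bound to the value 2.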

> """
> Normalization Forms KC and KD must not be blindly applied to arbitrary
> text.
> """ ... """
> They can be applied more freely to domains with restricted character
> sets, such as in Section 13, Programming Language Identifiers.
> """
> (section 13 then forwards back to UAX31)

How is that a requirement that comparison should apply normalization?


> TR 15, section 19, numbered paragraph 3
> """
> Higher-level processes that transform or compare strings, or that
> perform other higher-level functions, must respect canonical
> equivalence or problems will result.
> """

That's not a mandatory requirement, but an "important aspect". Also,
it applies to "higher-level processes"; I would expect that string
comparison is not a higher-level function. Indeed, UAX#15 only
gives definitions, no rules.

> """
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.

Right. What is "a process"?

> ...
> Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.
> """

So it should be the application's choice.

> """
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
> ...
> All processes and higher-level protocols are required to abide by C10
> as a minimum.  However, higher-level protocols may define additional
> equivalences that do not constitute modifications under that protocol.
> For example, a higher-level protocol may allow a sequence of spaces to
> be replaced by a single space.
> """

So this *allows* canonicalizing strings; it doesn't *require* Python
to do so. Indeed, doing so on every comparison would be fairly
expensive, and therefore it should not be done (IMO).
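
As a sketch of what leaving the choice to the application could look
like: plain == compares code points, and code that wants canonical
equivalence normalizes explicitly. The helper name canonically_equal is
hypothetical; unicodedata.normalize is the standard library call.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings under canonical equivalence (hypothetical helper)."""
    # NFC maps canonically equivalent sequences to the same string.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

plain_result = composed == decomposed                       # code point comparison
normalized_result = canonically_equal(composed, decomposed)
```

Here plain_result is False and normalized_result is True; the
normalization cost is paid only by callers that ask for it.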

>> Why that? The caller of getattr would need to apply normalization in
>> case the input isn't known to be normalized?
> 
> OK, I suppose that might work, if documented, but ... it seems like
> another piece of boilerplate; when it isn't there, it won't really be
> because the input is normalized so often as it is because the author
> didn't think about normalization.

No. It might also be because the author *knows* that the string is
already normalized.
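
For the other case, where the caller does not know that, the
boilerplate in question might look like the sketch below. The class
Config and the helper getattr_normalized are hypothetical, and the
choice of NFKC assumes attribute names were stored in the form PEP 3131
uses for identifiers.

```python
import unicodedata

class Config:
    pass

obj = Config()
setattr(obj, "\u00e9", 1)   # attribute name stored in composed form

def getattr_normalized(o, name, form="NFKC"):
    """Normalize the name, then do an ordinary getattr (hypothetical helper)."""
    return getattr(o, unicodedata.normalize(form, name))

# getattr itself performs no normalization, so the decomposed
# spelling does not find the attribute and the default is returned.
raw_lookup = getattr(obj, "e\u0301", None)

# Normalizing first finds it.
normalized_lookup = getattr_normalized(obj, "e\u0301")
```

A caller that already holds a normalized string can skip the helper and
call getattr directly.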

Regards,
Martin

