[Python-3000] PEP: Supporting Non-ASCII Identifiers
"Martin v. Löwis"
martin at v.loewis.de
Tue Jun 5 21:21:59 CEST 2007
>> > Unicode does say pretty clearly that (at least) canonical equivalents
>> > must be treated the same.
>
>> Chapter and verse, please?
>
> I am pretty sure this list is not exhaustive, but it may be helpful:
>
> The Identifiers Annex http://www.unicode.org/reports/tr31/
Ah, that's in the context of identifiers, not in the context of text
in general.
> """
> UAX31-C2. An implementation claiming conformance to Level 1 of this
> specification shall describe which of the following it observes:
>
> R1 Default Identifiers
> R2 Alternative Identifiers
> R3 Pattern_White_Space and Pattern_Syntax Characters
> R4 Normalized Identifiers
> R5 Case-Insensitive Identifiers
> """
>
> I interpret this as "If we normalize the Identifiers, then we must
> observe R4." R4 lets us exclude individual characters from
> normalization, but it says that two IDs with the same Normalization
> Form are equivalent, unless they include specifically excluded
> characters.
Correct, and that's indeed what PEP 3131 does.
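As an illustration only, here is a minimal sketch of what R4 means in
practice. It assumes a CPython 3 interpreter that implements PEP 3131
(identifiers converted to NFKC while parsing) and the standard
unicodedata module; the name "café" is just a made-up example.

```python
import unicodedata

# Two canonically equivalent spellings of the same name (illustrative only).
composed = "caf\u00e9"      # e with acute as a single code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# As plain strings they are different sequences of code points.
assert composed != decomposed

# NFKC maps the decomposed spelling to the composed one.
assert unicodedata.normalize("NFKC", decomposed) == composed

# The compiler normalizes identifiers, so assigning through the decomposed
# spelling binds the composed (NFKC) name in the namespace.
namespace = {}
exec(decomposed + " = 1", namespace)
assert composed in namespace
assert decomposed not in namespace
```

Only identifiers in source code are normalized here; the two string
objects themselves still compare unequal.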
> """
> Normalization Forms KC and KD must not be blindly applied to arbitrary
> text.
> """ ... """
> They can be applied more freely to domains with restricted character
> sets, such as in Section 13, Programming Language Identifiers.
> """
> (section 13 then forwards back to UAX31)
How is that a requirement that comparison should apply normalization?
> TR 15, section 19, numbered paragraph 3
> """
> Higher-level processes that transform or compare strings, or that
> perform other higher-level functions, must respect canonical
> equivalence or problems will result.
> """
That's not a mandatory requirement, but an "important aspect". Also,
it applies to "higher-level processes"; I would expect that string
comparison is not a higher-level function. Indeed, UAX#15 only
gives definitions, no rules.
> C9 A process shall not assume that the interpretations of two
> canonical-equivalent character sequences are distinct.
Right. What is "a process"?
> ...
> Ideally, an implementation would always interpret two
> canonical-equivalent character sequences identically. There are
> practical circumstances under which implementations may reasonably
> distinguish them.
> """
So it should be the application's choice.
> """
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
> ...
> All processes and higher-level protocols are required to abide by C10
> as a minimum. However, higher-level protocols may define additional
> equivalences that do not constitute modifications under that protocol.
> For example, a higher-level protocol may allow a sequence of spaces to
> be replaced by a single space.
> """
So this *allows* strings to be canonicalized; it doesn't *require* Python
to do so. Indeed, doing so would be fairly expensive, and therefore
it should not be done (IMO).
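A minimal sketch of leaving that choice to the application, assuming the
standard unicodedata module. The helper name canonically_equal is
hypothetical, not an existing API: plain == compares code points, and
the application normalizes explicitly only where it wants canonical
equivalence.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings up to canonical equivalence (hypothetical helper).

    Both arguments are converted to NFC first, so the extra normalization
    work is paid only by callers that ask for it.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "A\u030a"    # "A" followed by COMBINING RING ABOVE

# Ordinary string comparison does not normalize.
plain_result = (precomposed == combining)

# The explicit helper treats the two spellings as the same text.
canonical_result = canonically_equal(precomposed, combining)
```

With this split, code that already knows its strings are normalized can
keep using == directly.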
>> Why that? The caller of getattr would need to apply normalization in
>> case the input isn't known to be normalized?
>
> OK, I suppose that might work, if documented, but ... it seems like
> another piece of boilerplate; when it isn't there, it won't really be
> because the input is normalized so often as it is because the author
> didn't think about normalization.
No. It might also be because the author *knows* that the string is
already normalized.
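For the case where the input is *not* known to be normalized, the
caller-side step could look roughly like the sketch below. It assumes
that attribute names defined in source are stored in NFKC form (as
PEP 3131 specifies); getattr_normalized is a hypothetical helper, and
the Menu class and its attribute are made up for the example.

```python
import unicodedata

def getattr_normalized(obj, name, form="NFKC"):
    """Look up an attribute after normalizing its name (hypothetical helper)."""
    return getattr(obj, unicodedata.normalize(form, name))

class Menu:
    pass

menu = Menu()
# Stored under the composed (already normalized) spelling.
setattr(menu, "caf\u00e9", "open")

# Plain getattr does no normalization, so the decomposed spelling is
# simply a different attribute name.
try:
    getattr(menu, "cafe\u0301")
    found = True
except AttributeError:
    found = False

# The helper normalizes first and therefore finds the attribute.
value = getattr_normalized(menu, "cafe\u0301")
```

A caller that knows the name is already normalized can skip the helper
and call getattr directly.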
Regards,
Martin