[Python-3000] PEP: Supporting Non-ASCII Identifiers

Wed Jun 6 06:41:28 CEST 2007

"Martin v. Löwis" writes:

 > > TR 15, section 19, numbered paragraph 3
 > > """
 > > Higher-level processes that transform or compare strings, or that
 > > perform other higher-level functions, must respect canonical
 > > equivalence or problems will result.
 > > """
 > 
 > That's not a mandatory requirement, but an "important aspect". Also,
 > it applies to "higher-level processes"; I would expect that string
 > comparison is not a higher-level function. Indeed, UAX#15 only
 > gives definitions, no rules.

In the language of these standards, I would expect that string
comparison is exactly the kind of higher-level process they have in
mind.  In fact, it is given as an example in what Jim quoted above.

 > > C9 A process shall not assume that the interpretations of two
 > > canonical-equivalent character sequences are distinct.
 > 
 > Right. What is "a process"?

Anything that accepts Unicode on input or produces it on output, and
claims to conform to the standard.

 > > ...
 > > Ideally, an implementation would always interpret two
 > > canonical-equivalent character sequences identically. There are
 > > practical circumstances under which implementations may reasonably
 > > distinguish them.
 > > """
 > 
 > So it should be the application's choice.

I don't think so.  I think the kind of practical circumstance they
have in mind is (e.g.) a Unicode document which is PGP-signed.  PGP
clearly will not be able to verify a canonicalized document, unless it
happened to be in canonical form when transmitted.  But I think it is
quite clear that they do not admit that an implementation might return
False when evaluating u"L\u00F6wis" == u"Lo\u0308wis".
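
To make the two cases concrete, here is a minimal sketch using only the
standard unicodedata and hashlib modules.  The variable names are
illustrative, and the SHA-256 digest is merely a stand-in (an
assumption for this sketch) for what a PGP signature would be computed
over; it is not how PGP itself works.

```python
import hashlib
import unicodedata

precomposed = "L\u00F6wis"   # o-with-diaeresis as a single code point
decomposed = "Lo\u0308wis"   # "o" followed by COMBINING DIAERESIS

# Plain comparison works code point by code point, so it
# distinguishes the two canonical-equivalent spellings.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form (NFC here) respects
# canonical equivalence.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))


def digest(text):
    """Hash the UTF-8 bytes; a stand-in for a byte-level signature."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# A byte-level check legitimately sees two different documents ...
digests_differ = digest(precomposed) != digest(decomposed)

# ... and only matches if the text was already in the form that was
# signed (here, NFC).
digest_matches_after_nfc = (
    digest(unicodedata.normalize("NFC", decomposed)) == digest(precomposed)
)

print(raw_equal, nfc_equal, digests_differ, digest_matches_after_nfc)
```

The first result is the behavior in dispute (the plain comparison
yields False); the digest lines show the kind of process where
distinguishing the two sequences is reasonable.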

 > So this *allows* to canonicalize strings, it doesn't *require* Python
 > to do so. Indeed, doing so would be fairly expensive, and therefore
 > it should not be done (IMO).

It would be much more expensive to make all string comparisons grok
canonical equivalence.  That's why it *allows* canonicalization.
Otherwise the PGP signature case would suggest that canonicalization
should be forbidden (except where that is part of the definition of
the process), and canonical-equivalence testing be done at the site of
each comparison.
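
The two strategies can be sketched roughly as follows.  This is only
an illustration under stated assumptions: the helper names
canonical_equal and canonicalize are hypothetical, NFC is chosen
arbitrarily as the canonical form, and nothing here is proposed as
what Python or the PEP should do.

```python
import unicodedata


def canonical_equal(a, b):
    """Strategy 1: test equivalence at the site of each comparison.

    Both operands are normalized every time they are compared.
    """
    return (unicodedata.normalize("NFC", a)
            == unicodedata.normalize("NFC", b))


def canonicalize(s):
    """Strategy 2: normalize once, when the string enters the process.

    Afterwards the ordinary == operator (and hashing) is sufficient.
    """
    return unicodedata.normalize("NFC", s)


names = ["L\u00F6wis", "Lo\u0308wis"]

# Strategy 1: the raw strings stay distinct, the helper hides that.
equal_at_site = canonical_equal(names[0], names[1])

# Strategy 2: pay the normalization cost once per string.
stored = [canonicalize(n) for n in names]
equal_after_canonicalizing = stored[0] == stored[1]
distinct_keys = len(set(stored))

print(equal_at_site, equal_after_canonicalizing, distinct_keys)
```

With the second strategy the cost is paid once per string rather than
once per comparison, and dictionary or set lookups see a single key.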

You are correct that this is outside the scope of PEP 3131, but I
don't want your interpretation of "Unicode conformance" (which I
believe to be incorrect) to go unchallenged.