[Python-3000] PEP: Supporting Non-ASCII Identifiers

Jim Jewett jimjjewett at gmail.com
Thu Jun 7 17:24:22 CEST 2007


On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> > Unicode does say pretty clearly that (at least) canonical
> >> > equivalents must be treated the same.

On reflection, what it actually says is that you may not assume they
are different.  They can be different in the same way that two
identical strings are different under "is", but anything stronger has
to be strictly internal.

If any code outside the python core even touches the string, then the
choice of representations becomes arbitrary, and can switch for
spurious reasons.  Immutability should prevent mid-run switching for a
single "is" string, but not for different strings that should compare
"==".

Dictionary keys need to keep working, which means hash and equality
have to do the right thing.  Ordering may technically be a
quality-of-implementation issue, but ... normalizing strings on
creation solves an awful lot of problems, including providing a "best
practice" for C extensions.  Not normalizing will save a small amount
of time, at the cost of a never-ending hunt for rare and obscure bugs.
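To make the dict-key problem concrete, here is a small sketch (using the stdlib unicodedata module; the strings are my own illustration, not from the thread) of two canonically equivalent spellings landing in different slots unless something normalizes them:

```python
import unicodedata

# Two canonically equivalent spellings of "café":
# precomposed U+00E9 vs. 'e' plus combining acute U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Without normalization they compare unequal and are distinct keys...
assert precomposed != decomposed
d = {precomposed: 1, decomposed: 2}
assert len(d) == 2

# ...but NFC-normalizing on creation collapses them to one key.
nfc = unicodedata.normalize
d2 = {nfc("NFC", precomposed): 1, nfc("NFC", decomposed): 2}
assert len(d2) == 1
```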

> >> Chapter and verse, please?

> > I am pretty sure this list is not exhaustive, but it may be
> > helpful:

> > The Identifiers Annex http://www.unicode.org/reports/tr31/

> Ah, that's in the context of identifiers, not in the context of text
> in general.

Yes, but that should also apply to dict and shelve keys.  If you want
an array of code points, then you want a tuple of ints, not text.
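A sketch of that suggestion (the helper name is mine): if the keys really must distinguish exact code-point sequences, a tuple of ints says so explicitly and would survive any text-level normalization an implementation might apply.

```python
# Keys that must distinguish code-point sequences exactly are
# better stored as tuples of ints than as text.
def codepoint_key(s):
    return tuple(ord(c) for c in s)

# Today's CPython already compares these strings unequal, but the
# tuple form makes the code-point-exact intent explicit.
assert codepoint_key("caf\u00e9") != codepoint_key("cafe\u0301")
assert codepoint_key("abc") == (97, 98, 99)
```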

> > """
> > Normalization Forms KC and KD must not be blindly
> > applied to arbitrary text.
> > """

Note that it lists only the Kompatibility forms.  By implication,
forms NFC and NFD *can* be blindly applied to arbitrary text.  (And
conformance rule C9 means you have to assume that someone else might
do so, if, say, the text is python source code that may have been
externally edited.)
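The asymmetry is easy to demonstrate (my own example, not from the thread): NFC leaves compatibility characters alone and is idempotent, so re-normalizing an externally edited file round-trips, while NFKC discards a meaningful distinction.

```python
import unicodedata

# NFC can be applied to arbitrary text without losing distinctions;
# NFKC cannot.
s = "x\u00b2"   # "x²" (superscript two)
assert unicodedata.normalize("NFC", s) == s       # unchanged
assert unicodedata.normalize("NFKC", s) == "x2"   # distinction lost

# NFC is idempotent, so a source file re-normalized by an external
# editor round-trips cleanly.
t = "cafe\u0301"
once = unicodedata.normalize("NFC", t)
assert unicodedata.normalize("NFC", once) == once
```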

... """
> > They can be applied more freely to domains with restricted
> > character sets, such as in Section 13, Programming
> > Language Identifiers.
> > """
> > (section 13 then forwards back to UAX31)

> How is that a requirement that comparison should apply
> normalization?

It isn't a requirement that we apply normalization.  But

(1)  There is a requirement that semantics not change based on
external canonical [de]normalization of source code, including literal
strings.  (I agree that explicit python-level escapes -- made after
the file has already been converted from bytes to characters -- are
legitimate, just as changing 1.0 from a string to a number is
legitimate.)

(2)  It is a *suggestion* that we consider the stronger Kompatibility
normalizations for source code.

There are cases where strings which are equal under Kompatibility
should be treated differently, but, I think, in practice, the
difference is more likely to come from typos or difficulty entering
the proper characters.  Normalizing to the compatibility form would be
helpful for some people (Japanese and Korean input was mentioned).
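For instance (my own illustration of the input problem mentioned above), full-width Latin letters are easy to produce accidentally from an East Asian IME, and NFKC folds them back to ordinary ASCII:

```python
import unicodedata

# Full-width Latin letters from an East Asian IME fold to plain
# ASCII under NFKC, but not under NFC.
fullwidth = "\uff30\uff59\uff54\uff48\uff4f\uff4e"   # "Ｐｙｔｈｏｎ"
assert fullwidth != "Python"
assert unicodedata.normalize("NFC", fullwidth) == fullwidth
assert unicodedata.normalize("NFKC", fullwidth) == "Python"
```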

I think the need to distinguish the Kompatibility characters (and not
even in data; in source literals) will be rare enough that it is worth
making the distinction explicit.  (If you need a compatibility
character, then use an escape, rather than the raw character, so that
people will know you really mean the alternate, instead of a "normal"
character that happens to look like that.)
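A sketch of what that convention would look like (my own example): the "fi" ligature U+FB01 written as an escape is unmistakable, where the raw character would just look like a typo for "fi".

```python
import unicodedata

# Writing the ligature as an escape signals it is deliberate.
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
assert ligature != "fi"
# Under Kompatibility normalization the distinction would vanish.
assert unicodedata.normalize("NFKC", ligature) == "fi"
```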

-jJ
