[Python-3000] PEP: Supporting Non-ASCII Identifiers

"Martin v. Löwis" martin at v.loewis.de
Tue Jun 5 10:02:37 CEST 2007


> The path of least surprise for legacy encodings might be for
> the codecs to produce whatever is closest to the original encoding
> if possible. I.e. what was one code point would remain one code
> point, and if that's not possible then normalize. I don't know if
> this is any different from always normalizing (it certainly is
> the same for Latin-1).

Depends on the normalization form. For Latin-1, the straightforward
codec produces output that is not in NFKC, since MICRO SIGN would have
to be normalized to GREEK SMALL LETTER MU. The output is, however,
normalized under NFC.
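
For illustration, a quick interpreter session (using the standard
unicodedata module) shows the difference: U+00B5 MICRO SIGN is left
alone by NFC but folded to U+03BC by NFKC.

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u00b5')
u'\xb5'
>>> unicodedata.normalize('NFKC', u'\u00b5')
u'\u03bc'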

Not sure about other codecs; for the CJK ones, I would expect to
see all sorts of issues.

> Always normalizing would have the advantage of simplicity (no matter
> what the encoding, the result is the same), and I think that is
> the real path of least surprise if you sum over all surprises.

I'd like to repeat that this is outside the scope of this PEP, though.
This PEP doesn't, and shouldn't, specify how string literals get
from source to execution.

> FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
> for some reason.

For XML, I believe the reason is performance. It is *fairly* expensive
to compute NFC in the general case, and I'm still uncertain what a good
way would be to reduce the execution cost in the "common case" (i.e.
when the data is already in NFC). For XML, imposing this performance
hit on top of the already costly processing of XML would be
unacceptable.
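
As a rough sketch (my own, not something the PEP would mandate): since
pure-ASCII text is unchanged by every normalization form, a normalizer
could at least skip the expensive call for the ASCII-only case:

import unicodedata

def nfc(text):
    # 'nfc' is a hypothetical helper, not an existing API; 'text' is
    # assumed to be a unicode string.
    # Fast path: ASCII-only text is invariant under all normalization
    # forms, so the comparatively expensive normalize() call is skipped.
    try:
        text.encode('ascii')
        return text
    except UnicodeError:
        return unicodedata.normalize('NFC', text)

That only helps for text containing no non-ASCII characters at all, of
course; the general "already in NFC" case is what the Unicode
quick-check properties (UAX #15) are meant to address.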

Regards,
Martin


