[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?
Stephen J. Turnbull
stephen at xemacs.org
Wed Jun 6 05:44:59 CEST 2007
Jim Jewett writes:
> > Not sure what the proposal is here. If people say "we want the PEP do
> > NFKC", I understand that as "instead of saying NFC, it should say
> > NFKC", which in turn means "all identifiers are converted into the
> > normal form NFKC while parsing".
> I would prefer that.
> > With that change, the full-width ASCII characters would still be
> > allowed in source - they just wouldn't be different from the regular
> > ones anymore when comparing identifiers.
> I *think* that would be OK;
For the case of Japanese compatibility characters, this would make it
much easier to teach use of non-ASCII identifiers ("sensei, sensei, do
I use full-width numbers or half-width numbers?" "Whatever you like,
kid, whatever you like."), and eliminate a common source of typos for
neophytes and experienced typists alike.
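To make that concrete, here is a minimal sketch (in terms of the
stdlib unicodedata module; the sample strings are purely illustrative)
of what NFKC folding does to full-width characters:

    import unicodedata

    full_width = "Ｐｙｔｈｏｎ３"   # full-width forms, as a Japanese IME produces
    half_width = "Python3"

    # As raw strings the two are distinct code point sequences...
    print(full_width == half_width)                                 # False

    # ...but NFKC folds the compatibility characters onto the ASCII
    # forms, so as identifiers they would compare equal.
    print(unicodedata.normalize("NFKC", full_width) == half_width)  # True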
Rauli Ruohonen disagrees pretty strongly. While I suspect I have a
substantial edge over Rauli in experience with daily use of Japanese,
that worries me. I will be polling my students (for "would you be
more interested in learning Python if ...") and my more or less
computer-literate colleagues for their opinions.
BTW -- Martin, what about numeric tokens? I don't expect ideographic
numbers to be translated to decimal, but if full-width "ABC123" is
decomposed to half-width as an identifier, I think Japanese users will
expect a literal full-width "123" to be recognized as the decimal
number 123 (and similarly for exponent notation in floating-point
literals). I really think
this should be in the scope of this PEP. (Feel free to count it as a
reason against NFKC, if that simplifies things for you.)
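For comparison, here is roughly how things stand at runtime (a sketch
using the stdlib unicodedata module; the tokenizer's own behavior is
exactly what's at issue):

    import unicodedata

    fw = "１２３"   # full-width digits U+FF11 U+FF12 U+FF13

    # NFKC already folds full-width digits to ASCII:
    print(unicodedata.normalize("NFKC", fw))   # '123'

    # int() accepts any Unicode decimal digits at runtime...
    print(int(fw))                             # 123

    # ...but a bare full-width literal in source text, e.g.
    #     x = １２３
    # is a syntax error unless numeric tokens are normalized too.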
> so long as they mean the same thing, it is just a quirk like using
> a different font. I am slightly concerned that it might mean
> "string as string" and "string as identifier" have different tests
> for equality.
It does mean that; see Rauli's code. Does anybody know if this
bothers LISP users, where identifiers are case-insensitive? (My Emacs
LISP experience is no guide here, since Emacs Lisp identifiers are
case-sensitive.)
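Rauli's code isn't reproduced here, but the distinction is easy to
illustrate (a sketch; the ligature example ties back to the subject
line):

    import unicodedata

    def ident_eq(a, b):
        # Identifier equality under the proposed rule: compare the
        # NFKC-normalized forms, not the raw code point sequences.
        return (unicodedata.normalize("NFKC", a)
                == unicodedata.normalize("NFKC", b))

    s1, s2 = "ﬁle", "file"   # U+FB01 LATIN SMALL LIGATURE FI vs 'f' 'i'

    print(s1 == s2)          # False: "string as string"
    print(ident_eq(s1, s2))  # True:  "string as identifier"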
We will need (possibly external) tools to warn about such
decompositions, and a sophisticated tool should warn about accesses to
identifier dictionaries in the presence of such decompositions as
well.
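Such a tool needn't be complicated; a first cut (hypothetical, built
on the stdlib tokenize and unicodedata modules) could be as simple as:

    import tokenize
    import unicodedata

    def audit_identifiers(path):
        # Flag every identifier whose NFKC normal form differs from
        # its spelling in the source file at 'path'.
        with tokenize.open(path) as f:
            for tok in tokenize.generate_tokens(f.readline):
                if tok.type == tokenize.NAME:
                    folded = unicodedata.normalize("NFKC", tok.string)
                    if folded != tok.string:
                        row, col = tok.start
                        print("%s:%d:%d: %r normalizes to %r"
                              % (path, row, col, tok.string, folded))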
> > Another option would be to require that the source is in NFKC already,
> > where I then ask again what precisely that means in presence of
> > non-UTF source encodings.
I don't think this is a good idea.
NB: if there's substantial resistance from users of some of the other
classes of compatibility characters, I have an acceptable fallback.
NFC plus external tools to audit for NFKC would be usable, and for the
character sets I'm likely to encounter, it would be well-defined for
the usual encodings.
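In code, that fallback rule would amount to something like this (a
sketch; check_name is a hypothetical helper):

    import unicodedata

    def check_name(name):
        # Require NFC, and merely warn when NFC and NFKC disagree,
        # instead of folding compatibility characters silently.
        if unicodedata.normalize("NFC", name) != name:
            return "error: identifier is not in NFC"
        if unicodedata.normalize("NFKC", name) != name:
            return "warning: contains compatibility characters"
        return "ok"

    for name in ("Python", "ｆｕｌｌｗｉｄｔｈ", "ﬁle"):
        print(name, "->", check_name(name))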