[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Stephen J. Turnbull stephen at xemacs.org
Wed Jun 6 05:44:59 CEST 2007


Jim Jewett writes:

 > > Not sure what the proposal is here. If people say "we want the PEP
 > > to do NFKC", I understand that as "instead of saying NFC, it should
 > > say NFKC", which in turn means "all identifiers are converted into
 > > the normal form NFKC while parsing".
 > 
 > I would prefer that.

+1

 > > With that change, the full-width ASCII characters would still be
 > > allowed in source - they just wouldn't be different from the regular
 > > ones anymore when comparing identifiers.
 > 
 > I *think* that would be OK;

+1

For the case of Japanese compatibility characters, this would make it
much easier to teach use of non-ASCII identifiers ("sensei, sensei, do
I use full-width numbers or half-width numbers?"  "Whatever you like,
kid, whatever you like."), and eliminate a common source of typos for
neophytes and experienced typists alike.
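
To make the folding concrete, here is a sketch of what NFKC does to
the full-width forms, using the stdlib unicodedata module (the sample
strings are mine, not from the PEP):

    import unicodedata

    fullwidth = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"  # "ＡＢＣ１２３"
    halfwidth = "ABC123"

    # Distinct code points, so the strings compare unequal ...
    print(fullwidth == halfwidth)                                 # False
    # ... but NFKC maps the full-width compatibility forms to ASCII.
    print(unicodedata.normalize("NFKC", fullwidth) == halfwidth)  # True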

Rauli Ruohonen disagrees pretty strongly.  While I suspect I have a
substantial edge over Rauli in experience with daily use of Japanese,
his disagreement worries me.  I will be polling my students ("would
you be more interested in learning Python if ...?") and my more or
less able-to-program colleagues.

BTW -- Martin, what about numeric tokens?  I don't expect ideographic
numbers to be translated to decimal, but if full-width "ABC123" is
decomposed to half-width as an identifier, I think Japanese users will
expect a full-width literal "123" to be recognized as the decimal
number 123 (and similarly for exponent notation in floating-point
literals).  I really think this should be in the scope of this PEP.
(Feel free to count it as a reason against NFKC, if that simplifies
things for you.)
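
For what it's worth, Python already gives a taste of both behaviors
today; a sketch with unicodedata (the example is mine):

    import unicodedata

    literal = "\uFF11\uFF12\uFF13"  # full-width "１２３"

    # Under NFKC, a tokenizer would see an ordinary decimal literal:
    print(unicodedata.normalize("NFKC", literal))  # '123'

    # Independently of any normalization, int() already accepts
    # arbitrary Unicode decimal digits at runtime:
    print(int(literal))  # 123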

 > so long as they mean the same thing, it is just a quirk like using
 > a different font.  I am slightly concerned that it might mean
 > "string as string" and "string as identifier" have different tests
 > for equality.

It does mean that; see Rauli's code.  Does anybody know whether this
bothers LISP users, whose identifiers are case-insensitive?  (My Emacs
LISP experience is no help here, since its identifiers are
case-sensitive.)
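
For anyone who wants to see the divergence directly, a sketch (my own
example, using the U+FB01 ligature from the subject line):

    import unicodedata

    a = "\uFB01le"  # "ﬁle", spelled with the U+FB01 "fi" ligature
    b = "file"

    print(a == b)                                 # False: string equality
    print(unicodedata.normalize("NFKC", a) == b)  # True: identifier equality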

We will need (possibly external) tools to warn about such
decompositions, and a sophisticated tool should also warn about
accesses to identifier dictionaries in their presence.
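
Something like the following minimal sketch would do as a starting
point; the audit function, its name, and its report format are all my
invention, built on the stdlib tokenize and unicodedata modules:

    import tokenize
    import unicodedata

    def audit(path):
        """Report identifiers that NFKC normalization would change."""
        with open(path, "rb") as f:
            for tok in tokenize.tokenize(f.readline):
                if tok.type == tokenize.NAME:
                    folded = unicodedata.normalize("NFKC", tok.string)
                    if folded != tok.string:
                        print("%s:%d: identifier %r normalizes to %r"
                              % (path, tok.start[0], tok.string, folded))

A fuller tool would presumably also flag dynamic accesses to
identifier dictionaries (getattr() and friends), per the paragraph
above.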

 > > Another option would be to require that the source is in NFKC already,
 > > where I then ask again what precisely that means in presence of
 > > non-UTF source encodings.

I don't think this is a good idea.

NB: if there's substantial resistance from users of some of the other
classes of compatibility characters, I have an acceptable fallback:
NFC plus external tools to audit for NFKC differences would be usable,
and for the character sets I'm likely to encounter, it would be
well-defined for the usual encodings.


