[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Stephen J. Turnbull stephen at xemacs.org
Tue Jun 5 11:19:03 CEST 2007


Jim Jewett writes:

 > > The PEP assumes NFC, but I haven't really understood why, unless that
 > > is required for compatibility with other systems (in which case, it
 > > should be made explicit).

"Martin v. Löwis" writes:

 > It's because UAX#31 tells us to use NFC, in section 5
 > 
 > "Generally if the programming language has case-sensitive identifiers,
 > then Normalization Form C is appropriate; whereas, if the programming
 > language has case-insensitive identifiers, then Normalization Form KC is
 > more appropriate."
 > 
 > As Python has case-sensitive identifiers, NFC is appropriate.

It seems to me that what UAX#31 is saying is "Distinguishing (or not)
between U+0035 DIGIT FIVE and U+2075 SUPERSCRIPT FIVE should be
equivalent to distinguishing (or not) between LATIN CAPITAL LETTER A
and LATIN SMALL LETTER A."  I don't know that I agree (or disagree) in
principle.
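The distinction is easy to check with Python's own unicodedata module
(a sketch, using the digit/superscript pair from the example above):

```python
import unicodedata

digit = "\u0035"  # DIGIT FIVE
sup = "\u2075"    # SUPERSCRIPT FIVE

# NFC leaves the superscript alone (it has no canonical
# decomposition), so the two identifiers stay distinct:
assert unicodedata.normalize("NFC", sup) == sup

# NFKC applies the compatibility decomposition, folding it to '5',
# so the two identifiers would collide:
assert unicodedata.normalize("NFKC", sup) == digit
```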

Here's what UAX#15 has to say:

----------------
Normalization Forms KC and KD must not be blindly applied to arbitrary
text. Because they erase many formatting distinctions, they will
prevent round-trip conversion to and from many legacy character sets,
and unless supplanted by formatting markup, they may remove
distinctions that are important to the semantics of the text. It is
best to think of these Normalization Forms as being like uppercase or
lowercase mappings: useful in certain contexts for identifying core
meanings, but also performing modifications to the text that may not
always be appropriate. They can be applied more freely to domains with
restricted character sets, such as in Section 13,  Programming
Language Identifiers.
----------------

Note that Section 13 == UAX#31 (from which Martin is quoting).  I
don't see this section as being at all supportive of NFC over NFKC,
though.

Some detailed observations biased by my personal tastes:

It seems to me that while I sometimes find it useful for FOO and
foo to be different identifiers, I would almost always consider R3RS
and R³RS to be the same identifier.  The contrast is just too small to
be useful.  And I would never distinguish between a three-character
fine (ﬁ - n - e, with the U+FB01 ligature) and a four-character fine
(f - i - n - e).  I'd really love to see the printer's ligatures gone.
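Concretely (a sketch): the three-character spelling starts with
U+FB01 LATIN SMALL LIGATURE FI, which NFC preserves and NFKC expands:

```python
import unicodedata

ligated = "\ufb01ne"   # ﬁ + n + e: three characters
plain = "fine"         # f + i + n + e: four characters

assert len(ligated) == 3 and len(plain) == 4

# Under NFC the two spellings remain distinct identifiers:
assert unicodedata.normalize("NFC", ligated) == ligated

# Under NFKC the ligature is expanded and they unify:
assert unicodedata.normalize("NFKC", ligated) == plain
```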

I'd love to get rid of full-width ASCII and halfwidth kana (via
compatibility decomposition).  Native Japanese speakers often use them
interchangeably with the "proper" versions when correcting typos and
updating numbers in a series.  Ugly, to say the least.  I don't think
that native Japanese would care, as long as the decomposition is done
internally to Python.
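The width variants behave the same way (a sketch; only the
compatibility decomposition in NFKC folds them):

```python
import unicodedata

fullwidth = "\uff21\uff22\uff23"  # ＡＢＣ: fullwidth Latin capitals
halfwidth_kana = "\uff76"         # ｶ: halfwidth katakana KA

# NFC keeps the width variants as distinct characters:
assert unicodedata.normalize("NFC", fullwidth) == fullwidth

# NFKC folds them onto the ordinary forms:
assert unicodedata.normalize("NFKC", fullwidth) == "ABC"
assert unicodedata.normalize("NFKC", halfwidth_kana) == "\u30ab"  # カ
```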

A scan of the full table for Unicode Version 2.0 (what I have here in
print) suggests that problematic decompositions actually are
restricted to only a few scripts.  LATIN (CAPITAL|SMALL) LETTER L WITH
MIDDLE DOT (used in Catalan, cf sec. 5.1 of UAX#31) are compatibility
decompositions, unlike almost all other Latin decompositions (which
are canonical, and thus get recomposed in NFKC).  ŉ (Afrikaans) and a
half-dozen Croatian digraphs corresponding to single Serbian Cyrillic
letters would get lost.  The Koreans would lose a truckload of
partially composed
Hangul and some archaic ones, the Arabic speakers their presentation
forms.  And that's about it (but I may have missed a bunch because
that database doesn't give the character classes, so I guessed for
stuff like technical symbols -> not ID characters).
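Those problematic cases can be checked against the current character
database (a sketch; these code points are my reading of the paragraph
above):

```python
import unicodedata

cases = {
    "\u0140": "l\u00b7",   # ŀ -> l + MIDDLE DOT (Catalan)
    "\u0149": "\u02bcn",   # ŉ -> modifier apostrophe + n (Afrikaans)
    "\u01c6": "d\u017e",   # ǆ -> d + ž (a Croatian digraph)
}
for ch, decomposed in cases.items():
    # NFC leaves each as a single character; NFKC splits it apart,
    # so the precomposed spelling is "lost":
    assert unicodedata.normalize("NFC", ch) == ch
    assert unicodedata.normalize("NFKC", ch) == decomposed
```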

I suspect that as long as they have the precomposed Hangul, partial-
syllable "ligature" forms won't be an issue for Koreans.  I can't even
distinguish the archaic versions from their compatibility equivalents
by eye, although I'm comfortable with pronouncing Hangul.  I have no
opinion on the Latin decompositions mentioned above or the Arabic
presentation forms.
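The Hangul point can be checked directly (a sketch): NFC already
composes conjoining jamo into the precomposed syllables canonically,
and only NFKC additionally folds the compatibility jamo block:

```python
import unicodedata

# Conjoining jamo KIYEOK + A compose canonically, so even NFC
# produces the precomposed syllable:
jamo = "\u1100\u1161"
assert unicodedata.normalize("NFC", jamo) == "\uac00"  # 가 (GA)

# The compatibility jamo (here U+3131 HANGUL LETTER KIYEOK) survives
# NFC but is folded to the conjoining jamo by NFKC:
assert unicodedata.normalize("NFC", "\u3131") == "\u3131"
assert unicodedata.normalize("NFKC", "\u3131") == "\u1100"
```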

However, of the ones I can judge to some extent (Latin printer's
ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
one* of the compatibility decompositions would be a loss in my
opinion.  On the other hand, there are a bunch of cases where NFKC
would be a marked improvement.
