[Python-3000] Unicode IDs -- why NFC? Why allow ligatures?

Tue Jun 5 19:10:02 CEST 2007

On 6/5/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> It seems to me that what UAX#31 is saying is "Distinguishing (or not)
> between 0035 DIGIT FIVE and 2075 SUPERSCRIPT FIVE should be
> equivalent to distinguishing (or not) between LATIN CAPITAL
> LETTER A and LATIN SMALL LETTER A."  I don't know that
> I agree (or disagree) in principle.

So effectively, they consider "a" and "A" to be presentational variants.
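
As a rough illustration of the two kinds of distinction being compared,
here is a sketch using Python's standard unicodedata module with the
code points quoted above.  The variable names are mine, not anything
from UAX#31.

```python
import unicodedata

superscript = "\u2075"   # SUPERSCRIPT FIVE
digit = "\u0035"         # DIGIT FIVE

# Compatibility normalization treats the superscript as a variant of the digit.
folded = unicodedata.normalize("NFKC", superscript)
print(unicodedata.name(superscript), "->", unicodedata.name(folded))

# Canonical normalization (NFC) keeps the two characters distinct.
kept = unicodedata.normalize("NFC", superscript)

# No normalization form touches case; "A" stays "A" even under NFKC.
case_kept = unicodedata.normalize("NFKC", "A")
```

So NFKC already folds the superscript but not the case difference,
which is the asymmetry the quoted reading of UAX#31 points at.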

In some languages, certain presentational variants are used depending
on word position.  I think the ID_START property does exclude letters
that cannot appear in an initial position, but putting a final
character in the middle or vice versa would still be wrong.

If identifiers are only ever typed, I suppose that isn't a problem.
If identifiers are built up in the equivalent of

    handler="do_" + name

then the character will sometimes be wrong in a way that many editors
will either hide or silently "correct."  The standard also says (but I
can't verify) that replacing the presentational variant with the
generic form will generally *improve* presentation, presumably because
there are now more systems which do the font shaping correctly than
there are systems able to handle the old character formats.
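
To make the concern concrete, here is a small sketch.  The stem and
suffix values are hypothetical, and I am assuming NFKC as the folding
that replaces a positional variant with its generic letter.

```python
import unicodedata

# Hypothetical stem that ends in ARABIC LETTER BEH FINAL FORM (U+FE90).
stem = "\u0643\u062a\ufe90"
suffix = "\u0627\u062a"

# After concatenation the "final" form sits in the middle of the word.
built = stem + suffix

# NFKC replaces the positional variant with the generic letter (U+0628),
# leaving the choice of shape to the rendering engine.
generic = unicodedata.normalize("NFKC", built)
print([f"U+{ord(c):04X}" for c in generic])
```

The two strings may well look the same in an editor that reshapes the
text, but they do not compare equal until they are normalized.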

The folding rules do say that it is OK (even good) to exclude certain
characters from certain foldings; I think we could preserve case
(including title-case?) as the only presentational variant we
recognize.

> A scan of the full table for Unicode Version 2.0 (what I have here in
> print) suggests that problematic decompositions actually are
> restricted to only a few scripts.  LATIN (CAPITAL|SMALL)
> LETTER L WITH MIDDLE DOT (used in Catalan, cf sec. 5.1 of
> UAX#31)

As best I understand it, this one would be helped by using
compatibility mappings.  There is an official way to spell l-middle
dot, but enough old texts used the "wrong" character that it has to be
special-cased for round-tripping.  Since the ID is a final
destination, we care less about round-trips, and more about "if they
switch editors, will the identifier still match?"

At the very least, it is mentioned as needing special care (when used
as an identifier) in http://www.unicode.org/reports/tr31/ section 5.1
paragraph 1.
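
A minimal sketch of the "will the identifier still match" question,
assuming one editor emits the precomposed character and another emits
"l" followed by U+00B7 MIDDLE DOT.  The Catalan word is only an example
spelling.

```python
import unicodedata

# "col·lecció" spelled two ways.
precomposed = "co\u0140lecci\u00f3"   # U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT
spelled_out = "col\u00b7lecci\u00f3"  # "l" followed by U+00B7 MIDDLE DOT

def same_under(form, a, b):
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

nfc_match = same_under("NFC", precomposed, spelled_out)
nfkc_match = same_under("NFKC", precomposed, spelled_out)
print("NFC:", nfc_match, "NFKC:", nfkc_match)
```

Under NFC the two spellings stay different identifiers; under NFKC
they collapse to the same one.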

> decompositions, unlike almost all other Latin decompositions (which
> are canonical, and thus get recomposed in NFKC).  'n (Afrikaans), and
> a half-dozen Croatian digraphs corresponding to Serbian Cyrillic would
> get lost.  The Koreans would lose a truckload of partially composed
> Hangul and some archaic ones,

http://www.unicode.org/versions/corrigendum3.html suggests that many
of the Hangul are either pronunciation guide variants or even exact
duplicates (that were presumably missed when the canonicalization was
frozen?)

> the Arabic speakers their presentation forms.

http://www.unicode.org/reports/tr31/ 5.1 paragraph 3 includes:

"""It is recommended that all Arabic presentation forms be excluded
from identifiers in any event, although only a few of them must be
excluded for normalization to guarantee identifier closure."""
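
If we wanted to follow that recommendation, one way to spot the
characters is by their decomposition tag.  This is only a sketch: the
helper name is made up, and I am assuming that the positional tags
(initial, medial, final, isolated) are what identify the presentation
forms in the character database.

```python
import unicodedata

POSITIONAL_TAGS = ("<initial>", "<medial>", "<final>", "<isolated>")

def presentation_forms(identifier):
    """Return the characters whose compatibility decomposition is positional."""
    return [
        ch for ch in identifier
        if unicodedata.decomposition(ch).startswith(POSITIONAL_TAGS)
    ]

# U+FE91 is ARABIC LETTER BEH INITIAL FORM; U+0628 is the generic BEH.
flagged = presentation_forms("do_\ufe91")
clean = presentation_forms("do_\u0628")
print(flagged, clean)
```

A tokenizer could reject an identifier for which this list is
non-empty, rather than silently folding it.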

> And that's about it (but I may have missed a bunch because
> that database doesn't give the character classes, so I guessed for
> stuff like technical symbols -> not ID characters).

Depends on what you mean by technical symbols.  IMHO, many of them are
in fact listed as ID characters.  The math versions (generally 1D400 -
1D7CB) are included.  But
http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
excluding them again.
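
For instance (a sketch using one character from that range, not a
survey of the whole block):

```python
import unicodedata

bold_a = "\U0001d400"   # MATHEMATICAL BOLD CAPITAL A

# The character is classified as an uppercase letter, which is why it
# ends up eligible for identifiers.
category = unicodedata.category(bold_a)

# Its compatibility decomposition is just the plain letter.
folded = unicodedata.normalize("NFKC", bold_a)
print(unicodedata.name(bold_a), category, "->", folded)
```

So under NFKC a bold math "A" and an ordinary "A" would name the same
thing, which may be part of why tr39 suggests leaving them out.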

> However, of the ones I can judge to some extent (Latin printer's
> ligatures, width variants, non-syllabic precomposed Korean Jamo), *not
> one* of the compatibility decompositions would be a loss in my
> opinion.  On the other hand, there are a bunch of cases where NFKC
> would be a marked improvement.
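
For reference, a sketch of what NFKC does to one sample from each of
those three groups.  The samples are my own picks, and I am assuming
the Hangul compatibility letters are what is meant by the Jamo.

```python
import unicodedata

# Each key is an input string; each value is what NFKC should produce.
samples = {
    "\ufb01le": "file",      # LATIN SMALL LIGATURE FI + "le"
    "\uff21\uff22": "AB",    # FULLWIDTH LATIN CAPITAL LETTER A and B
    "\u3131": "\u1100",      # HANGUL LETTER KIYEOK -> HANGUL CHOSEONG KIYEOK
}

results = {s: unicodedata.normalize("NFKC", s) for s in samples}
for source, folded in results.items():
    print(ascii(source), "->", ascii(folded))
```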

-jJ