[Python-3000] PEP 3131 roundup

Jim Jewett jimjjewett at gmail.com
Wed Jun 6 01:18:09 CEST 2007


On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> >    1. Python will lose the ability to make a reliable round trip to
> >       a human-readable display on screen or on paper.

> Correct. Was already the case, though, because of comments
> and string literals.

But these are usually less important; string literals are normally
part of the user interface, and if the user can't see the difference,
it doesn't matter.

There are exceptions, such as the "HELO" magic cookie in the
(externally defined) SMTP protocol, but I think these exceptions are
uncommon -- and outside Python's control anyhow.

> >    5. Languages with non-ASCII identifiers use different
> >  character sets  and normalization schemes; PEP 3131's
> > choices are non-obvious.

> I disagree. PEP 3131 follows UAX#31 literally, and makes that
> decision very clear. If people still cannot see that,

I think "obvious" referred to the reasoning, not the outcome.

I can tell that the decision was "NFC, anything goes", but I don't see why.

(1)
I am not sure why it was NFC; UAX #31 seems agnostic on which
normalization form to use.

The only explicit recommendations I can find suggest using NFKC for
identifiers.  http://www.unicode.org/faq/normalization.html#2

(Outside of that recommendation for KC, it isn't even clear why we
should use the composed form.  Only tonight did I realize that
"composed" means less than I thought, and that the normalization
algorithm makes NFC work just as well as the decomposed forms for
comparing identifiers -- but I had missed that detail the first
several times I read about the different normalization forms, and it
certainly isn't spelled out in the PEP.)
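
For anyone else who missed it, here is a quick illustration using the
stdlib's unicodedata module (my own sketch, not anything from the
PEP): composed and decomposed spellings compare equal after
normalizing to either NFC or NFD, while NFKC goes further and also
folds compatibility characters such as ligatures.

    import unicodedata

    nfc = "\u00F6"         # ö as one precomposed code point
    nfd = "o\u0308"        # o + COMBINING DIAERESIS
    print(nfc == nfd)                                  # False
    print(unicodedata.normalize("NFC", nfd) == nfc)    # True
    print(unicodedata.normalize("NFD", nfc) == nfd)    # True

    # NFKC also folds compatibility characters:
    print(unicodedata.normalize("NFC", "\uFB01"))      # 'ﬁ' (ligature kept)
    print(unicodedata.normalize("NFKC", "\uFB01"))     # 'fi' (two letters)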

(2)
I cannot understand why ID_START/CONTINUE was chosen instead of the
newer and more strongly recommended XID_START/CONTINUE.  From UAX #31,
Section 2:
"""
The XID_Start and XID_Continue properties are improved lexical classes
that incorporate the changes described in Section 5.1, NFKC
Modifications. They are recommended for most purposes, especially for
security, over the original ID_Start and ID_Continue properties.
"""

Nor can I understand why the additional restrictions in
xidmodifications (from TR39) were ignored.  The reason for removing
those characters is given as:
"""
The restricted characters are characters not in common use, removed so
as to further reduce the possibilities for visual confusion.
Initially, the following are being excluded: characters not in modern
use; characters only used in specialized fields, such as liturgical
characters, mathematical letter-like symbols, and certain phonetic
alphabetics; and ideographic characters that are not part of a set of
core CJK ideographs consisting of the CJK Unified Ideographs block
plus IICore (the set of characters defined by the IRG as the minimal
set of required ideographs for East Asian use). A small number of such
characters are allowed back in so that the profile includes all the
characters in the country-specific restricted IDN lists:
"""

As best I can tell, the remaining list is *still* too generous to be
called conservative, but the characters being removed are almost
certainly good choices for removal -- no one's native language
requires them.
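
If those tables were adopted, enforcing them would be cheap.  A rough
sketch, assuming the restricted ranges had already been pulled out of
xidmodifications into sorted (start, end) pairs -- the ranges below
are purely illustrative, not the real data:

    from bisect import bisect_right

    # Illustrative ranges only, standing in for the real restricted set:
    RESTRICTED = [(0x2100, 0x214F),    # letterlike symbols
                  (0x1D400, 0x1D7FF)]  # mathematical alphanumeric symbols
    STARTS = [lo for lo, _hi in RESTRICTED]

    def restricted_chars(ident):
        """Return the characters of ident that fall in a restricted range."""
        bad = []
        for ch in ident:
            cp = ord(ch)
            i = bisect_right(STARTS, cp) - 1
            if i >= 0 and RESTRICTED[i][0] <= cp <= RESTRICTED[i][1]:
                bad.append(ch)
        return bad

    print(restricted_chars("x\u2102"))   # ['ℂ'] -- DOUBLE-STRUCK CAPITAL C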


> > B. Should the default behaviour accept only ASCII identifiers, or
> >    should it accept identifiers containing non-ASCII characters?

> > D. Should the identifier character set be configurable?

> Still seems to be the same open issue.

Defaulting to ASCII or defaulting to "accept Unicode" is one issue.

A related but separate issue is whether accepting Unicode is a single
on/off switch, or whether it will be possible to accept only some
Unicode characters.

As written, there is no good way to accept, say, Japanese characters,
but not Cyrillic.

I would prefer to whitelist individual characters or scripts, but
there should at least be a way to exclude certain characters.
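
As a rough sketch of the sort of whitelist I mean -- using character
name prefixes as a crude stand-in for the real Script property, which
unicodedata does not expose directly -- something like this would let
a project accept ASCII plus Japanese while rejecting Cyrillic:

    import unicodedata

    # Hypothetical policy: ASCII plus the Japanese scripts only.
    ALLOWED_PREFIXES = ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")

    def char_allowed(ch):
        if ord(ch) < 128:
            return True
        return unicodedata.name(ch, "").startswith(ALLOWED_PREFIXES)

    def identifier_allowed(ident):
        return all(char_allowed(c) for c in ident)

    print(identifier_allowed("\u6f22\u5b57"))    # True  (kanji)
    print(identifier_allowed("\u043f\u0440\u0438\u0432\u0435\u0442"))  # False (Cyrillic)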

http://www.unicode.org/reports/tr39/data/intentional.txt

is a list of characters that *should* be impossible to distinguish
visually.  It isn't just that the standard representations happen to
be identical (the way some combining marks look like quotation
marks); it is that the (distinct abstract) characters *should* use
the same glyph, so long as they are in the same (or even harmonized)
fonts.

Several of the Greek and Cyrillic characters are glyph-identical with
ASCII letters.  I won't say that people using those scripts shouldn't
be allowed to use those letters, but *I* certainly don't want to get
code using them just because I allowed the ö.
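
To make the risk concrete (my own toy example, not from the PEP):
U+0430 CYRILLIC SMALL LETTER A renders exactly like ASCII 'a' in most
fonts, yet no normalization form maps one to the other, so two
visually identical names would silently be two different variables.

    import unicodedata

    latin_a = "a"           # U+0061 LATIN SMALL LETTER A
    cyrillic_a = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A
    print(latin_a == cyrillic_a)                                 # False
    print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)  # False
    # So "data" and "d\u0430ta" would be two distinct identifiers
    # that look identical on screen.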

-jJ

