[Python-3000] PEP 3131 roundup

Thu Jun 7 00:22:17 CEST 2007

On 6/6/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > I think "obvious" referred to the reasoning, not the outcome.

> > I can tell that the decision was "NFC, anything goes", but I don't see why.

> I think I'm repeating myself: Because UAX 31 says so. That's it. There
> is a standard that experts in the domain have specified, and PEP 3131
> follows it. Following standards is a good thing, deviating from them
> is a bad thing.

I think we are reading UAX31 very differently.

If it is (or even seems) ambiguous, then we need to specify our interpretation.

> > (2)
> > I cannot understand why ID_START/CONTINUE was chosen instead of the
> > newer and more recommended XID_START/CONTINUE.  From UAX31 section 2:
> > """
> > The XID_Start and XID_Continue properties are improved lexical classes
> > that incorporate the changes described in Section 5.1, NFKC
> > Modifications. They are recommended for most purposes, especially for
> > security, over the original ID_Start and ID_Continue properties.
> > """

> Right. I read it that these should be used when 5.1 is considered
> in the language. This, in turn, should be used when the
> normalization form is NFKC:

I read that as:

XID is almost always better.  XID is better for security in
particular, but also better for other things.  And as an extra bonus,
XID even already takes care of some 5.1 issues for you.

And my personal opinion is that those 5.1 issues are not really
restricted to NFKC.  Other normalization forms won't get syntactic
errors over them, but the results could still be nonsense.

Issue 1 is that Catalan treats U+00B7 (MIDDLE DOT, 0xB7 in Latin-1) as
a character instead of as punctuation.  The Unicode recommendation
(*required* only for NFKC, but already supported by XID, since it is
recommended) says "OK, it isn't syntax or whitespace, and it is a
character sometimes in practice, so we'll allow it."
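
To make issue 1 concrete, here is a minimal sketch using the standard
unicodedata module.  It assumes an interpreter whose str.isidentifier
follows XID_Start/XID_Continue (as current CPython documents); the
names MIDDLE_DOT, describe and word are only illustrative.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"

def describe(ch):
    """Return the Unicode name and general category of one character."""
    return (unicodedata.name(ch), unicodedata.category(ch))

# Catalan spelling of "col·legi", built with an explicit escape.
word = "col" + MIDDLE_DOT + "legi"

# The general category says "punctuation" ...
print(describe(MIDDLE_DOT))
# ... NFKC leaves the word untouched ...
print(unicodedata.normalize("NFKC", word) == word)
# ... and (assumption: XID-based isidentifier) the dot is accepted
# inside an identifier, but not as its first character.
print(word.isidentifier(), MIDDLE_DOT.isidentifier())
```

So the dot is classed as punctuation by its general category, yet
still allowed as a continuation character.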

Issue 2 says "Technically these are characters, but they should never
be used to start a word, so don't start an identifier with them
anyhow."  If you're not using NFKC, you *can* just ignore the problem
(and produce garbage), but you probably shouldn't.  XID takes care of
it for you.  (At least for these characters.)
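
A rough illustration of why issue 2 matters even if you never run
NFKC yourself.  The four code points below are the ones I understand
UAX 31 section 5.1 to list; the helper name is made up for this
sketch.  Each is a letter by general category, but its NFKC form
begins with a combining mark, which has nothing to attach to at the
start of an identifier.

```python
import unicodedata

# Assumption: these are the four characters UAX 31 section 5.1 allows
# in continue position but not in start position.
CANDIDATES = ["\u0e33", "\u0eb3", "\uff9e", "\uff9f"]

def nfkc_starts_with_mark(ch):
    """Return True if the NFKC form of ch begins with a combining mark."""
    normalized = unicodedata.normalize("NFKC", ch)
    return unicodedata.category(normalized[0]).startswith("M")

for ch in CANDIDATES:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        nfkc_starts_with_mark(ch),
    )
```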

Issue 3 says "OK, these characters don't work with NFKC -- but you
shouldn't be using them anyhow."  It even says explicitly that

    "It is recommended that all Arabic presentation
    forms be excluded from identifiers in any event"

Note that neither ID nor XID actually removes all the Arabic
presentation forms, despite this clear recommendation.  Technically,
they are characters, and *could* be processed.  XID removes the ones
that break NFKC, and xidmodifications removes some more (hopefully,
all the rest, but I haven't verified that).
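
For issue 3, a small sketch of what "don't work with NFKC" looks like
in practice.  The sample code points are my own picks, not a complete
list, and the helper name is hypothetical.  Some presentation forms
decompose to a sequence that starts with a space, so the normalized
text can no longer be a single identifier; others merely fold to the
ordinary letter.

```python
import unicodedata

def nfkc_adds_whitespace(ch):
    """Return True if the NFKC form of ch contains whitespace."""
    return any(c.isspace() for c in unicodedata.normalize("NFKC", ch))

# Illustrative samples only (assumption: chosen by hand for this sketch).
SAMPLES = [
    "\ufe70",  # ARABIC FATHATAN ISOLATED FORM
    "\u037a",  # GREEK YPOGEGRAMMENI
    "\ufe8d",  # ARABIC LETTER ALEF ISOLATED FORM
]

for ch in SAMPLES:
    folded = unicodedata.normalize("NFKC", ch)
    code_points = [f"U+{ord(c):04X}" for c in folded]
    print(f"U+{ord(ch):04X}", code_points, nfkc_adds_whitespace(ch))
```

The third sample is the harmless kind: it survives normalization, but
only by turning into a different (ordinary) character.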

> """
> Where programming languages are using NFKC to fold differences between
> characters, they need the following modifications of the identifier
> syntax from the Unicode Standard to deal with the idiosyncrasies of a
> small number of characters. These modifications are reflected in the
> XID_Start and XID_Continue properties.
> """

> As the PEP does not use NFKC (currently), it should not use XID_Start
> and XID_Continue either.

I read that as "If you are using NFKC, then you need to do some extra
work.  But notice that if you are using the new and improved XID, then
some of this work was already done for you..."

> > Nor can I understand why the additional restrictions in
> > xidmodifications (from TR39) were ignored.

> Consideration of UTR 39 is listed as an open issue. One problem
> with it is that using it would restrict the language over time,
> so that previously correct programs might not be correct anymore
> in a future version. So using it might break backwards
> compatibility.

Then we should start with a more restricted charset and expand it over
time; widening the set never invalidates a program that was already
correct.

The restrictions in xidmodifications are not remotely sufficient for
security, even now.  (Making them sufficient would require restricting
some characters that are actually needed in some languages.)

Instead, xidmodifications represents (a mechanically determined subset
of) characters that can be removed cheaply, because they shouldn't be
used in identifiers anyhow.

-jJ