[Python-3000] Support for PEP 3131

Thu May 24 00:35:52 CEST 2007

On Wed, 23 May 2007, Stephen J. Turnbull wrote:
>  > It means users could see the usability benefits of PEP3131, but the
>  > python internals could still work with ASCII only.
>
> But this reasoning is not coherent.  Python internals will have no
> problems with non-ASCII; in fact, they would have no problems with
> tokens containing Cf characters or even reserved code points.  Just
> give an unambiguous grammar for tokens composed of code points.  It's
> only when a human enters the loop (ie, presentation of the identifier
> on an output stream) that they cause problems.

You've got this backwards, and I suspect that's part of the root of
the disagreement.  It's not that "when humans enter the loop they
cause problems."  The purpose of the language is to *serve humans*.
Without humans, we would just use machine code instead of Python.
If it doesn't work for humans, that's not because the humans are
broken; it's because the language is broken.

The grammar has to be something a human can understand.

(And if 90%, or more than 50%, of the tools are "broken" with respect
to the language, that's a language problem, not just a tool problem.)

> I propose it would be useful to provide a standard mechanism for
> auditing the input stream.  There would be one implementation for the
> stdlib that complains[1] about non-ASCII characters and possibly
> non-English words, and IMO that should be the default

This should be built into the Python interpreter and on by default,
unless it is turned off by a command-line switch that says "I want to
allow the full set of Unicode identifier characters in identifiers."
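
To make the default check concrete, here is a minimal sketch of an
ASCII-only audit done at the library level with the standard tokenize
module.  It is not the interpreter switch proposed above, and the
function name non_ascii_identifiers is made up for illustration.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (name, line_number) pairs for NAME tokens that contain
    any non-ASCII character.  Hypothetical helper, for illustration only."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # NAME covers both identifiers and keywords; keywords are ASCII,
        # so only identifiers can trip this test.
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            found.append((tok.string, tok.start[0]))
    return found
```

A tool built on this could refuse to load a module whenever the
returned list is non-empty, unless the user has explicitly opted in.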

> A second one should provide a very conservative Unicode set, with
> provision for amendment as experience shows restriction to be
> desirable or extension to be safe.

If we are going to allow Unicode identifiers at all, then I would
recommend only allowing identifiers that are already normalized
(in NFC).  If this recommendation is rejected, then I propose that
the second-level mode that Stephen suggests here only allow
normalized identifiers.
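
As a rough sketch of what "already normalized" means in practice, the
check below uses unicodedata.normalize from the standard library.  The
function name is_nfc is an assumption for illustration; nothing here
is part of PEP 3131 itself.

```python
import unicodedata


def is_nfc(identifier):
    """True if the identifier is unchanged by NFC normalization,
    i.e. it is already in NFC.  Hypothetical helper name."""
    return unicodedata.normalize("NFC", identifier) == identifier


# Two spellings that render identically but differ in code points:
precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # plain 'e' followed by a combining acute accent
```

Under this rule the precomposed spelling is accepted and the decomposed
one is rejected outright rather than silently converted, so the
identifier a human sees on the screen corresponds to exactly one
sequence of code points.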

In summary, my preference ordering of the possibilities would be:

    1.  Identifiers remain ASCII-only.

    2.  "python" allows only ASCII identifiers.  "python -U" allows
        Unicode identifiers that are in NFC and use a conservative,
        *fixed* subset of the available characters.  Support for
        "-U" is a compile-time option, preferably not compiled into
        official binary releases of Python.

    3.  "python" and "python -U" are as above.  "python -UU" allows
        all Unicode identifier characters (which may grow over time
        as the Unicode standard changes).  Support for "-UU" is a
        compile-time option, never on in official binary releases of
        Python, and discouraged with "here be dragons" warnings, etc.

The ideas that I'm in favour of include:

    (a) Require identifiers to be in ASCII.

    (b) Require a compile-time option to enable non-ASCII identifiers.

    (c) Require a command-line flag to enable non-ASCII identifiers.

    (d) Require identifiers to be in NFC.

    (e) Use a character set that is fixed over time.

-- ?!ng