[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Wed May 23 18:26:55 CEST 2007


On 5/23/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim Jewett writes:

>  > It simplifies checking for identifiers that *don't* stick to ASCII,

> Only if you assume that people will actually perceive the 10-character
> string "L\u00F6wis" as an identifier, regardless of the fact that any
> programmable editor can be trained to display the 5-character string
> "Löwis" in a very small amount of code.  Conversely, any programmable
> editor can easily be trained to take the internal representation
> "Löwis" and display it as "L\u00F6wis", giving all the benefits of the
> representation you propose.  But who would ever enable it?

I would.

I would like an alert (and possibly an import exception) on any code
whose *executable portion* is not entirely in ASCII.

Comments aren't a problem, unless they somehow erase or hide other
characters or line breaks.  Strings aren't a problem unless I evaluate
them.  Code, though ... there I want to know about any non-ASCII at all.
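
As a rough sketch of the audit I have in mind (the reporting hook is
hypothetical; this just uses the stdlib tokenize module), walk the
token stream and flag non-ASCII anywhere outside comments and strings:

    import io
    import tokenize

    def audit_ascii(source):
        """Report non-ASCII text in the executable portion of `source`,
        skipping comments and string literals (a sketch, not a policy)."""
        findings = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type in (tokenize.COMMENT, tokenize.STRING):
                continue
            if not tok.string.isascii():
                findings.append((tok.start, tok.string))
        return findings

    print(audit_ascii("# L\u00f6wis is fine here\nL\u00f6wis = 1\n"))
    # [((2, 0), 'Löwis')] -- the identifier is flagged, the comment is not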

Even Latin-1 isn't much of a problem, except for quote look-alikes.  I
do want to know whether 'abc' is a string or a single identifier built
with the "prime" letter.

Such a character might arrive through an innocent cut-and-paste error
(and how else would most people enter non-native characters?), but it
is still a problem -- and Python would often silently create a new
variable instead of warning me.
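
Concretely, U+02B9 (MODIFIER LETTER PRIME) counts as a letter under
identifier rules, so text that renders much like a quoted string can in
fact be one brand-new name.  A quick check in today's Python:

    import unicodedata

    s = "\u02b9abc\u02b9"              # renders much like 'abc'
    print(unicodedata.name("\u02b9"))  # MODIFIER LETTER PRIME
    print(s.isidentifier())            # True -- one identifier, not a string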

> The only issues PEP 3131 should be concerned with *defining*
> are those that cause problems with canonicalization, and the range of
> characters and languages allowed in the standard library.

Fair enough -- but the problem is that this isn't a settled issue yet;
the Unicode Consortium itself makes several contradictory
recommendations.

I can come up with rules that are probably just about right, but I
will make mistakes (just as the Unicode Consortium itself did, which is
why it defines both the ID and XID properties, and why both include
characters kept only for stability).  Even having read their reports,
my initial rules would still have banned mixed-script identifiers,
which would have prevented your edict example.
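
For what it's worth, even a crude mixed-script detector is easy to
sketch.  The stdlib doesn't expose the Unicode Script property, so this
purely illustrative heuristic keys off character names:

    import unicodedata

    def scripts(ident):
        """Approximate the scripts in `ident` from character names."""
        return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in ident}

    print(scripts("L\u00f6wis"))  # {'LATIN'}
    print(scripts("p\u0430ge"))   # {'CYRILLIC', 'LATIN'} -- that 'a' is U+0430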

So I'll agree that defining the character sets, the allowed
combinations, and the canonicalization is the right scope; I just feel
that best practice isn't yet clear enough.
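
On the canonicalization side, PEP 3131 specifies that identifiers are
converted to normal form NFKC while parsing, which already collapses
some visually distinct spellings into a single name:

    import unicodedata

    lig = "\ufb03"   # LATIN SMALL LIGATURE FFI, displayed as one glyph
    print(unicodedata.normalize("NFKC", lig))           # ffi
    print(unicodedata.normalize("NFKC", lig) == "ffi")  # True
    # an identifier typed with the ligature therefore names the same
    # variable as one typed with plain ASCII 'ffi'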

> I propose it would be useful to provide a standard mechanism for
> auditing the input stream.  There would be one implementation for the
> stdlib that complains[1] about non-ASCII characters and possibly
> non-English words, and IMO that should be the default (for the reasons
> Ka-Ping gives for opposing the whole PEP).  A second one should
> provide a very conservative Unicode set, with provision for amendment
> as experience shows restriction to be desirable or extension to be
> safe.  A third, allowing any character that can be canonicalized into
> the form that PEP 3131 allows internally, is left as an exercise for
> the reader wild 'n' crazy enough to want to use it.

This might address my concerns, though it is a bit more complicated
than the current plans.
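
If it helps, the three auditors could be as simple as pluggable
predicates over identifiers -- a sketch with hypothetical names and a
stand-in "conservative" set:

    def stdlib_policy(name):
        """Tier 1: complain about anything outside ASCII."""
        return name.isascii()

    def conservative_policy(name):
        """Tier 2: a deliberately small Unicode set, to be amended with
        experience; Latin-1 stands in for that set here."""
        return name.isidentifier() and all(ord(ch) < 0x100 for ch in name)

    def permissive_policy(name):
        """Tier 3: anything PEP 3131 itself admits."""
        return name.isidentifier()

    def audit(names, policy=stdlib_policy):
        return [n for n in names if not policy(n)]

    print(audit(["data", "L\u00f6wis"]))                     # ['Löwis']
    print(audit(["data", "L\u00f6wis"], permissive_policy))  # []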

-jJ

