[Python-3000] Support for PEP 3131

Fri May 25 06:36:12 CEST 2007

"Guido van Rossum" <guido at python.org> wrote:
> On 5/24/07, Ka-Ping Yee <python at zesty.ca> wrote:
> > To pit this as "ascii lovers vs. non-ascii lovers" is a pretty large
> > oversimplification.  You could name them "people who want to be able
> > to know what the code says" and "people who don't mind not being able
> > to know what the code says".  Or you could name them "people who want
> > Python's lexical syntax to be something they fully understand" and
> > "people who don't mind the extra complexity".  Or "people who don't
> > want Python's lexical syntax to be tied to a changing external
> > standard" and "people who don't mind the extra variability."
> >
> > However you characterize them, keep in mind that those in the former
> > group are asking for default behaviour that 100% of Python users
> > already use and understand.  There's no cost to keeping identifiers
> > ASCII-only because that's what Python already does.
> >
> > I think that's a pretty strong reason for making the new, more complex
> > behaviour optional.
> 
> If there's a security argument to be made for restricting the alphabet
> used by code contributions (even by co-workers at the same company), I
> don't see why ASCII-only projects should have it easier than projects
> in other cultures.

For the sake of argument, pretend that we went with a command line
option to enable certain character sets.  In my opinion, there should be
a default character set that is allowed.  The only character set that
makes sense as a default, ignoring previously-existing environment
variables (which don't necessarily help us), is ascii.

Why?  Primarily because ascii identifiers are what are allowed today,
and have been allowed for 15 years.  But there is this secondary data
point that Stephen Turnbull brought up; 95% of users (of Emacs) never
touch non-ascii code.  Poor extrapolation of statistics aside, to make
the default be something that does not help 95% of users seems a
bit... overenthusiastic.  Where else in Python have we made the default
behavior only desired or useful to 5% of our users?

With that said, and with what Stephen and others have said about unicode
in Java, I don't believe there will be terribly significant
cross-pollination of non-ascii identifier source.  Of the source that
*does* become popular and has non-ascii identifiers, I don't believe that
it would take much time before there are normalized versions of the
source, either published by the original authors or created by users.
(Having a tool to do unicode -> ascii transliteration of identifiers
would make this a non-issue.)
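
As a rough illustration of what such a transliteration tool might do to
a single identifier, here is a minimal sketch.  The function name
ascii_identifier and the "_uXXXX" fallback spelling are assumptions made
up for this example; a real tool would also have to deal with name
collisions (two different identifiers can map to the same ascii form).

```python
import unicodedata

def ascii_identifier(name):
    """Return an ASCII-only stand-in for `name` (hypothetical helper)."""
    pieces = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", name):
        if ord(ch) < 128:
            pieces.append(ch)                  # plain ascii passes through
        elif unicodedata.combining(ch):
            continue                           # drop accents and other marks
        else:
            # No ascii base letter available: spell out the code point.
            pieces.append("_u%04x" % ord(ch))
    return "".join(pieces)

print(ascii_identifier("café"))
```

Stripping accents is lossy, so this only suits producing a readable ascii
copy of the source, not a round trip back to the original.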

Though others don't like it, I think that having a command line option
to enable other character sets is a reasonable burden to place on the 5%
of users who will encounter non-ascii identifiers.  For those who work
with it on a regular basis, having an environment variable should be
sufficient (with command line arguments to add additional allowable
character sets).  For those who wish to import code at runtime and/or
have arbitrary identifiers, having an interface for adding or removing
allowable character sets for code imported during runtime should work
reasonably well (both for people who want to allow arbitrary identifiers,
and those who want to restrict identifiers after the runtime system is
up).
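
To make the runtime interface concrete, here is a minimal sketch of one
possible shape for it.  Everything in it is hypothetical: the names
allow_charset, disallow_charset and check_identifier, the registry, and
the rule that ascii cannot be removed are assumptions for illustration,
not anything Python provides.

```python
import string

# Hypothetical registry of named character sets; ascii is the default.
_allowed_sets = {
    "ascii": frozenset(string.ascii_letters + string.digits + "_"),
}

def allow_charset(name, characters):
    """Permit the given characters in identifiers of code imported later."""
    _allowed_sets[name] = frozenset(characters)

def disallow_charset(name):
    """Stop permitting a previously added character set."""
    if name == "ascii":
        raise ValueError("the default ascii set cannot be removed")
    _allowed_sets.pop(name, None)

def check_identifier(identifier):
    """Raise ImportError if the identifier uses a disallowed character."""
    allowed = frozenset().union(*_allowed_sets.values())
    for character in identifier:
        if character not in allowed:
            raise ImportError(
                "character %r not allowed in identifier %r"
                % (character, identifier))

# Usage: allow a few extra letters, import something, then restrict again.
allow_charset("latin_accents", "éèêàç")
check_identifier("café")
disallow_charset("latin_accents")
```

An environment variable or command line argument could seed the same
registry at startup; that part is left out of the sketch.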

As for the speed issues that Guillaume has brought up, they are a
non-issue.  Every identifier in a pyc file is already interned when the
file is loaded, so the extra time to verify identifiers at that point is
insignificant, especially when in Python one can do...

    for identifier in identifiers:
        for character in identifier:
            if character not in allowable_characters:
                raise ImportError("...")

And considering we can do *millions* of dictionary/set lookups each
second on a modern machine, I can't imagine that identifier verification
time will be a significant burden.
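
For anyone who wants to measure this on their own machine, here is a
minimal, self-contained sketch of the loop above with a timing harness
around it.  The identifier list is synthetic and the ascii-only
allowable_characters set is an assumption for the example; the numbers
will vary by machine, so none are quoted here.

```python
import string
import timeit

# Assumed default: ascii letters, digits and underscore.
allowable_characters = frozenset(string.ascii_letters + string.digits + "_")

def verify(identifiers):
    """Raise ImportError on the first disallowed character found."""
    for identifier in identifiers:
        for character in identifier:
            if character not in allowable_characters:
                raise ImportError(
                    "character %r not allowed in identifier %r"
                    % (character, identifier))

# Synthetic stand-in for the identifiers found in a large pyc file.
identifiers = ["identifier_%d" % i for i in range(10000)]

runs = 10
seconds = timeit.timeit(lambda: verify(identifiers), number=runs)
lookups = runs * sum(len(name) for name in identifiers)
print("%d set lookups in %.3f seconds" % (lookups, seconds))
```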

 - Josiah