[Python-3000] Support for PEP 3131

Stephen J. Turnbull stephen at xemacs.org
Thu May 24 13:17:57 CEST 2007


Ka-Ping Yee writes:

 > On Wed, 23 May 2007, Stephen J. Turnbull wrote:

 > >  > It means users could see the usability benefits of PEP3131, but the
 > >  > python internals could still work with ASCII only.

 > > But this reasoning is not coherent.  Python internals will have no
 > > problems with non-ASCII; in fact, they would have no problems with
 > > tokens containing Cf characters or even reserved code points.  Just
 > > give an unambiguous grammar for tokens composed of code points.  It's
 > > only when a human enters the loop (ie, presentation of the identifier
 > > on an output stream) that they cause problems.
 > 
 > You've got this backwards, and I suspect that's part of the root of
 > the disagreement.  It's not that "when humans enter the loop they
 > cause problems."  The purpose of the language is to *serve humans*.

Of course!  "Incoherent" refers *only* to "python internals".  We need
to look at the parts of the loop where the humans are.

N.B. I take offense at your misquote.  *Humans do not cause problems.*
It is *non-ASCII tokens* that *cause* the (putative) problem.  However,
the alleged problems only arise when humans are present.

 > The grammar has to be something a human can understand.

There are an infinite number of ASCII-only Python tokens.  Whether
those tokens are lexically composed of a small fixed finite alphabet
vs. a large extensible finite alphabet doesn't change anything in
terms of understanding the *grammar*.

The character-identity problem is vastly aggravated (created, if you
insist) by large numbers of characters, but that is something
separate.  I don't understand why you conflate lexical issues with the
still-fits-in-*my*-pin-head simplicity of the Python grammar.  Am I
missing something?

 > (And if 90%, or more than 50%, of the tools are "broken" with respect
 > to the language, that's a language problem, not just a tool problem.)

It's a *problem* for the tools, because they may become obsolete,
depending on how expensive it is to add support for the new language
constructs.  It is an *issue* for the language, *not* a "problem"
in the same sense.  The language designer must balance the problems
faced by the tools, and the cost of upgrading them---including users'
switching costs!---against the benefits of the new language feature.
Nothing new here.

The question is how expensive the upgrade will be, and what the
benefits are.  My experience suggests that the cost is negligible *because
most users won't use non-ASCII identifiers*, and they'll just stick
with their ASCII-only tools.  The benefits are speculative; I know
that my students love the idea of a programming language that doesn't
look like English (which has extremely painful associations for most).

And there are cases (Dutch tax law, Japanese morphology) where having
a judicious selection of non-ASCII identifiers is very convenient.
Specifically, from my own experience, if I don't know what a
particular function in edict is supposed to do, I just ask the nearest
Japanese speaker.  And they tell me, "oh, that parses the INFLECTION-TYPE of
PART-OF-SPEECH", and when I look blank, they continue, "you know, the
'-masu' in 'gozaimasu'".  Now, since there is no exact equivalent to
"-masu" in English (or any European language AFAIK), it would be
impossible to give a precise self-documenting name in ASCII.  Sure,
you can work around this -- but why not put down the ASCII hammer and
save on all that ibuprofen?
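
Purely illustrative, and not anything from the PEP: the function and
parameter names below are invented, but they show how an identifier
can carry a term that has no precise English equivalent and so
documents itself for readers of that language.

    # Hypothetical example -- names invented for illustration only.
    def 活用形を解析(語):
        """Parse the 活用形 (inflection type) of 語 (a word), e.g.
        the '-masu' in 'gozaimasu'."""
        ...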

 > > I propose it would be useful to provide a standard mechanism for
 > > auditing the input stream.  There would be one implementation for the
 > > stdlib that complains[1] about non-ASCII characters and possibly
 > > non-English words, and IMO that should be the default
 > 
 > This should be built in to the Python interpreter and on by default,
 > unless it is turned off by a command-line switch that says "I want to
 > allow the full set of Unicode identifier characters in identifiers."

I'd make it more tedious and more flexible to relax the restriction,
actually.  "python" gives you the stdlib, ASCII-only restriction.
"python -U TABLE" takes a mandatory argument, which is the table of
allowed characters.  If you want to rule out "stupid file substitution
tricks", TABLE could take the special arguments "stdlib" and "stduni"
which refer to built-in tables.  But people really should be able to
restrict to "Japanese joyo kanji, kana, and ASCII only" or "IBM
Japanese only" as local standards demand, so -U should also be able to
take a file name, or a module name, or something like that.
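
To make the shape of that concrete, here is a rough sketch of what a
table-driven audit might look like.  Nothing like this exists in
CPython today; the -U option is only the proposal above, and the
audit() helper and ASCII_TABLE below are invented for illustration.

    # Hypothetical sketch: flag identifiers whose characters fall
    # outside an allowed table (here, plain ASCII identifier characters).
    import io
    import string
    import tokenize

    ASCII_TABLE = set(string.ascii_letters + string.digits + "_")

    def audit(source, table=ASCII_TABLE):
        """Yield (line, name) for identifiers outside the table."""
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not set(tok.string) <= table:
                yield tok.start[0], tok.string

    for lineno, name in audit("naïve_total = grand_total + 1\n"):
        print("line %d: identifier %r is outside the table" % (lineno, name))

A real mechanism would read the table from the file or module named by
-U rather than hard-coding it, but the checking step would be no more
involved than this.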

 > If we are going to allow Unicode identifiers at all, then I would
 > recommend only allowing identifiers that are already normalized
 > (in NFC).

Already in the PEP.
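
For anyone who hasn't met NFC, a two-line illustration of what that
normalization buys, using only the stdlib unicodedata module: a
decomposed and a precomposed "é" are different code point sequences,
but compare equal after NFC, so the two keyboard spellings name the
same identifier.

    import unicodedata

    decomposed  = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    precomposed = "caf\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
    assert decomposed != precomposed
    assert unicodedata.normalize("NFC", decomposed) == precomposed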

 > The ideas that I'm in favour of include:
 > 
 >     (e) Use a character set that is fixed over time.

The BASIC that I learned first only had 26 user identifiers.  Maybe
that's the way we should go?<duck />

