[Python-ideas] Non-ASCII in Python syntax? [was: Null coalescing operator]

Sun Oct 30 03:00:33 EDT 2016

Paul Moore writes:
 > On 29 October 2016 at 18:19, Stephen J. Turnbull
 > <turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:
 > >> For better or worse, it may be emoji that drive that change ;-)
 > >
 > > I suspect that the 100 million or so Chinese, Japanese, Korean, and
 > > Indian programmers who have had systems that have no trouble
 > > whatsoever handling non-ASCII for as long they've used computers will
 > > drive that change.
 > 
 > My apologies. You are of course absolutely right.

tl;dr: A quick apology for the snark, and an attempt at FUD reduction.
Using non-ASCII characters will involve some cost, but there are real
benefits, and the fear and loathing often evoked by the prospect is
unnecessary.  I'm not ready to advocate introduction *right* now, but
"never" isn't acceptable either. :-)

On with the show:

"Absolutely" is more than I deserve, as I was being a bit snarky.

That said, Ed Yourdon wrote a book in 1990 or so with the
self-promoting title of "Decline and Fall of the American
Programmer"[1] in which he argued that for many kinds of software
outsourcing to China, India, or Ireland got you faster, better,
cheaper, and internationalized, with no tradeoffs.  (The "and
internationalized" is my hobby horse; it wasn't part of Yourdon's
thesis.)  He later recanted the extremist doomsaying, but a quick
review of the fraction of H1B visas granted to Asian-origin
programmers should convince you that USA/EUR/ANZ doesn't have a
monopoly of good-to-great programming (probably never did, but that's
a topic for a different thread).  Also note that in Japan, grouping
developers only by the language they use most frequently and not
controlling for other factors, Python programmers are the highest paid
among languages with more than 1% of the sample (and yes, that
includes COBOL!).  To the extent that internationalization matters to a
particular kind of programming, these programmers are better placed
for those jobs, I think.  And while in many cases "on site" has a big
advantage (so you can't telecommute from Bangalore, you need that H1B
which is available only in rather restricted numbers), more and more
outsourcing does cross oceans, so potential competition is immense.

There is a benefit to increasing our internationalization in backward-
incompatible ways.  And that benefit is increasing both in magnitude
and in the number of Python developers who will receive it.

 > I'm curious to know how easy it is for Chinese, Japanese, Korean and
 > Indian programmers to use *ASCII* characters. I have no idea in
 > practice whether the current basically entirely-ASCII nature of
 > programming languages is as much a problem for them

Characters are zero problem for them.  The East Asian national
standards all include the ASCII repertoire, and some device (usually
based on ISO 2022 coding extensions rather than UTF-8) for allowing
ASCII to be one-byte, even if the "local" characters require two or
more bytes.  I forget if India's original national standard also
included an ASCII subset, but they switched over to Unicode quite
early[2], so UTF-8 does the trick for them.  English (the language) is a
much bigger issue.
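
To make the "ASCII stays one byte" point concrete, here is a minimal
sketch using the codecs that ship with CPython.  The sample source
line, the codec list, and the helper names are my own choices for
illustration; they are not taken from the national standards.

```python
# Sketch only: the codec names are Python's standard ones, but the
# sample text and helper names are made up for illustration.
ASCII_SOURCE = "def spam(x): return x + 1"

LEGACY_CODECS = ["euc_jp", "shift_jis", "iso2022_jp",
                 "gb2312", "euc_kr", "big5"]


def ascii_is_unchanged(codec):
    """True if ASCII-only text encodes to the same bytes as plain ASCII."""
    return ASCII_SOURCE.encode(codec) == ASCII_SOURCE.encode("ascii")


def bytes_per_char(char, codec):
    """Number of bytes one character occupies in the given codec."""
    return len(char.encode(codec))


for codec in LEGACY_CODECS:
    print(codec, ascii_is_unchanged(codec))

# A "local" character, by contrast, changes size with the encoding.
print(bytes_per_char("日", "euc_jp"), bytes_per_char("日", "utf-8"))
```

Only the non-ASCII characters change size between encodings: 日 takes
two bytes in EUC-JP and three in UTF-8, while the ASCII sample is
byte-for-byte identical in every codec listed.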

Most Indians, of course, have little trouble with the derived-from-
English nature of much programming syntax and library identifiers, and
the Asians all get enough training in both (very) basic English and
rote memorization that handling English-derived syntax and library
nomenclature is not a problem.

However, reading and especially creating documentation can be
expensive and inaccurate.  At least in Japanese, "straightforward"
translations are often poor, as nuances are lost.  E.g., a literal
Japanese translation from English requires many words to indicate the
differences a simple "a" vs. "the" vs. "some" indicates in English.
Mostly such nuances can be expressed economically by restructuring a
whole paragraph, but translators rarely bother and often seem unaware
of the issues.  Many Japanese programmers' use of articles is literally
chaotic: it's deterministic but appears random to all but the most
careful analysis.[3]

 > as I imagine Unicode characters would be for me. I really hope it
 > isn't...

I think your imagination is running away with you.  While I understand
how costly it is for those over the age of 12 to develop new habits
(I'm 58, and painfully aware of how frequently I balk at learning
anything new no matter how productivity-enhancing it is likely to be,
and how much more slowly it becomes part of my repertoire), the number
of new things you would need to learn would be few, and those few would
be used frequently enough, at least in Python.  It's hard enough to get
Guido (and the other Masters of Pythonic Language Design) to sign on to
new ASCII syntax; even if in principle non-ASCII were to be admitted, I
suspect the barrier there would be even higher.

Most of Unicode is irrelevant to everybody.  Mathematicians use only a
small fraction of the math notation available to them -- it's just
that it's a different small fraction for each field.  The East Asians
need a big chunk (I would guess that educated Chinese and Japanese
encounter about 10,000 characters in "daily life" over a lifetime,
while those encountered at least once a week number about 3000), but
those that need to be memorized are a small minority (less than 5%) of
the already defined Unicode repertoire.

For Western programmers, the mechanics are almost certainly there.
Every personal computer should have at least one font containing all
characters defined in the Basic Multilingual Plane, and most will have
chunks of the astral planes (emoji, rare math symbols, country flags,
...).  Even the Happy Hacker keyboard has enough mode keys (shift,
control, ...) to allow defining "3-finger salutes" for commonly-used
characters not on the keycaps -- in daily life if you don't need an
input method now, you won't need one if Python decides to use WHITE
SQUARE to represent an operation you frequently use -- just an extra
"control key combo" like the editing control keys (eg, for copy, cut,
paste, undo) that aren't marked on any keyboard I have.
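
For anyone unsure which side of the BMP/astral divide a character
falls on, a small sketch with the standard unicodedata module will
tell you.  The plane() helper and the choice of GRINNING FACE as the
emoji example are mine, purely for illustration.

```python
import unicodedata


def plane(char):
    """Unicode plane number: 0 is the BMP, 1 and up are the astral planes."""
    return ord(char) >> 16


# Look characters up by their official Unicode names.
white_square = unicodedata.lookup("WHITE SQUARE")
grinning = unicodedata.lookup("GRINNING FACE")

for ch in (white_square, grinning):
    print("U+%04X" % ord(ch), unicodedata.name(ch), "plane", plane(ch))
```

WHITE SQUARE is U+25A1, comfortably inside the BMP, while the emoji
lives in plane 1.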

I'm *not* advocating *imposing* the necessary effort on anyone right
now.  I just want to reduce the FUD associated with the prospect that
it *might* be imposed on *you*, so that you can evaluate the benefits
in light of the real costs.  They're not zero, but they're unlikely to
ruin your whole day, every day, for months.[4]

"Although never is often better than *right* now" doesn't apply
here. :-)

Footnotes: 
[1]  "The American Programmer" was the name of Yourdon's consultancy's
newsletter to managers of software projects and software development
organizations.

[2]  India is a multiscript country, so faces the same pressure for a
single, internationally accepted character set as the whole world
does, albeit at a lower level.

[3]  Of course the opposite is true when I write Japanese.  In
particular, there's a syntactic component called "particle" (the
closest English equivalent is "preposition", but particles have much
more general roles), and I'm sure my usage of it is equally chaotic from the
point of view of a native speaker of Japanese -- even after working in
the language for 25 years!  N.B. I'm good enough at the language to
have written grant proposals that were accepted in it -- and still my
usage of particles is unreliable.

[4]  Well, if your role involves teaching other programmers, their
pushback could be a long-lasting irritant. :-(