[Python-3000] PEP 3131 accepted
Stephen J. Turnbull
stephen at xemacs.org
Wed May 23 13:07:57 CEST 2007
Josiah Carlson writes:
> From identical character glyph issues (which have been discussed
> off and on for at least a year),
In my experience, this is not a show-stopping problem. Emacs/MULE has
had it for 20 years because of the (horrible) design decision to
attach charset information to each character in the representation of
text. Thus, MULE distinguishes between NO-BREAK SPACE and NO-BREAK
SPACE (the same!) depending on whether the containing text "is" ISO
8859-15 or "is" ISO 8859-1. (Semantically this is different from the
identical glyph, different character problem, since according to ISO
8859 those characters are identical. However, as a practical matter,
the problem of detecting and dealing with the situation is the same,
since in MULE the character codes are different.)
How does Emacs deal with this? Simple. We provide facilities to
identify identical characters (not relevant to PEP 3131, probably), to
highlight suspicious characters (proposed, not actually implemented
AFAIK, since identification does what almost all users want), and to
provide information on characters in the editing buffer. The
remaining problems with coding confusion are due to deficient
implementation (mea maxima culpa).
I consider this to be an editor/presentation problem, not a language
design problem.
Note that Ka-Ping's worry about the infinite extensibility of Unicode
relative to any human being's capacity is technically not a problem.
You simply have your editor substitute machine-generated identifiers
for each identifier that contains characters outside of the user's
preferred set (eg, using hex codes to restrict to ASCII), then review
the code. When you discover what an identifier's semantics are, you
give it a mnemonic name according to the local style guide.
Expensive, yes. But cost is a management problem, not the kind of
conceptual problem Ka-Ping claims is presented by multilingual
identifiers. Python is still, in this sense, a finitely generated
language.
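The substitution idea above can be sketched in a few lines. This is a minimal illustration, not anything from the PEP: it uses Python's tokenize module to replace every identifier containing non-ASCII characters with a deterministic, machine-generated ASCII name built from its hex code points (the names `ascii_alias` and `asciify_identifiers` are mine).

```python
# Sketch: replace non-ASCII identifiers with machine-generated ASCII
# names derived from their hex code points, so the code can be reviewed
# in a reader's preferred character set.
import io
import tokenize


def ascii_alias(name: str) -> str:
    """Map a non-ASCII identifier to a deterministic ASCII name."""
    return "id_" + "_".join(f"{ord(ch):04x}" for ch in name)


def asciify_identifiers(source: str) -> str:
    """Rewrite source so every non-ASCII identifier becomes ASCII."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            tok = tok._replace(string=ascii_alias(tok.string))
        tokens.append(tok)
    return tokenize.untokenize(tokens)


print(asciify_identifiers("単価 = 100\n合計 = 単価 * 3\n"))
```

Once the reviewer works out what a machine-named identifier means, a search-and-replace gives it a mnemonic name per the local style guide, exactly as described above.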
> to editing issues (being that I write and maintain a Python editor)
Multilingual editing (except for non-LTR scripts) is pretty much a
solved problem, in theory, although adding it to any given
implementation can be painful. However, since there are many
programmer's editors that can handle multilingual text already, that
is not a strong argument against PEP 3131.
> Yes, PEP 3131 makes writing software in Python easier for some, but for
> others, it makes maintenance of 3rd party code a potential nightmare
> (regardless of 'community standards' to use ascii identifiers).
Yes, there are lots of nightmares. In over 15 years of experience
with multilingual identifiers, I can't recall any that have lasted
past the break of dawn, though.
I just don't see such identifiers very often, and when I do, they are
never hard to deal with. Admittedly, I don't ever need to deal with
Arabic or Devanagari or Thai, but I'd be willing to bet I could deal
with identifiers in those languages, as long as the syntax is ASCII.
As for third party code, "the doctor says that if you put down that
hammer, your head will stop hurting". If multilingual third party
code looks like a maintenance risk, don't deal with that third
party. Or budget for translation up front; translators are quite a
bit cheaper than programmers.
BTW, "find . -name '*.py' | xargs grep -l '[^[:ascii:]]'" is a pretty
cheap litmus test for your software vendors! And yes, it *should* be
looking into strings and comments. In practice (once I acquired a
multilingual editor), handling non-English strings and comments has
been 99% of the headache of maintaining code that contains non-ASCII.
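Since the `[:ascii:]` character class is not supported by every grep implementation, a small Python script is a more portable version of the same litmus test. This is my own sketch; the function name and the choice to scan byte-wise (so strings and comments are covered too) are illustrative.

```python
# Portable equivalent of the grep one-liner: list .py files under a
# directory that contain any non-ASCII bytes, including in strings
# and comments.
from pathlib import Path


def files_with_non_ascii(root: str) -> list[Path]:
    """Return the .py files under root containing non-ASCII bytes."""
    hits = []
    for path in Path(root).rglob("*.py"):
        data = path.read_bytes()
        if any(byte > 0x7F for byte in data):
            hits.append(path)
    return hits


for path in files_with_non_ascii("."):
    print(path)
```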
I've been maintaining the edict.el library, an interface to Jim
Breen's Japanese-English dictionary EDICT for XEmacs for 10 years
(there was serious development activity for only about the first 2,
though). A large fraction of the identifiers specific to that library
contain Japanese characters (both ideographic kanji and syllabic kana,
as well as the pseudo-namespace prefix "edict-" in ASCII). There are
several Japanese identifiers in there whose meaning I still don't
know, except by referring to the code to see what it does (they're
technical terms in Japanese linguistics, I believe, and probably about
as intelligible to the layman as terms in Dutch tax law). At the time
I started maintaining that library, I did so because I *couldn't
read* the Japanese.
This turned out to pose no problem. Japanese identifiers were *not*
visually distinct to me, but when I needed to analyze a function, I
became familiar with the glyphs of related identifiers quickly. And
having an intelligible name to start with wouldn't have helped much; I
needed to analyze the function because it wasn't doing what I wanted
it to do, not because I couldn't translate the name.
There are other packages in XEmacs which use non-ASCII, non-English
identifiers, but they are rare. Maintaining them has never been
reported as a problem.
N.B. This is limited experience with what many might characterize as
a niche language. And I'm an idiosyncratic individual, blessed with a
reasonable amount of talent at language learning. Both valid points.
However, I think the killer point in the above is the one about
strings and comments. If you can discipline your team to write
comments and strings in ASCII/English, extending that to identifiers
is no problem. If your team insists on multilingual strings/comments,
or needs them due to the task, multilingual identifiers will be the
least of your problems, and the most susceptible to technical solution
(eg, via identification and quarantine by cross-reference tables).
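One reading of the cross-reference-table idea is a tool that collects every non-ASCII identifier in a module along with the lines where it occurs, so those names can be reviewed (or quarantined) in one place. A hedged sketch, again using the tokenize module; this is my interpretation, not a tool from the post.

```python
# Sketch: build a cross-reference table mapping each non-ASCII
# identifier to the line numbers where it appears.
import io
import tokenize
from collections import defaultdict


def xref_non_ascii(source: str) -> dict[str, list[int]]:
    """Map each non-ASCII identifier to its line numbers."""
    table: dict[str, list[int]] = defaultdict(list)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            table[tok.string].append(tok.start[0])
    return dict(table)


print(xref_non_ascii("単価 = 100\n合計 = 単価 * 3\n"))
# {'単価': [1, 2], '合計': [2]}
```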
Granted, this is going to be a more or less costly transition for
ASCII-only Pythonistas. I think we should focus on cost-reduction,
not on why it shouldn't happen.
(Footnote to "don't deal with that third party" above: yes, I know,
in the real world sometimes you have to. Multilingual identifiers are
the least of your worries when dealing with a monopoly supplier.)