[Python-3000] PEP 3131 accepted

Wed May 23 09:37:14 CEST 2007

I can see that I don't stand a very high chance of convincing you.
But I'd like to make sure you understand what I'm getting at, anyway.
(And I will get to some specific suggestions at the end of this
message.)

The key thing is that the language definition is about to transition
from something which has always "fit in your head", and which holds
that property as a core value, to something which cannot possibly fit
in anyone's head no matter how hard they try.  (This core value of
Python is not something I see as having been a core value of Java,
and it's one of the reasons I like Python better.)

> > PEP 3131 will also cause problems for code review.  Because many
> > characters have indistinguishable appearances, there will be no
> > mapping between what you see when you look at code and what the code
> > actually says.
>
> I trust most programmers to *want* to write clear code, so they will
> steer clear from such things. If someone wants to obfuscate their code
> they already have plenty of opportunities (even in Python!).

Indeed -- but that's not an argument for creating more opportunities.
For example, we like the fact that Python doesn't look like Perl; the
mere fact that some kinds of obfuscation are possible in Python
doesn't require us to give up on simplicity entirely and open the door
to a Perl-like proliferation of operators.

Not all programmers want to write clear code; from a security
perspective, the most important programmers are the ones who have an
incentive to fool you.  Unicode identifiers are a new avenue for any
insider who wants to use a Python program as a vector of attack; they
enable changes that are harder to detect, track down, and understand.

> The problem is no worse than the lack of difference between 1 and l in
> some fonts, and between l and I in others (and there are even fonts
> where o and 0 look the same).

It's far, far worse.  The number of ways in which characters can be
confused in Unicode is much greater.  There are many fonts you can
choose from that offer a clear visual difference between 1 and l and I,
whereas there are no fonts in the world that distinguish all the
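identifier characters in Unicode.  More importantly, there probably
never will be.  It's not just incrementally harder to identify
characters; because Unicode deliberately assigns distinct code points
to characters from different scripts even when they look identical,
it is impossible by design.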
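
To make the confusion concrete, here is a minimal sketch.  It assumes a
Python 3 interpreter with PEP 3131 in effect; the variable names are
mine, chosen only for illustration.  It takes the Latin letter "a" and
the Cyrillic letter that most fonts draw identically, and shows that
both are legal identifiers, that they are different names, and that
NFKC normalization does not merge them.

```python
import unicodedata

latin_a = "a"           # U+0061
cyrillic_a = "\u0430"   # U+0430, drawn like "a" in most fonts


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


names = [describe(latin_a), describe(cyrillic_a)]

# Both strings are valid identifiers, yet they are not the same string.
both_identifiers = latin_a.isidentifier() and cyrillic_a.isidentifier()
same = latin_a == cyrillic_a

# NFKC normalization leaves the Cyrillic letter as it is.
survives_nfkc = unicodedata.normalize("NFKC", cyrillic_a) == cyrillic_a

# Two assignments that look alike on a printout bind two different names.
namespace = {}
exec(latin_a + " = 1\n" + cyrillic_a + " = 2\n", namespace)
values = (namespace[latin_a], namespace[cyrillic_a])
```

On a printout, the two assignments passed to exec() would look like the
same variable being assigned twice.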

> Remember the mantra that *human* readability of code is
> important? Well, it helps if your code can use at least some of the
> language spoken by those humans.

Yes, a programming language is a communication medium among humans
and computers.  If you look at it that way, the problem is that we
lose the ability to make a round trip through human-readable media.

Suppose I hand you a printout of a Python program for you to review.
One of the questions you are faced with answering is, "Is this a valid
Python program?"  But your answer will necessarily be "I don't know",
for almost any program.  "I cannot possibly know" will be the only
truthful answer anyone can give.

Or suppose you are reading a book about Python and it shows you a bit
of code.  You want to type in the example -- but you cannot be sure
what you should type.

I don't deny that there is some convenience to be gained by those who
prefer to use other human languages when discussing and writing
programs.  But there is an extremely high cost to the language
definition.  With this definitional change, every Python program
that is displayed on a screen or printed on paper (or, in fact, in
any human-accessible representation) instantly becomes untrustworthy.

Another way to look at it is the computer science definition of a
language: what a language specifies is the set of acceptable programs.
So the purpose of a language is to restrict: to define the boundary
between what is in the language and what is not in the language.  But
that's just syntax; in addition, programming languages have semantics,
so the other half of the purpose is to give programs meaning for the
people who read them and construct compilers, interpreters, etc.  If
you put these two things together you get:

    The purpose of a programming language is to restrict the set
    of acceptable programs to a set that is small enough and simple
    enough that humans can agree on a clear meaning to each program.

Maybe this will help you see why I am so concerned about PEP 3131 --
in my judgement, it violates the fundamental purpose of a programming
language.  The big difference between natural languages and
programming languages is that it's okay for natural languages to be
fuzzy, but programs need to have exactly one meaning because they're
supposed to be operational.

                    *           *           *

Okay.  I've made my arguments, and I hope they will convince you.

But I recognize that they may not.  And if so, I have a couple of
suggestions for you to consider that might help address my concerns.

First: the "Common Objections" section of the PEP is too thin.  I'd
like the following arguments to be mentioned there for the record:

    1.  Python will lose the ability to make a reliable round trip
        between a computer file and any human-accessible medium
        such as a visual display or a printed page.

    2.  Python will become vulnerable to a new class of security
        exploits via the writing of misleading or malicious code
        that is visually indistinguishable from correct code.
        Consequently it will be more difficult for humans to
        inspect code and assure its correctness or trustworthiness.
        There is very little established best practice for
        addressing homograph security issues.

    3.  The Python language will become too large for any single
        person to fully know, in the sense that no human being can
        know the full character set, and therefore no one can ever
        acquire the ability to independently examine a program and
        decide whether it is valid Python.

    4.  Python programs that reuse other Python modules may come
        to contain a mix of character sets such that no one can
        fully read them or properly display them.

    5.  Unicode is young and unfinished.  As far as I know there
        are no truly complete Unicode fonts and there may not be
        for some time.  Tool support is weak.  The whole computer
        industry has 40 years of experience working with ASCII
        for everything, including programming languages; our
        experience with Unicode security issues and Unicode in
        programming languages is fairly immature.

Second: we need a way to be sure about the programs we're running.
So let the acceptance of Unicode identifiers be controlled by a
command-line flag, e.g. "python -U" accepts them, "python" alone
does not.  And let's keep the code for this feature clearly separated
so that one can be sure, with high confidence, that when this feature
is turned off, none of the code for Unicode identifiers will be
touched.  It should be possible to compile a Python that is incapable
of supporting Unicode identifiers.
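
The check that such a flag would enforce can be approximated outside
the interpreter.  The following is only a sketch under stated
assumptions: non_ascii_names is a hypothetical helper of my own, not
part of Python or of the PEP, and it relies on the standard tokenize
module as found in Python 3.  It reports every identifier that
contains a character outside ASCII, while leaving string literals and
comments alone.

```python
import io
import tokenize


def non_ascii_names(source):
    """Return (line_number, name) for each identifier with a non-ASCII character.

    Hypothetical helper for illustration; it only inspects NAME tokens,
    so non-ASCII text in strings and comments is not reported.
    """
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            found.append((tok.start[0], tok.string))
    return found


# The second "total" contains U+043E (Cyrillic o) instead of Latin "o".
sample = "total = 1\nt\u043etal = 2\nprint(total)\n"
report = non_ascii_names(sample)
```

A reviewer could run a check like this over a source tree before
trusting what the screen shows, but it is a tool bolted on afterward,
not the guarantee in the interpreter itself that I am asking for.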

Then people who want to use non-ASCII identifiers can do so, and
anyone can still run their programs if they want.  At the same time,
people who want to know exactly what their programs say can be
confident that Python is working with a small and manageable character
set.  And people who don't know or don't care about this change won't
suddenly have a whole new source of surprises thrust upon them; if
they know enough to know they want this feature, they can ask for it.

If we're going to introduce a significant new source of complexity,
let's at least make it easy to keep things simple (and reliably
simple) for those who want to do so; we can expect this to be the vast
majority, given interoperability and extensibility concerns, existing
industry practices, and the policy for the Python standard library.

What do you think?

-- ?!ng