[Python-ideas] allow `lambda' to be spelled λ

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Thu Jul 21 03:36:24 EDT 2016

Danilo J. S. Bellini writes:

 > 1. Using SyntaxError for lexical errors sounds as strange as saying a
 > misspell/typo is a syntax mistake in a natural language.

Well, I find that many typos are discovered even though they look like
(and often enough are) real words, with unacceptable semantics
(sometimes even the same part of speech).  So I don't find that analogy
at all compelling -- human recognition of typos is far more complex
than computer recognition of parse errors.

And the Python lexer is very simple, even among translators.  It
creates tokens for operators (which are more or less self-delimiting),
for indentation, for strings, and, failing those, for sequences of
characters delimited by spaces, newlines, and operators.  At that
point token recognition is complete.  For tokens of as-yet unknown
type, it then checks whether the token is a keyword and, if not,
whether it is a number.  If neither, then in a syntactically correct
program what's left is an identifier (I suppose that's why this error
message says "identifier", and why it points to the end of the token,
not the "bad" character).  It then checks the putative identifier and
discovers that the token isn't well-formed as an identifier.  I think
it's a very good idea to keep this tokenization process simple.
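(The token stream this process produces can be observed from Python
itself with the stdlib tokenize module -- a pure-Python reimplementation
of the tokenizer, but the categories match what I describe above.  A
minimal sketch:

```python
import io
import tokenize

# Lex a trivial statement and show the token types the lexer produces:
# an identifier, a self-delimiting operator, a number, and so on.
source = "x = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'x'
# OP '='
# NUMBER '1'
# NEWLINE '\n'
# ENDMARKER ''
```

Note that "keyword or number or identifier" resolution happens after
this level; tokenize reports all word-like tokens as NAME.)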

So in my proposal, it's intentionally not a lexical error, but rather
a new kind of self-delimiting token (with no syntactic role in correct
programs).  A lexical error means that the translator failed to
construct a token (including identifying its syntactic role).  That's
very bad: theoretically speaking, all bets are off, and who knows what
the rest of the program might mean?  Pragmatically, you can
use heuristics to generate error messages and reset the lexer to an
"appropriate" state, but as Nick points out, those heuristics are
unreliable and may do more harm than good, and it's not clear what the
appropriate reset state is.

Making an invalid_token (perhaps a better name for current purposes
would be invalid_character_token) means that there are no lexical
errors (except for UnicodeErrors, but they are "below" the level of
the language definition).  This is consistent with current Python
practice for pure ASCII programs:

>>> a$b
  File "<stdin>", line 1
    a$b
     ^
SyntaxError: invalid syntax

Note that the caret is in the right place, so '$' is being treated as
an operator.  (The same happens with '?', the other non-identifier
non-control ASCII character without specified semantics.)

The advantage is that the tokenized program has much more structure,
and a much more restricted set of valid structures it can match
(correct positioning of the caret is an immediate benefit; see below),
than an untokenized string (remember, it's already known to contain
errors).
Of course you could implicitly do the same thing at the lexical level,
but "explicit is better than implicit".  Since we're trying to reason
about invalid programs, the motivation is heuristic either way, but an
explicit definition of invalid_token means that the processing by the
translator is easier to understand, and it would restrict the ways
that handling of this error could change in the future.  I consider
that restriction to be a good thing in this context, YMMV.

 > 2. About those lexical error messages, the caret is worse than the lack of
 > it when it's not aligned, but unless I'm missing something, one can't
 > guarantee that the terminal is printing the error message with the right
 > encoding.

But it will print the character in the erroneous line and that
character in the error message the same way, which should be enough
(it certainly will be in the "curly quotes" example).  To identify the
exact character Python is concerned with (regardless of whether the
glyphs in the error message are what the user sees in her editor), the
Unicode scalar value is included (or even the Unicode name, though
that requires importing the Unicode character database, which might be
undesirable).
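(A sketch of how such a message could pin the character down; the
helper name `describe` is mine, not anything in the tokenizer.  The
scalar value needs no imports at all, while the name comes from the
unicodedata module:

```python
import unicodedata

def describe(ch):
    """Format a character the way a lexer error message might:
    the Unicode scalar value, plus the character name if it has one."""
    scalar = "U+{:04X}".format(ord(ch))
    # unicodedata.name() raises ValueError for unnamed characters
    # unless a default is supplied.
    name = unicodedata.name(ch, "<unnamed>")
    return "{} ({})".format(scalar, name)

print(describe("\u201c"))  # U+201C (LEFT DOUBLE QUOTATION MARK)
```

The scalar alone is unambiguous even on a terminal with a broken
encoding, which is the point.)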

 > Including the row and column numbers in the message would be
 > helpful.

The line number is already there, and the current tokenization process
sets the column number to the place where the caret goes.  My proposal
fixes this automatically without requiring Python to do more analysis
than "end of token", which it already knows.
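(Both coordinates are in fact already carried on the exception itself,
as the lineno and offset attributes of SyntaxError, so nothing new is
needed beyond pointing the offset at the right token boundary.  A
sketch with the '$' example from above:

```python
# Compiling a line with a stray '$' raises SyntaxError; the exception
# object carries the position, not just the rendered caret.
source = "x = a$b\n"
try:
    compile(source, "<test>", "exec")
    err = None
except SyntaxError as exc:
    err = exc

# lineno and offset are 1-based; offset is where the caret points.
# The exact offset varies between CPython versions, so it isn't
# asserted here.
print(err.lineno, err.offset)
```

)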

 > 6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having
 > Unicode keywords is merely contingent on Python 2 behavior that
 > emphasized ASCII-only code (besides comments and strings).

It's more than that.  For better or worse, English is the natural
language source for Python keywords (even "elif" is a contraction, and
feels natural to this native speaker), and I can think of no variant
of English where (plausible) candidate keywords can't be spelled with
ASCII.  "lambda" itself is the only plausible exception as far as I
know, and even there "lambda calculus" is perfectly good English now.

 > 7. I really don't like the editor "magic", it would be better to create a
 > packaging/setup.py translation script than that (something like
 > 2to3).

2to3 can be used for this purpose; it's quite flexible about the
rulesets that can be defined and specified.
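(For illustration, a minimal token-level sketch of such a translation
script -- the function name `translate` and the λ-for-lambda rule are
this example's assumptions, not an existing 2to3 fixer.  It works
because λ already lexes as an ordinary NAME token:

```python
import io
import tokenize

def translate(source):
    """Rewrite the identifier λ to the keyword lambda (a sketch).

    Passing 2-tuples puts untokenize() into "compat" mode, which
    rebuilds valid (if loosely spaced) source without needing exact
    column positions for the longer replacement token.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == "λ":
            out.append((tok.type, "lambda"))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

print(translate("f = λ x: x + 1\n"))
```

)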

But note that that implies that adding this capability to the stdlib
would fork the language within the CPython implementation, just as
Python 3 is a fork from Python 2.  That sounds like a bad idea to me
-- some people have always complained that porting to Python 3 is
almost like learning a new language, many people are already
complaining that Python 3 is getting bigger than they like, and it
would impose a burden on other implementations.

 > Still worse are the git hooks to perform the replacement
 > before/after a commit: how should one test a code that uses that? 
 > It somehow feels out of control.

Exactly.  All of this discussion about providing an alias for "lambda"
seems out of control, and as a 20-year veteran of Emacs development
(where there is no way to make a clean distinction between language
and stdlib, apparently nobody has ever heard of TOOWTDI, and 3-line
hacks are regularly committed to the core code), it gives me a
terrifying feeling of déjà vu.

Improving the message for invalid identifiers of this particular kind,
OTOH, is a straightforward extension of the existing mechanism.
