
Danilo J. S. Bellini writes:
1. Using SyntaxError for lexical errors sounds as strange as saying a misspelling/typo is a syntax mistake in a natural language.
Well, I find that many typos are caught even though they look like (and often enough are) real words with unacceptable semantics (sometimes even the same part of speech). So I don't find that analogy at all compelling -- human recognition of typos is far more complex than computer recognition of parse errors.

And the Python lexer is very simple, even as translators go. It creates tokens for operators (which are more or less self-delimiting), indentation, and strings, and, failing that, for sequences of characters delimited by spaces, newlines, and operators. At that point token recognition is complete. For tokens of as-yet unknown type, it then checks whether the token is a keyword and, if not, whether it is a number. If not, then in a syntactically correct program what's left is an identifier (and I suppose that's why this error message says "identifier", and why it points to the end of the token, not the "bad" character). It then checks the putative identifier and discovers that the token isn't well-formed as an identifier.

I think it's a very good idea to keep this tokenization process simple. So in my proposal, it's intentionally not a lexical error, but rather a new kind of self-delimiting token (with no syntactic role in correct programs). A lexical error means that the translator failed to construct a token at all (including identifying its syntactic role). That's very bad: theoretically speaking, all bets are off, and who knows what the rest of the program might mean? Pragmatically, you can use heuristics to generate error messages and reset the lexer to an "appropriate" state, but as Nick points out, those heuristics are unreliable and may do more harm than good, and it's not clear what the appropriate reset state is.

Making an invalid_token (perhaps a better name for current purposes would be invalid_character_token) means that there are no lexical errors (except for UnicodeErrors, but they are "below" the level of the language definition). This is consistent with current Python practice for pure ASCII programs:
>>> a$b
  File "<stdin>", line 1
    a$b
     ^
SyntaxError: invalid syntax
Note that the caret is in the right place, so '$' is being treated as an operator. (The same happens with '?', the other non-identifier, non-control ASCII character without specified semantics.) The advantage is that the tokenized program has much more structure, and a much more restricted set of valid structures it can match, than an untokenized string (remember, it's already known to contain errors); correct positioning of the caret is an immediate benefit, see below.

Of course you could implicitly do the same thing at the lexical level, but "explicit is better than implicit". Since we're trying to reason about invalid programs, the motivation is heuristic either way, but an explicit definition of invalid_token means that the processing done by the translator is easier to understand, and it would restrict the ways that handling of this error could change in the future. I consider that restriction to be a good thing in this context, YMMV.
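For what it's worth, you can watch this happen from Python itself with the stdlib tokenize module. That module is a pure-Python relative of the real C tokenizer, so this is illustrative only, and the exact token types vary between versions:

    import io
    import tokenize

    # Run the stdlib tokenizer over the offending line.  On many 3.x
    # releases '$' comes out as its own one-character token (typically
    # ERRORTOKEN), neatly delimited from the names around it; newer
    # versions may raise instead of emitting an error token.
    source = "a$b\n"
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            print(tokenize.tok_name[tok.type], repr(tok.string), tok.start)
    except (tokenize.TokenError, SyntaxError) as err:
        print("tokenization refused:", err)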
2. About those lexical error messages, a misaligned caret is worse than no caret at all, but unless I'm missing something, one can't guarantee that the terminal is printing the error message with the right encoding.
But it will print the character in the erroneous line and that character in the error message the same way, which should be enough (and certainly will be enough in the "curly quotes" example). To identify the exact character that Python is complaining about (regardless of whether the glyphs in the error message are what the user sees in her editor), the Unicode scalar value is included (or even the Unicode name, although that requires importing the Unicode character database, which might be undesirable).
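To make that concrete, here is a minimal sketch of the kind of report I have in mind. The function name and exact message format are mine, purely for illustration; this is not how CPython's C tokenizer actually works:

    import unicodedata

    def describe_bad_identifier(token):
        """Find the first character that makes `token` invalid as an
        identifier and describe it unambiguously (illustrative only)."""
        for i in range(len(token)):
            if not token[:i + 1].isidentifier():
                ch = token[i]
                name = unicodedata.name(ch, "<unknown>")
                return "invalid character U+%04X (%s) in identifier" % (ord(ch), name)
        return None

    # The "curly quotes" case: a RIGHT DOUBLE QUOTATION MARK inside a name.
    print(describe_bad_identifier("foo\u201dbar"))
    # -> invalid character U+201D (RIGHT DOUBLE QUOTATION MARK) in identifier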
Including the row and column numbers in the message would be helpful.
The line number is already there, and the current tokenization process will set the column number to the place where the caret goes. My proposal gets this right automatically, without requiring Python to do any more analysis than "end of token", which it already knows.
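For what it's worth, both pieces of information are already carried on the exception object; the only question is where the offset points:

    # SyntaxError already records the position the traceback machinery
    # uses to place the caret.  (Exact offset values differ across
    # Python versions.)
    try:
        compile("a$b\n", "<example>", "exec")
    except SyntaxError as err:
        print(err.msg, err.lineno, err.offset, repr(err.text))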
6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having Unicode keywords is merely contingent on Python 2 behavior that emphasized ASCII-only code (besides comments and strings).
It's more than that. For better or worse, English is the natural language source for Python keywords (even "elif" is a contraction, and feels natural to this native speaker), and I can think of no variant of English where (plausible) candidate keywords can't be spelled with ASCII. "lambda" itself is the only plausible exception as far as I know, and even there "lambda calculus" is perfectly good English now.
7. I really don't like the editor "magic", it would be better to create a packaging/setup.py translation script than that (something like 2to3).
2to3 can be used for this purpose; it's quite flexible about the rulesets that can be defined and applied. But note that adding this capability to the stdlib would in effect fork the language within the CPython implementation, just as Python 3 is a fork of Python 2. That sounds like a bad idea to me -- some people have always complained that porting to Python 3 is almost like learning a new language, many people are already complaining that Python 3 is getting bigger than they like, and it would impose a burden on other implementations.
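For example, such a translation script can be done as a token-stream rewrite in a few lines (not an actual lib2to3 fixer, which works on parse trees). This is only a sketch under my own assumptions: the 'λ' alias and the function name are hypothetical, and real tooling would have to deal with files, encodings, and error handling:

    import io
    import tokenize

    def expand_lambda_alias(source, alias="\u03bb"):
        """Rewrite a hypothetical 'λ' alias back to the real 'lambda'
        keyword at the token level, so the result is ordinary Python.
        Sketch only; not proposed for the stdlib."""
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and tok.string == alias:
                out.append((tokenize.NAME, "lambda"))
            else:
                out.append((tok.type, tok.string))
        # 2-tuples put untokenize in "compatibility" mode, which rebuilds
        # spacing itself instead of trusting the original column numbers.
        return tokenize.untokenize(out)

    print(expand_lambda_alias("f = \u03bb x: x + 1\n"))
    # spacing is regenerated, so the output is ugly but compiles:
    # something like "f =lambda x :x +1"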
Still worse are the git hooks to perform the replacement before/after a commit: how should one test code that uses that? It somehow feels out of control.
Exactly. All of this discussion about providing an alias for "lambda" seems out of control, and as a 20-year veteran of Emacs development (where there is no way to make a clean distinction between language and stdlib, apparently nobody has ever heard of TOOWTDI, and 3-line hacks are regularly committed to the core code), I get a terrifying feeling of deja vu. Improving the message for invalid identifiers of this particular kind, OTOH, is a straightforward extension of the existing mechanism.