1. Using SyntaxError for lexical errors sounds as strange as calling a misspelling/typo a syntax mistake in a natural language. A new "LexicalError" or "TokenizerError" would make sense for that; perhaps both this new exception and SyntaxError should inherit from a new CompileError class. But SyntaxError already covers cases like TabError (an IndentationError), which is a lexical analysis error, not a parser one [1]. To avoid such changes while keeping the name, at least the SyntaxError docstring should be "Compile-time error." instead of "Invalid syntax.", and the documentation should be explicit that it isn't only about parsing/syntax/grammar but also about lexical analysis errors.

2. About those lexical error messages: a misaligned caret is worse than no caret at all, and unless I'm missing something, one can't guarantee that the terminal prints the error message with the right encoding. Including the row and column numbers in the message would be helpful.

3. There are people who like and use Unicode characters in identifiers. Usually I don't like to translate comments/identifiers to another language, but I did so myself, using variable names with accents in Portuguese for a talk [2], mostly to give it a try. Surprisingly, few people noticed that until I said so. The same can be said about Sympy scripts, where symbols like Greek letters would be meaningful (e.g. μ for the mean, σ for the standard deviation and Σ for the covariance matrix), so I'd argue it's quite natural.

4. Unicode has more than one codepoint for some symbols that look alike; for example, "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑", but this one is invalid in Python 3. The italic/bold/serif distinction seems enough to tell them apart, and when editing code with a Unicode character like that, most people would probably copy and paste the symbol instead of typing it, leading to consistent use of the same symbol. (As it happens, Python normalizes identifiers with NFKC per PEP 3131, so all of those sigma variants name the same identifier after compilation.)

5. New keywords, no matter whether they fit into 7-bit ASCII or require Unicode, unavoidably break backwards compatibility at least to some degree. That happened with the "nonlocal" keyword in Python 3, for example.

6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having Unicode keywords is merely contingent on Python 2 behavior that emphasized ASCII-only code (besides comments and strings).

7. The discussion isn't about lambda or anti-lambda bias; it's about keyword naming and readability. Who gains/loses with that resource? It won't hurt those who never use lambda and never use Unicode identifiers. Perhaps Sympy users would feel harmed by it, as well as users of other scientific packages, but looking for the "λ" character on GitHub I found no one using it on its own within Python code. The online Python books written in Greek that I found were using only English identifiers.

8. I don't know if any consensus can emerge in this matter about lambdas, but there's another subject that can be discussed together with it: macros. What the OP wants is exactly a "#define λ lambda", which would appear only in the code that uses/needs such a symbol with that meaning. A minimal lexical macro that just replaces a single identifier-like token with a keyword token would be enough for him. I don't know a nice way to do that; something like "from __replace__ import lambda_to_λ" or even "def λ is lambda" would avoid new keywords, but I also don't know how desired this resource is (perhaps to translate the language keywords to another language?).

9. I really don't like the editor "magic"; it would be better to create a packaging/setup.py translation script than that (something like 2to3). It's not about coloring/highlighting, nor about editor/IDE features; it's about seeing the object/file itself, and colors never change that AFAIK.
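The single-token replacement described in point 8 can be sketched with the standard tokenize module. This is only an illustration (the name expand_lambda_macro is made up, and a real feature would need an explicit per-file opt-in): it rewrites every identifier-like "λ" token as the lambda keyword before compilation.

```python
import io
import tokenize

def expand_lambda_macro(source):
    """Rewrite every identifier-like token 'λ' as the 'lambda' keyword.

    Minimal sketch of the "#define λ lambda" idea: a purely lexical,
    single-token replacement (keywords are NAME tokens at this level).
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    replaced = [
        (tok.type,
         "lambda" if tok.type == tokenize.NAME and tok.string == "λ" else tok.string)
        for tok in tokens
    ]
    # untokenize accepts (type, string) pairs and rebuilds compilable source
    return tokenize.untokenize(replaced)

source = "double = λ x: 2 * x\n"
namespace = {}
exec(expand_lambda_macro(source), namespace)
print(namespace["double"](21))  # 42
```

The replacement never touches strings or comments, since those arrive as single STRING/COMMENT tokens, which is what distinguishes this from a blind textual substitution.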
Also, most code I read isn't in my editor: sometimes it comes from cat/diff (terminal stdout), vim/gedit/pluma (editors), GitHub/BitBucket (web), blogs/forums/e-mails, gitk, Spyder (IDE), etc. That kind of "view" replacement would compromise some code alignment (e.g. multiline strings/comments) and line length, besides being a problem when looking for code with tools like find + grep/sed/awk (which I use all the time). Still worse are git hooks that perform the replacement before/after a commit: how should one test code that uses that? It somehow feels out of control.

[1] https://docs.python.org/3/reference/lexical_analysis.html
[2] http://www.slideshare.net/djsbellini/20140416-garoa-hc-strategy

2016-07-20 13:44 GMT-03:00 Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp>:
Nick Coghlan writes:
> The reason that can help is that the main problem with "improving" error messages is that it can be really hard to tell whether the improvements are actually improvements or not.
Personally, I think the real issue here is that the curly quote (and things like mathematical PRIME character) are easily confused with Python syntax and it all looks like grit on Tim's monitor. I tried substituting an emoticon and the DOUBLE INTEGRAL, and it was quite obvious what was wrong from the Python 3 error message.<wink/>
However, in this case, as far as I can tell from the error messages induced by playing with ASCII, Python 3.5 thinks that all non-identifier ASCII characters are syntactic (so for example it says that
with open($file.txt") as f:
is "invalid syntax"). But for non-ASCII characters (I guess including the Latin 1 set?) they are either letters, numerals, or just plain not valid in a Python program AIUI (outside of strings and comments, of course).
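A small check of that behavior (the message text varies between Python versions, so only the reported position is shown; even when the rendered message is just "invalid syntax", the SyntaxError object already carries the line and column):

```python
# Compile the '$' example above and inspect where the lexer stopped.
source = 'with open($file.txt") as f:\n    pass\n'
error = None
try:
    compile(source, "<example>", "exec")
except SyntaxError as exc:
    error = exc

# .lineno and .offset are 1-based; the message itself varies by version.
print(error.lineno, error.offset)
```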
I would think the lexer could just treat each invalid character as an invalid_token, which is always invalid in Python syntax, and the error would be a SyntaxError with the message formatted something like
"invalid character {} = U+{:04X}".format(ch, ord(ch))
This should avoid the strange placement of the position indicator, too.
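Applying that format verbatim would render, for instance (the helper name here is invented purely for the demonstration):

```python
def invalid_character_message(ch):
    # Exactly the message format proposed above.
    return "invalid character {} = U+{:04X}".format(ch, ord(ch))

print(invalid_character_message("\u201c"))  # invalid character “ = U+201C
print(invalid_character_message("\u2032"))  # invalid character ′ = U+2032
```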
If someday we decide to use a non-ASCII character for a syntactic purpose, that's a big enough compatibility break in itself that changing the invalid character set (and thus the definition of invalid_token) is insignificant.
I'm pretty sure this is what a couple of earlier posters have in mind, too.
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
--
Danilo J. S. Bellini
---------------
"*It is not our business to set up prohibitions, but to arrive at conventions.*" (R. Carnap)