Error handling for unknown Unicode characters (was Re: allow `lambda' to be spelled λ)

On 21 July 2016 at 15:08, Rustom Mody <rustompmody@gmail.com> wrote:
OK, thanks for the clarification, and my apologies for jumping on you. I can be a bit hypersensitive on this topic, as my day job sometimes includes encouraging commercial redistributors and end users to stop taking community volunteers for granted and instead help find ways to ensure their work is sustainable :)

As it is, I think there are some possible checks that could be added to the code generator pipeline to help clarify matters:

- for the "invalid character" error message, we should be able to always report both the printed symbol *and* the ASCII hex escape, rather than assuming the caret will point to the correct place
- the caret positioning logic for syntax errors needs to be checked to see if it's currently counting encoded UTF-8 bytes instead of code points (as that will consistently do the wrong thing on a correctly configured UTF-8 terminal)
- (more speculatively) when building the symbol table, we may be able to look for identifiers referenced in a namespace that are not NFKC equivalent, but nevertheless qualify as Unicode confusables, and emit a SyntaxWarning (this is speculative, as I'm not sure what degree of performance hit would be associated with it)

As far as Danilo's observation regarding the CPython code generator always emitting SyntaxError and SyntaxWarning (regardless of which part of the code generation actually failed) goes, I wouldn't be opposed to our getting more precise about that by defining additional subclasses, but one of the requirements would be for documentation in https://docs.python.org/devguide/compiler.html or helper functions in the source to clearly define "when working on <this> part of the code generation pipeline, raise <that> kind of error if something goes wrong".

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
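To make the three proposed checks concrete, here is a rough Python sketch. It is an illustration only, not how CPython implements or would implement any of this. The helper names `describe_char`, `codepoint_column` and `nfkc_equivalent` are hypothetical. The confusables check itself is not attempted, since the standard library ships no confusables table. Only the NFKC comparison that would come before it is shown.

```python
import unicodedata


def describe_char(ch):
    """Report a character as printed, plus an ASCII-only code point escape."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"{ch!r} (U+{ord(ch):04X}, {name})"


def codepoint_column(line, byte_offset, encoding="utf-8"):
    """Convert a 0-based byte offset in the encoded line to a code point offset."""
    prefix = line.encode(encoding)[:byte_offset]
    # errors="ignore" drops a trailing partial sequence if the byte offset
    # happens to fall inside a multi-byte character.
    return len(prefix.decode(encoding, errors="ignore"))


def nfkc_equivalent(a, b):
    """True if two identifiers normalize to the same NFKC form."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


line = "\u03bbx = \u20ac"             # "λx = €"
print(describe_char("\u20ac"))          # printed symbol and hex escape together
print(codepoint_column(line, 6))        # byte offset 6 is code point offset 5
print(nfkc_equivalent("\ufb01", "fi"))  # ligature "fi" folds to "fi" under NFKC
print(nfkc_equivalent("\u0430", "a"))   # Cyrillic "а" does not fold to Latin "a"
```

The last call is the case the speculative warning would be aimed at: two names that look the same on screen but remain distinct after NFKC normalization.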

On 21 July 2016 at 17:41, Nick Coghlan <ncoghlan@gmail.com> wrote:
Prompted by Chris Angelico, I took a closer look at the behaviour here, and it seems to be due to a problem with the caret being positioned at the end of a candidate "identifier" token, rather than at the beginning:

If you view those examples in a fixed width font, you'll see the caret is pointing at the "t" in each case, rather than at the first problematic code point. (Even in a proportional font, while you can't see the actual alignment, you *can* see that the alignment isn't right.)

By contrast, if you put an impermissible ASCII character into the "identifier", the caret points right at it.

If anyone's inclined to dig into the compilation toolchain to try to figure out what's going on, I filed an issue for this particular misbehaviour at http://bugs.python.org/issue27582

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
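The examples from the original message are not reproduced here. As a stand-in, the following sketch shows one way to inspect where the compiler places the caret for a non-ASCII character inside a would-be identifier. The input string and the helper name `get_syntax_error` are made up for illustration. The reported offset differs between Python versions, so no expected output is given.

```python
def get_syntax_error(source, filename="<example>"):
    """Compile source and return the SyntaxError it raises, or None."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return err
    return None


# Hypothetical input: U+20AC (euro sign) is not permitted in an identifier.
err = get_syntax_error("a\u20act = 1\n")
if err is not None:
    print(err.msg)
    if err.text:
        print(err.text.rstrip("\n"))
        if err.offset:
            # SyntaxError.offset is 1-based, so subtract one for the padding.
            print(" " * (err.offset - 1) + "^")
```

Running this on the interpreter under investigation and comparing the caret with the position of the euro sign shows directly whether the offset points at the start or the end of the candidate token.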

[Apologies for the previous premature send] On 7/21/2016 3:41 AM, Nick Coghlan wrote:
Like reporting errors on the tracker or at least python-list or even here, instead of only in books, articles, posts, other forum questions. I found a few unreported errors on SO questions (and have fixed a couple).
It is not true. https://bugs.python.org/issue25733

All these are (supposedly) possible:

SyntaxError -- obvious
NameError -- ?, already caught in code module
OverflowError -- ?, already caught in code module
SystemError -- 22 nested for loops ('deeply nested blocks')
TypeError -- chr(0), 2.7
ValueError -- chr(0), 3.x; bytes(0), 2.7

Emitting SystemError was changed in 27514.

I wish Danilo's observation were true. This issue is about the fact that the code module and IDLE do not catch all possible compile errors because there was no documented list until I compiled the above, which may still not be complete.

Or I wish that the compile doc gave a complete list.

--
Terry Jan Reedy
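For a tool that wraps `compile()`, as the code module and IDLE do, the list above suggests catching more than SyntaxError. The sketch below is one way to do that. The names `COMPILE_ERRORS` and `try_compile` are hypothetical, and the tuple simply mirrors the list in this message, which its author says may be incomplete. Which exception a given input raises also varies between Python versions.

```python
# Exception types taken from the list above. This is an assumption about
# coverage, not a documented guarantee about compile().
COMPILE_ERRORS = (
    SyntaxError,
    NameError,
    OverflowError,
    SystemError,
    TypeError,
    ValueError,
)


def try_compile(source, filename="<input>", mode="exec"):
    """Return (code, None) on success or (None, exception) on failure."""
    try:
        return compile(source, filename, mode), None
    except COMPILE_ERRORS as exc:
        return None, exc


for snippet in ("x = 1", "x = ", "a\0b"):
    code, exc = try_compile(snippet)
    if exc is None:
        print(repr(snippet), "compiled")
    else:
        print(repr(snippet), "failed with", type(exc).__name__)
```

Returning the exception instead of re-raising lets the caller decide how to display it, which is the situation an interactive shell is in.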
