New subject: [Python-ideas] Error handling for unknown Unicode characters (was Re: allow `lambda' to be spelled λ)

July 21, 2016

      On 21 July 2016 at 15:08, Rustom Mody <rustompmody@gmail.com> wrote:
...
My “wrongheaded” was (intended) quite narrow and technical:
- The embargo on non-ASCII everywhere in the language except identifiers
  (strings and comments obviously don't count as “in” the language)
- The opening of identifiers to large swathes of Unicode widens as you say
  hugely the surface area of attack
This was solely the contradiction I was pointing out.

OK, thanks for the clarification, and my apologies for jumping on you.
I can be a bit hypersensitive on this topic, as my day job sometimes
includes encouraging commercial redistributors and end users to stop
taking community volunteers for granted and instead help find ways to
ensure their work is sustainable :)

As it is, I think there are some possible checks that could be added
to the code generator pipeline to help clarify matters:

- for the "invalid character" error message, we should be able to
always report both the printed symbol *and* the ASCII hex escape,
rather than assuming the caret will point to the correct place
- the caret positioning logic for syntax errors needs to be checked to
see if it's currently counting encoded UTF-8 bytes instead of code
points (as that will consistently do the wrong thing on a correctly
configured UTF-8 terminal)
- (more speculatively) when building the symbol table, we may be able
to look for identifiers referenced in a namespace that are not NFKC
equivalent, but nevertheless qualify as Unicode confusables, and emit
a SyntaxWarning (this is speculative, as I'm not sure what degree of
performance hit would be associated with it)
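
To make the three checks above a little more concrete, here is a rough Python sketch. It is only an illustration and not how CPython's tokenizer or symbol table is implemented: the helper names (describe_char, byte_col_to_char_col, skeleton, confusable_pairs) are hypothetical, and the tiny _SKELETON table is a hand-written stand-in for the real Unicode confusables data, which would have to be loaded separately.

```python
import unicodedata


def describe_char(ch):
    """Return the printed symbol together with its code point and name."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"{ch!r} (U+{ord(ch):04X}, {name})"


def byte_col_to_char_col(line, byte_col):
    """Convert a 0-based UTF-8 byte offset in `line` to a code point offset."""
    prefix = line.encode("utf-8")[:byte_col]
    # errors="ignore" drops a trailing partial sequence if byte_col
    # lands in the middle of a multi-byte character.
    return len(prefix.decode("utf-8", errors="ignore"))


# Hypothetical, hand-written stand-in for the Unicode confusables table:
# Cyrillic a, e, o mapped to their Latin look-alikes.
_SKELETON = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}


def skeleton(identifier):
    """Map an identifier to a form in which look-alike characters collapse."""
    normalized = unicodedata.normalize("NFKC", identifier)
    return "".join(_SKELETON.get(ch, ch) for ch in normalized)


def confusable_pairs(names):
    """Yield pairs of names that differ after NFKC but share a skeleton."""
    seen = {}
    for name in names:
        norm = unicodedata.normalize("NFKC", name)
        first = seen.setdefault(skeleton(norm), norm)
        if first != norm:
            yield first, norm
```

With these, an "invalid character" message could include describe_char(ch) instead of relying on the caret alone, and the caret column could be derived with byte_col_to_char_col() if the tokenizer tracks byte offsets. Names that are NFKC equivalent (such as "file" and the same word spelled with the U+FB01 ligature) are deliberately not reported, since they already refer to the same identifier.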

As far as Danilo's observation regarding the CPython code generator
always emitting SyntaxError and SyntaxWarning (regardless of which
part of the code generation actually failed) goes, I wouldn't be
opposed to our getting more precise about that by defining additional
subclasses, but one of the requirements would be for documentation in
https://docs.python.org/devguide/compiler.html or helper functions in
the source to clearly define "when working on <this> part of the code
generation pipeline, raise <that> kind of error if something goes
wrong".
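
As a sketch of what "more precise" could look like, assuming one subclass per pipeline stage: the class names and the stage_error() helper below are purely hypothetical and do not exist in CPython. Because each class derives from SyntaxError, existing "except SyntaxError" handlers would keep working.

```python
class TokenizerError(SyntaxError):
    """Hypothetical: the tokenizer rejected the source text."""


class ParserError(SyntaxError):
    """Hypothetical: the token stream did not match the grammar."""


class SymbolTableError(SyntaxError):
    """Hypothetical: symbol table construction failed."""


# A single table answering "when working on <this> stage,
# raise <that> kind of error".
STAGE_ERRORS = {
    "tokenizer": TokenizerError,
    "parser": ParserError,
    "symtable": SymbolTableError,
}


def stage_error(stage, msg, filename, lineno, offset, text):
    """Build the stage-specific error with the usual SyntaxError details."""
    return STAGE_ERRORS[stage](msg, (filename, lineno, offset, text))
```

A stage would then do something like "raise stage_error("tokenizer", "invalid character", filename, lineno, offset, line)", and the lookup table doubles as the documentation of which stage raises what.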

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
