On 2020-05-11 09:21, Chris Angelico wrote:
On Mon, May 11, 2020 at 6:09 PM Steve Barnes GadgetSteve@live.co.uk wrote:
Actually, in the case of the “wrong quotes” it puts the pointer under the character before the space character or at the end of the line (if you have a fixed spacing font – worse if you don’t) – it still doesn’t tell you which character is invalid.
This is actually a good point.
But it's a different point:
Having an invalid character in an identifier shows the caret at the end of the identifier, regardless of where in the identifier the error is. That's something that could be improved on, regardless of the quote issue. There's a new parser on its way (PEP 617), so it'd be something to consider on that basis.
This isn't a parsing problem as such. I am not an expert on the parser, but what's going on is something like this: the parser (tokenizer) sees the character "=" and expects an operator. Next, it sees something that is not "=" and not whitespace, so it expects a literal or an identifier. " “" is not parsable as the start of a literal, so the parser consumes up to the next boundary character (whitespace or operator). Now it checks for the different types of barewords: keywords and identifiers, and neither one works.
Here's the critical point: identifier fails because the tokenizer tries to match a sequence of Unicode word constituents, and " “" isn't one. So it rejects the whole run of non-whitespace characters, and points to the end of the last thing it saw.
So I see no reason why we need to transition to the new parser to fix this. (And the new parser (as of the last comment I saw from Guido) probably doesn't help: he kept the tokenizer.) We just need to make a second pass over the invalid identifier and identify the invalid characters it contains and their positions.
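To make the "second pass" idea concrete, here is a rough sketch in pure Python. It is not how the C tokenizer works, and the function name find_invalid_identifier_chars is made up for illustration. It assumes the rejected run of characters is already available as a string, and it uses str.isidentifier() as a stand-in for the tokenizer's own identifier test.

```python
import unicodedata


def find_invalid_identifier_chars(token):
    """Return (offset, char, unicode_name) for every character in `token`
    that cannot appear at that position in a Python identifier.

    Hypothetical helper: the real tokenizer is written in C and does not
    expose anything like this.
    """
    bad = []
    for offset, ch in enumerate(token):
        if offset == 0:
            # The first character must be a valid identifier start.
            ok = ch.isidentifier()
        else:
            # Later characters only need to be valid continuations, so
            # test them behind a known-good start character.
            ok = ("a" + ch).isidentifier()
        if not ok:
            bad.append((offset, ch, unicodedata.name(ch, "<unnamed>")))
    return bad
```

For the token “hello” (with directed quotes) this reports offsets 0 and 6, so the caret could go under the first offending character and the message could name all of them.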
I wouldn't object if the syntax error reported that, say, the wrong type of quote was being used and included something like: Do you mean "?
Wrong kind of quote (not "). Wrong kind of hyphen or minus (-). Etc.
As a permanent resident of Japan, I DEMAND that YOU PERSONALLY implement the SAME TEST for all the Japanese "full-width" operator characters. :-) (This is actually a very common user error, and it's very hard to tell the difference by sight in many fonts, same as directed quotes vs. ASCII quotes in English, but for the whole ASCII repertoire.) This could get really ridiculous.
I think the suggestion that whatever test it is that identified the "invalid character in identifier" defect be fixed to report both the position of the first such character and the list of all such characters is appropriate.
The "wrong kind of quote" stuff belongs elsewhere, and in particular in a linter. Here's a quasi-algorithmic suggestion for that: use the Unicode confusables list (and I think there are many properties such as "related" and "paired" characters that can be indicative). Haven't looked at it in a while; it may not catch all the issues here. But it would be a good start, and quite comprehensive. It might suggest other things linters could be doing, too.
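As a sketch of what such a linter check might look like: the standard library does not ship the confusables data, so the version below swaps in two cruder techniques. NFKC normalization (which does fold the full-width forms to ASCII) handles the Japanese case, and a tiny hand-written table handles quotes and dashes. The table QUOTE_LIKE and the function suggest_ascii are both hypothetical; a real linter would presumably load the confusables data instead.

```python
import unicodedata

# Hypothetical, deliberately incomplete table. NFKC does not fold these
# characters to ASCII, so they are listed by hand here.
QUOTE_LIKE = {
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2212": "-",  # MINUS SIGN
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
}


def suggest_ascii(ch):
    """Return the ASCII character `ch` was probably meant to be, or None."""
    if ch in QUOTE_LIKE:
        return QUOTE_LIKE[ch]
    # Full-width forms have compatibility decompositions to their ASCII
    # counterparts, so NFKC maps them back.
    folded = unicodedata.normalize("NFKC", ch)
    if folded != ch and len(folded) == 1 and folded.isascii():
        return folded
    return None
```

With this, a message like 'Do you mean "?' only needs suggest_ascii() to return something other than None for the offending character.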