On May 13, 2020, at 05:31, Richard Damon Richard@damon-family.org wrote:
On 5/13/20 2:22 AM, Stephen J. Turnbull wrote:
This isn't a parsing problem as such. I am not an expert on the parser, but what's going on is something like this: the parser (tokenizer) sees the character "=" and expects an operator. Next, it sees something that is not "=" and not whitespace, so it expects a literal or an identifier. " “" is not parsable as the start of a literal, so the parser consumes up to the next boundary character (whitespace or operator). Now it checks for the different types of barewords: keywords and identifiers, and neither one works.
Here's the critical point: identifier fails because the tokenizer tries to match a sequence of Unicode word constituents, and " “" isn't one. So it fails the sequence of non-whitespace characters, and points to the end of the last thing it saw.
But that is the problem: identifier fails too late. It should have seen at the start that the first character wasn't valid in an identifier, and failed THERE, pointing at the bad character. There shouldn't be a post-hoc test for bad characters in the identifier; it should be a pre-test in the tokenizer.
So I see no reason why we need to transition to the new parser to fix this. (And the new parser (as of the last comment I saw from Guido) probably doesn't help: he kept the tokenizer.) We just need to make a second pass over the invalid identifier and identify the invalid characters it contains and their positions.
There is no need to rescan/reparse; the tokenizer shouldn't treat illegal characters as possibly part of a token.
Isn’t this what already happens?
>>> import tokenize, io
>>> def tok(s): return list(tokenize.tokenize(io.BytesIO(s.encode()).readline))
>>> tok('spam(“Abc”)')
When I run this in 3.7, the fourth token is an ERRORTOKEN with string “, then there’s a NAME with Abc, then another ERRORTOKEN with ”.
And reading the Lexical Analysis chapter of the docs, this seems correct. The smart quote is not a possible xid_start, or any other start of any token terminal, so it should immediately fail as an error. (The fact that the tokenizer eats it, generates an ERRORTOKEN, and then lexes the Abc as a NAME, rather than throwing an exception or otherwise punting, is a pretty nice error-recovery attempt, and seems perfectly reasonable.)
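A quick way to check the xid_start claim from Python itself, as a rough sketch: str.isidentifier() follows the same identifier rules as the Lexical Analysis chapter, so a one-character string tells you whether that character can start an identifier, and prefixing a known-good letter tells you whether it can continue one. The helper names here are just made up for illustration.

```python
import unicodedata

def can_start_identifier(ch):
    """True if the single character ch could begin an identifier."""
    return ch.isidentifier()

def can_continue_identifier(ch):
    """True if ch could appear after the first character of an identifier."""
    # "a" is a valid start, so the result depends only on ch.
    return ("a" + ch).isidentifier()

# Show the Unicode general category next to the two checks.
for ch in ("“", "”", "A", "_", "1"):
    print(repr(ch), unicodedata.category(ch),
          can_start_identifier(ch), can_continue_identifier(ch))
```

The smart quotes are punctuation (categories Pi and Pf), so they fail both checks, while "1" fails only the start check.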
Is that not true for the internal C tokenizer? Or is it true, but the parser or the error generating code isn’t taking advantage of it?
(By the way, I’m pretty sure this behavior isn’t specific to 3.7, but has been that way back into the mists of whenever you could first write old-style import hooks, even up to the way error recovery works. I’ve taken advantage of this behavior in experimenting with new syntax. If your new syntax is not just unambiguous at the parser level, but even at the lexical level, you can just scan the token stream for your matching ERRORTOKEN.)
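Scanning the token stream for ERRORTOKENs can be sketched roughly like this. It assumes the pure-Python tokenize behavior described above, where an unrecognized character comes back as an ERRORTOKEN; a version whose tokenize raises on such input instead would land in the except branch and return only what was collected up to that point. The function name error_tokens is made up for the example.

```python
import io
import tokenize

def error_tokens(source):
    """Return (string, (row, col)) for each ERRORTOKEN found in source.

    If tokenize raises instead of emitting ERRORTOKENs, stop and
    return whatever was collected before the exception.
    """
    found = []
    readline = io.BytesIO(source.encode("utf-8")).readline
    try:
        for tok in tokenize.tokenize(readline):
            if tok.type == tokenize.ERRORTOKEN:
                found.append((tok.string, tok.start))
    except (tokenize.TokenError, SyntaxError):
        # Assumption: some versions reject the input outright.
        pass
    return found

print(error_tokens("spam(“Abc”)"))
```

An import hook experimenting with new syntax could look through that list for its own marker character and rewrite the surrounding tokens before handing the source on to the compiler.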