[Python-ideas] Re: Improve handling of Unicode quotes and hyphens

13 May 2020

On May 13, 2020, at 05:31, Richard Damon wrote:
...
On 5/13/20 2:22 AM, Stephen J. Turnbull wrote:
...
MRAB writes:
...
This isn't a parsing problem as such.  I am not an expert on the
parser, but what's going on is something like this: the parser
(tokenizer) sees the character "=" and expects an operator.  Next, it
sees something that is not "=" and not whitespace, so it expects a
literal or an identifier.  " “" is not parsable as the start of a
literal, so the parser consumes up to the next boundary character
(whitespace or operator).  Now it checks for the different types of
barewords: keywords and identifiers, and neither one works.
Here's the critical point: identifier fails because the tokenizer
tries to match a sequence of Unicode word constituents, and " “"
isn't one.  So it fails the sequence of non-whitespace characters, and
points to the end of the last thing it saw.
But that is the problem, identifier fails too late, it should have seen
at the start that the first character wasn't valid in an identifier, and
failed THERE, pointing at the bad character. There shouldn't be a
post-hoc test for bad characters in the identifier, it should be a
pre-test in the tokenizer.
So I see no reason why we need to transition to the new parser to fix
this.  (And the new parser (as of the last comment I saw from Guido)
probably doesn't help: he kept the tokenizer.)  We just need to make a
second pass over the invalid identifier and identify the invalid
characters it contains and their positions.
There is no need to rescan/reparse, the tokenizer shouldn't treat
illegal characters as possibly part of a token.

Isn’t this what already happens?

    >>> import tokenize, io
    >>> def tok(s): return list(tokenize.tokenize(io.BytesIO(s.encode()).readline))
    >>> tok('spam(“Abc”)')

When I run this in 3.7, the fourth token is an ERRORTOKEN with string “, then there’s a NAME with Abc, then another ERRORTOKEN with ”.

And reading the Lexical Analysis chapter of the docs, this seems correct. The smart quote is not a possible xid_start, or any other start of any token terminal, so it should immediately fail as an error. (The fact that the tokenizer eats it, generates an ERRORTOKEN, and then lexes the Abc as a NAME, rather than throwing an exception or otherwise punting, is a pretty nice error-recovery attempt, and seems perfectly reasonable.)
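
To make the xid_start point concrete, here is a minimal sketch of a per-character check over a run of non-whitespace characters. The helper name invalid_identifier_chars is hypothetical, and it leans on str.isidentifier() as a stand-in for the xid_start/xid_continue rules; it is not how the C tokenizer is implemented.

```python
def invalid_identifier_chars(word):
    """Return (index, char) pairs for characters that cannot sit at
    that position in an identifier.

    Assumption: str.isidentifier() is used as a proxy for the
    xid_start / xid_continue rules in the Lexical Analysis chapter.
    """
    bad = []
    for i, ch in enumerate(word):
        if i == 0:
            # The first character must be able to start an identifier.
            ok = ch.isidentifier()
        else:
            # Later characters only need to be able to continue one,
            # so test them after a known-good start character.
            ok = ("a" + ch).isidentifier()
        if not ok:
            bad.append((i, ch))
    return bad


# Smart quotes wrapped around an otherwise valid name.
print(invalid_identifier_chars("\u201cAbc\u201d"))
```

For the smart-quoted Abc this reports the quotes at index 0 and index 4, which is the position information an error message would need in order to point at the bad character rather than at the end of the word.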

Is that not true for the internal C tokenizer? Or is it true, but the parser or the error generating code isn’t taking advantage of it?

(By the way, I’m pretty sure this behavior isn’t specific to 3.7, but has been that way back into the mists of whenever you could first write old-style import hooks, even up to the way error recovery works. I’ve taken advantage of this behavior in experimenting with new syntax. If your new syntax is not just unambiguous at the parser level, but even at the lexical level, you can just scan the token stream for your matching ERRORTOKEN.)
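
A rough sketch of that kind of scan, using only the public tokenize module. The helper name error_tokens is hypothetical. One assumption to flag: the ERRORTOKEN behavior described here is what 3.7 does; other interpreter versions may reject the input outright instead, so the sketch catches that case and returns whatever it has collected so far.

```python
import io
import tokenize


def error_tokens(source):
    """Collect (string, start) for every ERRORTOKEN in the source."""
    found = []
    readline = io.BytesIO(source.encode("utf-8")).readline
    try:
        for tok in tokenize.tokenize(readline):
            if tok.type == tokenize.ERRORTOKEN:
                found.append((tok.string, tok.start))
    except (tokenize.TokenError, SyntaxError):
        # Assumption: some versions raise here rather than emitting
        # ERRORTOKEN; keep what was seen before the failure.
        pass
    return found


# Output depends on the interpreter version, so none is shown here.
print(error_tokens("spam(\u201cAbc\u201d)"))
```

An experimental-syntax hook would then look up each reported position and rewrite the source there before handing it to the real compiler.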


Andrew Barnert