On May 14, 2020, at 20:01, Stephen J. Turnbull email@example.com wrote:
AFAICT, my guess at what's going on in the C tokenizer was exactly right. It greedily consumes as many non-operator, non-whitespace characters as possible, then validates.
Well, it like like it’s not quite “non-operator, non-whitespace characters”, but rather “ASCII identifier or non-ASCII characters”:
(c >= 'a' && c <= 'z')\ || (c >= 'A' && c <= 'Z')\ || c == '_'\ || (c >= 128))
(That’s the initial char rule; the continuing char rule is similar but of course allows digits.)
So it won’t treat a $ or a ^G as potentially part of an identifier, so the caret will show up in the right place for one of those, but it will treat an emoji as potentially part of an identifier, so (if that emoji is immediately followed by legal identifier characters, ASCII or otherwise) the caret will show up too far to the right.
I’m still glad the Python tokenizer doesn’t do this (because, as I said, I’ve relied on the documented behavior in import hooks for playing around with Python, and they use the Python tokenizer), but that doesn’t matter for the C tokenizer, because its output is not public, it’s only seen by the parser. And I think you can prove that the error caret placement is the only thing that could be affected by this shortcut. And if it makes the tokenizer faster, or just simpler to maintain, that could easily be worth it.
(At least until one of those periodic “Python should add this Unicode operator” proposals actually gets some traction, but I don’t see that as likely any time soon.)
 Python only allows non-ASCII characters in identifiers, strings, and comments. Therefore, any string of characters that should be tokenized as a sequence of 1 ERRORTOKEN followed by 0 or more NAME and ERRORTOKEN tokens by the documented rule (and the Python code) will still give you a sequence of 1 ERRORTOKEN followed by 0 or more NAME and ERRORTOKEN tokens by the C code, just not necessarily the same such sequence. And any such sequence will be parsed as a SyntaxError pointing at the end of the initial ERRORTOKEN. So, the caret might be somewhere else within that block of identifier and non-ASCII characters, but it will be somewhere within that block.