Executive summary: AFAICT, my guess at what's going on in the C tokenizer was exactly right. It greedily consumes as many non-operator, non-whitespace characters as possible, then validates. It does this because it is tokenizing a stream of bytes encoding characters as UTF-8. Andrew Barnert via Python-ideas writes:
> Is that not true for the internal C tokenizer? Or is it true, but the parser or the error generating code isn’t taking advantage of it?
It would be bizarre if true. Why would the error reporting randomly take an invalid character and glom it on to the following characters to create an invalid identifier, then report that?

I suspect that the Python version is a tiny bit smarter than the C version, because it naturally processes (Unicode) characters while the C code processes (UTF-8) bytes by design (per the now-ancient PEP 263), but checking the Python code is left as an exercise for the interested reader. ;-)

Here's the relevant part of tokenizer.c:tok_get from Python 3.8 (all comments are mine, except for part of the comment about processing bfru strings):

/* Note note note: "character" seems to mean C char, i.e., byte!
   This is just from the declaration in struct tok_state; I haven't
   carefully confirmed that the program text being tokenized is UTF-8
   bytes, but that's what PEP 263 says to do, and it looks like
   that's what the I/O code is doing.  Which one is true doesn't
   matter to my analysis, because a UTF-8 byte c is part of a
   non-ASCII character if and only if c >= 128, while a Unicode
   character c is non-ASCII if and only if c >= 128.  Identifier
   consumption stops only when c is ASCII or EOF, so it can only
   stop on a UTF-8 character boundary, and the algorithm works
   exactly the same whether it consumes UTF-8-encoded bytes or
   Unicode characters.

   is_potential_identifier_start includes letters, underscore, and
   ALL non-ASCII characters.  is_potential_identifier_char includes
   all of those, plus digits. */

/* I suspect the Python version uses an accurate test here, rather
   than these accurate-for-ASCII-not-so-for-non-ASCII tests. */

/* l. 24 */
#define is_potential_identifier_start(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || c == '_'\
               || (c >= 128))

#define is_potential_identifier_char(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || (c >= '0' && c <= '9')\
               || c == '_'\
               || (c >= 128))

/* Skip 1000+ lines of I/O code. */

/* l.
1368 */
static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
    /* Skip initialization and the handling of indentation,
       whitespace, and comments. */

    /* We attempt to parse an identifier as the first guess.  We
       start with code that handles the string prefixes "bfru".
       Otherwise we just consume potential identifier characters
       until we run into a character (byte) that is not a potential
       identifier character.  If any character is non-ASCII, set the
       nonascii flag. */

    /* l. 1492 */
    nonascii = 0;
    if (is_potential_identifier_start(c)) {
        while (1) {
            /* Process the various legal combinations of b"", r"",
               u"", and f"".  (Complicated multibranch
               if-else-if... statement omitted.)  If none match,
               break out of the while loop before getting the next
               c. */
            c = tok_nextc(tok);
            if (c == '"' || c == '\'') {
                goto letter_quote;
            }
        }
        /* If we get here, we may have seen some of bfru, but it's
           not legal string syntax, so we continue trying to extract
           an identifier.  In particular, if the first character c
           was non-ASCII, we broke out of the while loop having done
           nothing, so c is still that non-ASCII character. */
        while (is_potential_identifier_char(c)) {
            if (c >= 128) {
                nonascii = 1;
            }
            c = tok_nextc(tok);
        }
        /* The last thing we saw was not part of the potential
           identifier.  Unget it. */
        tok_backup(tok, c);
        /* In a PGEN build, verify_identifier always returns true,
           because PGEN doesn't have access to Python's Unicode
           routines; such a build would have to check for valid
           identifiers after the token stream is returned.
           Otherwise, verify_identifier validates the string using
           PyUnicode_IsIdentifier. */
        if (nonascii && !verify_identifier(tok)) {
            return ERRORTOKEN;
        }

So there you are.
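To make the "glomming" concrete, here is a rough Python sketch of the logic above. This is my transliteration, not CPython's actual code: the two predicates mirror the C macros, `scan_identifier` is a hypothetical stand-in for the excerpted part of tok_get, and `str.isidentifier` stands in for verify_identifier / PyUnicode_IsIdentifier.

```python
def is_potential_identifier_start(c):
    # Transliteration of the C macro: letters, underscore, and
    # ANY non-ASCII character (code point >= 128).
    return "a" <= c <= "z" or "A" <= c <= "Z" or c == "_" or ord(c) >= 128

def is_potential_identifier_char(c):
    # Same as above, plus digits, for continuation characters.
    return is_potential_identifier_start(c) or "0" <= c <= "9"

def scan_identifier(text, pos):
    """Greedily consume potential identifier characters starting at
    pos, then validate -- mimicking the tok_get excerpt.  Returns
    (token_type, token_text, new_pos)."""
    start = pos
    nonascii = False
    while pos < len(text) and is_potential_identifier_char(text[pos]):
        if ord(text[pos]) >= 128:
            nonascii = True
        pos += 1
    token = text[start:pos]
    # str.isidentifier stands in for PyUnicode_IsIdentifier here.
    if nonascii and not token.isidentifier():
        return ("ERRORTOKEN", token, pos)
    return ("NAME", token, pos)

# A "smart quote" passes the sloppy c >= 128 test but is not a valid
# identifier character, so the whole run is one bad token -- exactly
# the glomming described above:
print(scan_identifier("\u201cabc\u201d = 1", 0))  # ('ERRORTOKEN', '“abc”', 5)
print(scan_identifier("na\u00efve = 1", 0))       # ('NAME', 'naïve', 5)

# And the UTF-8 point from the comment: every byte of a multibyte
# UTF-8 sequence is >= 0x80, so a bytewise scan with the same test
# can only stop on a character boundary.
assert all(b >= 0x80 for b in "\u00ef".encode("utf-8"))
```

Of course the real tokenizer works on the byte stream and has to handle the string-prefix and backup cases, but the character-level sketch shows why the error token spans the invalid character and its neighbors.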
> (By the way. I’m pretty sure this behavior isn’t specific to 3.7,
As mentioned above, this code is from 3.8, and the algorithm (transcode the program text to UTF-8 and process it bytewise, relying on the fact that all characters Python has special knowledge of are ASCII) is specified in PEP 263.

Steve