Executive summary: AFAICT, my guess at what's going on in the C tokenizer was exactly right. It greedily consumes as many non-operator, non-whitespace characters as possible, then validates. It does this because it is tokenizing a stream of bytes encoding characters as UTF-8. Andrew Barnert via Python-ideas writes:
> Is that not true for the internal C tokenizer? Or is it true, but the parser or the error generating code isn’t taking advantage of it?
It would be bizarre if true. Why would the error reporting randomly take an invalid character and glom it on to the following characters to create an invalid identifier, then report that?

I suspect that the Python version is a tiny bit smarter than the C version, because it naturally processes (Unicode) characters while the C code processes (UTF-8) bytes by design (per the now-ancient PEP 263), but checking the Python code is left as an exercise for the interested reader. ;-)

Here's the relevant part of tokenizer.c:tok_get from Python 3.8 (all comments are mine, except for part of the comment about processing bfru strings):

/* Note note note: "character" seems to mean C char, i.e., byte!
   This is just from the declaration in struct tok_state; I haven't
   carefully confirmed that the program text being tokenized is UTF-8
   bytes, but that's what PEP 263 says to do, and it looks like
   that's what the I/O code is doing.  Which one is true doesn't
   matter to my analysis, because a UTF-8 byte c is part of a
   non-ASCII character if and only if c >= 128, while a Unicode
   character c is non-ASCII if and only if c >= 128.  Identifier
   consumption stops only when c is ASCII or EOF, so it can only
   stop on a UTF-8 character boundary, and the algorithm works
   exactly the same whether it consumes UTF-8-encoded bytes or
   Unicode characters.

   is_potential_identifier_start includes letters, underscore, and
   ALL non-ASCII characters.  is_potential_identifier_char includes
   all of those, plus digits. */

/* I suspect the Python version uses an accurate test here, rather
   than these accurate-for-ASCII-not-so-for-non-ASCII tests. */

/* l. 24 */
#define is_potential_identifier_start(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || c == '_'\
               || (c >= 128))

#define is_potential_identifier_char(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || (c >= '0' && c <= '9')\
               || c == '_'\
               || (c >= 128))

/* Skip 1000+ lines of I/O code. */

/* l.
1368 */
static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{
    /* Skip initialization and the handling of indentation,
       whitespace, and comments. */

    /* We attempt to parse an identifier as the first guess.  We
       start with code that handles the string prefixes "bfru".
       Otherwise we just consume potential identifier characters
       until we run into a character (byte) that is not a potential
       identifier character.  If any character is non-ASCII, set the
       nonascii flag. */

    /* l. 1492 */
    nonascii = 0;
    if (is_potential_identifier_start(c)) {
        while (1) {
            /* Process the various legal combinations of b"", r"",
               u"", and f"".  (Complicated multibranch
               if-else-if... statement omitted.)  If none match,
               break out of the while loop before getting the next
               c. */
            c = tok_nextc(tok);
            if (c == '"' || c == '\'') {
                goto letter_quote;
            }
        }
        /* If we get here, we may have seen some of bfru, but it's
           not legal string syntax, so we continue trying to extract
           an identifier.  In particular, if the first character c
           was non-ASCII, we broke out of the while loop having done
           nothing, so c is still that non-ASCII character. */
        while (is_potential_identifier_char(c)) {
            if (c >= 128) {
                nonascii = 1;
            }
            c = tok_nextc(tok);
        }
        /* The last thing we saw was not part of the potential
           identifier.  Unget it. */
        tok_backup(tok, c);
        /* In a PGEN build, verify_identifier always returns true,
           because PGEN doesn't have access to Python's Unicode
           routines; such a build would have to check for valid
           identifiers after the token stream is returned.
           Otherwise, verify_identifier validates the string using
           PyUnicode_IsIdentifier. */
        if (nonascii && !verify_identifier(tok)) {
            return ERRORTOKEN;
        }

So there you are.
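To make the "glomming" concrete, here is a rough Python sketch of the logic above. This is my transliteration, not CPython's actual code: the two predicates mirror the C macros, `scan_identifier` is a hypothetical stand-in for the excerpted part of tok_get, and `str.isidentifier` stands in for verify_identifier / PyUnicode_IsIdentifier.

```python
def is_potential_identifier_start(c):
    # Transliteration of the C macro: letters, underscore, and
    # ANY non-ASCII character (code point >= 128).
    return "a" <= c <= "z" or "A" <= c <= "Z" or c == "_" or ord(c) >= 128

def is_potential_identifier_char(c):
    # Same as above, plus digits, for continuation characters.
    return is_potential_identifier_start(c) or "0" <= c <= "9"

def scan_identifier(text, pos):
    """Greedily consume potential identifier characters starting at
    pos, then validate -- mimicking the tok_get excerpt.  Returns
    (token_type, token_text, new_pos)."""
    start = pos
    nonascii = False
    while pos < len(text) and is_potential_identifier_char(text[pos]):
        if ord(text[pos]) >= 128:
            nonascii = True
        pos += 1
    token = text[start:pos]
    # str.isidentifier stands in for PyUnicode_IsIdentifier here.
    if nonascii and not token.isidentifier():
        return ("ERRORTOKEN", token, pos)
    return ("NAME", token, pos)

# A "smart quote" passes the sloppy c >= 128 test but is not a valid
# identifier character, so the whole run is one bad token -- exactly
# the glomming described above:
print(scan_identifier("\u201cabc\u201d = 1", 0))  # ('ERRORTOKEN', '“abc”', 5)
print(scan_identifier("na\u00efve = 1", 0))       # ('NAME', 'naïve', 5)

# And the UTF-8 point from the comment: every byte of a multibyte
# UTF-8 sequence is >= 0x80, so a bytewise scan with the same test
# can only stop on a character boundary.
assert all(b >= 0x80 for b in "\u00ef".encode("utf-8"))
```

Of course the real tokenizer works on the byte stream and has to handle the string-prefix and backup cases, but the character-level sketch shows why the error token spans the invalid character and its neighbors.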
> (By the way. I’m pretty sure this behavior isn’t specific to 3.7,
As mentioned above, this code is from 3.8, and the algorithm (transcode the program text to UTF-8 and process it bytewise, relying on the fact that all characters Python has special knowledge of are ASCII) is specified in PEP 263.

Steve