[Python-3000] Invalid \U escape in source code gives hard-to-trace error

Guido van Rossum guido at python.org
Wed Jul 18 19:31:53 CEST 2007


On 7/17/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > When a source file contains a string literal with an out-of-range \U
> > escape (e.g. "\U12345678"), instead of a syntax error pointing to the
> > offending literal, I get this, without any indication of the file or
> > line:
> >
> > UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in
> > position 0-9: illegal Unicode character
> >
> > This is quite hard to track down.
>
> I think the fundamental flaw is that a codec is used to implement
> the Python syntax (or, rather, lexical rules).
>
> Not quite sure what the rationale for this design was; doing it on
> the lexical level is (was) tricky because \u escapes were allowed
> only for Unicode literals, and the lexer had no knowledge of the
> prefix preceding a literal. (In 3k, it's still similar, because
> \U escapes have no effect in bytes and raw literals).
>
> Still, even if it is "only" handled at the parsing level, I
> don't see why it needs to be a codec. Instead, implementing
> escapes in the compiler would still allow for proper diagnostics
> (notice that in the AST the original lexical form of the string
> literal is gone).

I guess because it was deemed useful to have a codec for this purpose
too, thereby exposing the algorithm to Python code that needs the same
functionality (e.g. the compiler package, RIP).

> > (Both the location of the bad
> > literal in the source file, and the origin of the error in the parser.
> > :-) Can someone come up with a fix?
>
> The language definition makes it difficult to fix it where I would
> consider the "proper" place, i.e. in the tokenization:
>
> http://docs.python.org/ref/strings.html
>
> says that escapeseq is "\" <any ASCII character>. So
> "\x" is a valid shortstring.
>
> Then it becomes fuzzy: It says that any unrecognized escape
> sequences are left in the string. While that appears like a clear
> specification, it is not implemented (and has not been since
> Python 2.0). According to the spec, '\U12345678' is well-formed,
> and denotes the same string as '\\U12345678'.
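As implemented in CPython, only truly unrecognized escapes such as `\q` follow that rule; malformed `\x`/`\u`/`\U` payloads raise instead. A quick illustration of what the spec's reading would mean (note that recent CPython versions additionally warn about escapes like `\q`):

```python
# An escape with no assigned meaning is kept verbatim:
# '\q' is the two characters backslash and 'q'.
s = "\q"
print(len(s))        # 2
print(s == "\\q")    # True

# Under the spec's wording, '\U12345678' would denote the same
# ten-character string as this explicitly escaped form:
spec_reading = "\\U12345678"
print(len(spec_reading))  # 10
```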
>
> I now see the following choices:
> 1. Restore implementing the spec again. Stop complaining about
>    invalid escapes for \x and \U, and just interpret the \
>    as '\\'. In this case, the current design could be left in
>    place, and the codecs would just stop raising these errors.

Sounds like a bad idea. I think \xNN (where N is not a hex digit) once
behaved this way, and it was changed to explicitly complain instead as
a service to users.
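That behavior is easy to confirm: a `\x` not followed by two hex digits is rejected both by the codec and at compile time (a quick check of CPython's behavior rather than a spec guarantee):

```python
import codecs

# The codec refuses a truncated \x escape rather than passing it through.
try:
    codecs.decode(b"\\xZZ", "unicode_escape")
except UnicodeDecodeError as exc:
    print(exc)

# The same malformed escape in a source literal is a compile-time error.
try:
    compile(r'"\xZZ"', "<test>", "eval")
except SyntaxError as exc:
    print(exc)
```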

> 2. Change the spec to make it an error if \x is not followed
>    by two hex digits, \u not by four hex digits, \U not by
>    8, or the value denoted by the \U digits is out of range.
>    In this case, I would propose to move the lexical analysis
>    back into the parser, or just make an internal API that
>    will raise a proper SyntaxError (it will be tricky to
>    compute the column in the original source line, though).

I'm all in favor of this spec change. Eventually we should change the
lexer to do this right; for now, Kurt's patch is good enough.
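For what it's worth, this is the direction CPython eventually took: the compiler wraps the codec failure in a SyntaxError that carries the file and line of the offending literal. A sketch of the observable behavior in later 3.x releases:

```python
# Compiling a bad \U literal yields a SyntaxError with location info
# instead of a bare UnicodeDecodeError with none.
src = '"\\U12345678"'
try:
    compile(src, "example.py", "eval")
except SyntaxError as exc:
    # The exception records where the bad literal was found.
    print(exc.filename)  # example.py
    print(exc.lineno is not None)
```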

> 3. Change the spec to constrain escapeseq, giving up
>    the rule that uninterpreted escapes silently become
>    two characters. That's difficult to write down in EBNF,
>    so should be formulated through constraints in natural
>    language. The lexer would have to keep track of what kind
>    of literal it is processing, and reject invalid escapes
>    directly on source level.

-1

> There are probably other options as well.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

