[Python-3000] Invalid \U escape in source code give hard-to-trace error
Guido van Rossum
guido at python.org
Wed Jul 18 19:31:53 CEST 2007
On 7/17/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > When a source file contains a string literal with an out-of-range \U
> > escape (e.g. "\U12345678"), instead of a syntax error pointing to the
> > offending literal, I get this, without any indication of the file or
> > line:
> > UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in
> > position 0-9: illegal Unicode character
> > This is quite hard to track down.
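[Editor's note: the reported error can be reproduced directly through the `unicode_escape` codec, which is what the compiler uses under the hood; a minimal sketch on a current CPython:]

```python
import codecs

# 0x12345678 is above the Unicode ceiling of 0x10FFFF, so the
# unicode_escape codec rejects it -- and, as reported, the exception
# carries no file or line information, only a byte position.
try:
    codecs.decode(rb"\U12345678", "unicode_escape")
except UnicodeDecodeError as exc:
    print(exc)  # "'unicodeescape' codec can't decode bytes in position 0-9: ..."
```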
> I think the fundamental flaw is that a codec is used to implement
> the Python syntax (or, rather, lexical rules).
> Not quite sure what the rationale for this design was; doing it on
> the lexical level is (was) tricky because \u escapes were allowed
> only for Unicode literals, and the lexer had no knowledge of the
> prefix preceding a literal. (In 3k, it's still similar, because
> \U escapes have no effect in bytes and raw literals).
> Still, even if it is "only" handled at the parsing level, I
> don't see why it needs to be a codec. Instead, implementing
> escapes in the compiler would still allow for proper diagnostics
> (notice that in the AST the original lexical form of the string
> literal is gone).
I guess because it was deemed useful to have a codec for this purpose
too, thereby exposing the algorithm to Python code that needs the same
functionality (e.g. the compiler package, RIP).
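[Editor's note: the codec is indeed reachable from Python code; a sketch of how a tool that re-implements literal handling, like the old compiler package did, could decode the body of a string literal itself:]

```python
import codecs

# Decode the raw bytes of a string literal's body the same way the
# interpreter does, via the unicode_escape codec.
raw = rb"caf\xe9, snowman: \u2603, face: \U0001F600"
text = codecs.decode(raw, "unicode_escape")
print(text)  # café, snowman: ☃, face: 😀
```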
> > (Both the location of the bad
> > literal in the source file, and the origin of the error in the parser.
> > :-) Can someone come up with a fix?
> The language definition makes it difficult to fix it where I would
> consider the "proper" place, i.e. in the tokenization: the reference
> grammar says that escapeseq is "\" <any ASCII character>, so
> "\x" is a valid shortstring.
> Then it becomes fuzzy: it says that any unrecognized escape
> sequences are left in the string. While that appears to be a clear
> specification, it is not implemented (and has not been since
> Python 2.0). According to the spec, '\U12345678' is well-formed,
> and denotes the same string as '\\U12345678'.
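[Editor's note: the "left in the string" rule is still observable for escapes the implementation does not recognize at all; a sketch (recent CPython releases additionally emit a warning for such escapes, ahead of tightening this further):]

```python
# "\q" is not a recognized escape, so per the spec the backslash is
# preserved and the literal denotes two characters, backslash + "q".
# (Newer CPythons warn about this while still accepting it.)
s = "\q"
assert s == "\\q"
assert len(s) == 2
```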
> I now see the following choices:
> 1. Restore implementing the spec again. Stop complaining about
> invalid escapes for \x and \U, and just interpret the \
> as '\\'. In this case, the current design could be left in
> place, and the codecs would just stop raising these errors.
Sounds like a bad idea. I think \xNN (where N is not a hex digit) once
behaved this way, and it was changed to explicitly complain instead as
a service to users.
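[Editor's note: that change is visible in later CPythons, where a malformed \x escape is a hard error at compile time rather than being passed through; a sketch:]

```python
# Unlike an unrecognized escape such as "\q", a truncated \x escape is
# rejected outright: compiling the literal "\x1" raises SyntaxError.
try:
    eval(r'"\x1"')
except SyntaxError as exc:
    print(exc)  # mentions a truncated \xXX escape
```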
> 2. Change the spec to make it an error if \x is not followed
> by two hex digits, \u not by four hex digits, \U not by
> 8, or the value denoted by the \U digits is out of range.
> In this case, I would propose to move the lexical analysis
> back into the parser, or just make an internal API that
> will raise a proper SyntaxError (it will be tricky to
> compute the column in the original source line, though).
I'm all in favor of this spec change. Eventually we should change the
lexer to do this right; for now, Kurt's patch is good enough.
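[Editor's note: this is essentially how later CPython releases ended up behaving — the bad escape surfaces as a SyntaxError that carries the filename and line number instead of a bare UnicodeDecodeError. A sketch checking that via compile():]

```python
# Compile a module containing the offending literal; on a current
# CPython the error carries filename and line number, which is exactly
# what the original report asked for.
source = 'x = "\\U12345678"\n'
try:
    compile(source, "example.py", "exec")
except SyntaxError as exc:
    print(exc.filename, exc.lineno)  # example.py 1
```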
> 3. Change the spec to constrain escapeseq, giving up
> the rule that uninterpreted escapes silently become
> two characters. That's difficult to write down in EBNF,
> so should be formulated through constraints in natural
> language. The lexer would have to keep track of what kind
> of literal it is processing, and reject invalid escapes
> directly on source level.
> There are probably other options as well.
--Guido van Rossum (home page: http://www.python.org/~guido/)