[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

Larry Hastings larry at hastings.org
Thu May 17 15:01:05 EDT 2018

I fed this into tokenize.tokenize():

    b''' x = "\u1234" '''

I was a bit surprised to see \uxxxx in the output.  Particularly because 
the output (t.string) was a *string* and not *bytes*.

It turns out, Python's tokenizer ignores escape sequences.  All it does 
is ignore the next character so that \" does the proper thing. But it 
doesn't do any substitutions.  The escape sequences are only handled 
when the AST node is created for the literal string!
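
A minimal sketch of the behavior described above, using only the standard library. The variable names are illustrative, and the leading indentation from the original input is dropped to keep the example simple. ast.literal_eval() stands in here for "the stage where the literal becomes an AST node"; it is one convenient way to observe the substitution, not necessarily the only one.

```python
import ast
import io
import tokenize

# Source bytes: the six characters \u1234 inside double quotes.
source = b'x = "\\u1234"\n'

tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
raw = next(t.string for t in tokens if t.type == tokenize.STRING)

# The token text is a str, and the escape is still spelled out literally.
print(type(raw).__name__, repr(raw))

# The escape is only substituted once the literal is evaluated.
value = ast.literal_eval(raw)
print(len(raw), len(value))
```

The STRING token keeps the quotes and the backslash sequence intact; only the evaluated value is the single character U+1234.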

Maybe I'm making a parade of my ignorance, but I assumed that string 
literals were parsed by the parser--just like everything else is parsed 
by the parser, hey it seems like a good place for it--and in particular 
that the escape sequence substitutions would be done in the tokenizer.  
Having stared at it a little, I now detect a whiff of "this design 
solved a real problem".  So... what was the problem, and how does this 
design solve it?

BTW, my use case is that I hoped to use CPython's tokenizer to parse 
some Python-ish-looking text and handle double-quoted strings for me.  
*Especially* all the escape sequences--leveraging all CPython's support 
for funny things like \N{penguin}.  The current behavior of the 
tokenizer makes me think it'd be easier to roll my own!
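
One possible way to get decoded strings without rolling a new tokenizer, sketched under assumptions: string_values() is a hypothetical helper, not part of the tokenize module, and it assumes the input is close enough to Python that tokenize accepts it. It hands each STRING token back to ast.literal_eval() so that CPython's own escape handling does the work.

```python
import ast
import io
import tokenize


def string_values(text):
    """Yield the decoded value of every plain string literal in text.

    Hypothetical helper: tokenizes text, then evaluates each STRING
    token so escape sequences are substituted by CPython itself.
    """
    readline = io.StringIO(text).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.STRING:
            # tok.string still contains the quotes and raw escapes.
            yield ast.literal_eval(tok.string)


sample = 'name = "\\N{SNOWMAN}" + "a\\tb"\n'
decoded = list(string_values(sample))
print(decoded)
```

This costs a second pass over each literal, but it avoids reimplementing the escape rules by hand.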
