<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <br>

    I fed this into tokenize.tokenize():<br>

    <blockquote><font size="+2"><tt>b''' x = "\u1234" '''</tt></font><br>

    </blockquote>

    I was a bit surprised to see the literal \u1234 escape in the output,

    particularly because the output (t.string) was a *string* and not *bytes*.<br>

    <br>

    It turns out Python's tokenizer doesn't interpret escape sequences.  All it

    does is skip the character after a backslash so that \" does the proper thing.

    But it doesn't do any substitutions.  The escape sequences are only

    handled when the AST node is created for the string literal!<br>

    <br>
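    To make that concrete, here's a minimal sketch of what I'm seeing.  The
    names source and raw are just mine for illustration, and I wrap the bytes
    in io.BytesIO because tokenize.tokenize() wants a readline callable:<br>

```python
import io
import tokenize

# The bytes contain a backslash followed by u1234, not the character U+1234.
source = b'x = "\\u1234"\n'

# tokenize.tokenize() takes a readline callable that returns bytes.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Pick out the STRING token; its .string attribute is a str, not bytes.
raw = [t.string for t in tokens if t.type == tokenize.STRING][0]

# The escape sequence is still there, unsubstituted, quotes and all.
print(repr(raw))
print(len(raw))
```

    So t.string is the source text of the literal, decoded to str but with
    the backslash escape left alone.<br>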

    Maybe I'm making a parade of my ignorance, but I assumed that string

    literals were parsed by the parser--just like everything else is

    parsed by the parser, hey it seems like a good place for it--and in

    particular that the escape sequence substitutions would be done in

    the tokenizer.  Having stared at it a little, I now detect a whiff

    of "this design solved a real problem".  So... what was the problem,

    and how does this design solve it?<br>

    <br>

    BTW, my use case is that I hoped to use CPython's tokenizer to parse

    some Python-ish-looking text and handle double-quoted strings for

    me.  *Especially* all the escape sequences--leveraging all of CPython's

    support for funny things like \N{PENGUIN}.  The current behavior of

    the tokenizer makes me think it'd be easier to roll my own!<br>

    <br>
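    For what it's worth, one possible workaround is to hand each STRING
    token to ast.literal_eval() and let it do the substitutions.  This is only
    a sketch: decoded_strings is a hypothetical helper name of mine, and I'm
    assuming the input contains only plain string literals (f-strings with
    expressions would not survive literal_eval):<br>

```python
import ast
import io
import tokenize


def decoded_strings(source: bytes):
    """Yield the decoded value of every STRING token in source.

    Hypothetical helper: assumes plain string literals only.
    """
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == tokenize.STRING:
            # literal_eval parses the literal's source text and applies
            # the escape sequence substitutions the tokenizer skipped.
            yield ast.literal_eval(tok.string)


values = list(decoded_strings(b'x = "\\u1234" + "\\N{PENGUIN}"\n'))
print(values)
```

    That gets the escapes handled, though it means re-parsing every string
    the tokenizer already scanned.<br>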

    <br>

    <i>/arry</i><br>

  </body>

</html>