[Python-3000] Raw strings containing \u or \U

Ron Adam rrr at ronadam.com
Fri May 18 18:17:53 CEST 2007


Georg Brandl wrote:
> Ron Adam schrieb:
>> Guido van Rossum wrote:
>>> That would be great! This will automatically turn \u1234 into 6
>>> characters, right?
>> I'm not exactly clear when the '\uxxxx' characters get converted.  There 
>> isn't any conversion done in tokenizer.c that I can see.  It's primarily 
>> only concerned with finding the beginning and ending of the string at that 
>> point.  It looks like everything between the beginning and end is just 
>> passed along "as is" and it's translated further later in the chain.
> 
> Look at Python/ast.c, which has functions parsestr() and decode_unicode().
> The latter calls PyUnicode_DecodeRawUnicodeEscape() which I think is the
> function you're looking for.
> 
> Georg

Thanks, I'll look there.

That should be the place to fix a glitch where the closing quote of a raw 
string is treated as both the end of the string and part of its contents.

 >>> r'\'
"\\'"

Interestingly, it works just fine for raw byte strings.  (I wish the letters 
were reversed; saying "bytes-raw-string" is awkward.)

 >>> br'\'
b'\\'
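
For comparison, here is a minimal sketch of how a released CPython 3 
interpreter treats these literals.  It assumes the sessions above come from a 
patched development build, since a stock interpreter rejects a raw string that 
ends in a single backslash.  The helper name literal_length is made up for 
illustration.

```python
def literal_length(source):
    """Return len() of the literal in `source`, or None if it fails to compile.

    The literal is passed as source text so that a SyntaxError can be
    caught at run time instead of stopping the whole script.
    """
    try:
        return len(eval(source))
    except SyntaxError:
        return None


# In Python 3 a raw string leaves \u1234 alone: six separate characters.
print(literal_length(r"r'\u1234'"))

# A backslash-quote pair inside a raw string keeps both characters.
print(literal_length(r"r'\''"))

# A raw string ending in a single backslash is unterminated, so this is None.
print(literal_length("r'\\'"))
```

The first call answers the \u1234 question (six characters), and the last one 
shows that the trailing-backslash case is still a syntax error rather than a 
one-character string.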

Anyway, I've made the corresponding modifications to tokenize.py and 
tokenize_tests.txt.

The tests for tokenize.py need to be updated.  They do a round-trip test, 
but I've found that a passing round trip doesn't mean the tokenization is 
correct!
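
To illustrate the round-trip point, here is a rough sketch using the tokenize 
module as it exists in current Python 3 (generate_tokens and untokenize), not 
the tokenize_tests.txt file mentioned above.  The helper roundtrip and the 
sample source line are assumptions made up for the example.

```python
import io
import tokenize


def roundtrip(source):
    """Tokenize `source`, then rebuild text from the full token stream."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    return tokens, tokenize.untokenize(tokens)


# One line containing a raw string with a \u escape left unconverted.
source = "x = r'\\u1234'\n"
tokens, rebuilt = roundtrip(source)

# Comparing rebuilt text with the input is the round-trip check by itself.
print(rebuilt == source)

# It would also pass if tokenizing and untokenizing shared the same mistake,
# so look at the STRING token directly as well.
string_tokens = [tok.string for tok in tokens if tok.type == tokenize.STRING]
print(string_tokens)
```

Checking the token text directly, and not only the rebuilt source, is what 
catches a case where the round trip succeeds for the wrong reason.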

Cheers,
    Ron






