[Python-3000] Raw strings containing \u or \U
Ron Adam
rrr at ronadam.com
Fri May 18 18:17:53 CEST 2007
Georg Brandl wrote:
> Ron Adam schrieb:
>> Guido van Rossum wrote:
>>> That would be great! This will automatically turn \u1234 into 6
>>> characters, right?
>> I'm not exactly clear when the '\uxxxx' characters get converted. There
>> isn't any conversion done in tokenizer.c that I can see. It's primarily
>> only concerned with finding the beginning and end of the string at that
>> point. It looks like everything between the beginning and end is just
>> passed along "as is" and is translated later in the chain.
>
> Look at Python/ast.c, which has functions parsestr() and decode_unicode().
> The latter calls PyUnicode_DecodeRawUnicodeEscape() which I think is the
> function you're looking for.
>
> Georg
Thanks, I'll look there.
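For reference, the conversion discussed above can be observed from Python through the raw_unicode_escape codec, which as far as I can tell is backed by the same C-level decoder. This is only a minimal sketch, assuming a current Python 3 interpreter; the variable names are made up for illustration.

```python
# Six source characters: backslash, 'u', and four hex digits.
raw = b"\\u1234"
assert len(raw) == 6

# The codec collapses the escape into a single code point.
decoded = raw.decode("raw_unicode_escape")
print(len(decoded))

# Other backslash sequences pass through untouched (still two characters).
untouched = b"\\n".decode("raw_unicode_escape")

# An even number of leading backslashes suppresses the \u conversion.
escaped = b"\\\\u1234".decode("raw_unicode_escape")
```

Leaving \u1234 as six characters in a raw string amounts to skipping this decoding step for raw literals.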
Python/ast.c should also be where I need to look to fix a glitch where the
last quote of a raw string is treated as both the end of the string and part
of the string.
>>> r'\'
"\\'"
Interestingly, it works just fine for raw byte strings. (I wish the letters
were reversed; saying "bytes-raw-string" is awkward.)
>>> br'\'
b'\\'
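For comparison, released CPython versions reject a raw literal that ends in a single backslash, because the tokenizer treats the backslash-quote pair as not closing the string. A quick way to check what a given build does is sketched below; accepts_literal is a hypothetical helper, not part of any library.

```python
def accepts_literal(text):
    """Return True if `text` compiles as an expression (hypothetical helper)."""
    try:
        compile(text, "<literal>", "eval")
    except SyntaxError:
        return False
    return True

# The four source characters r'\' : rejected by a stock interpreter.
trailing_backslash = "r'\\'"
print(accepts_literal(trailing_backslash))

# r'\'' is accepted, and the backslash stays in the resulting string.
escaped_quote = "r'\\''"
print(accepts_literal(escaped_quote), repr(eval(escaped_quote)))
```

On an unmodified interpreter the first literal should be refused and the second accepted, so a build that accepts the first one has changed how the closing quote is found.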
Anyway, I've made the corresponding modifications to tokenize.py and
tokenize_tests.txt.
The tests for tokenize.py need to be updated. They do a round trip test,
but I've found that doesn't mean it's the correct round trip!
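To illustrate the difference, here is a rough sketch of a round-trip check next to an independent check on the tokens themselves. It assumes the standard tokenize module of a current Python 3; token_pairs and the sample source line are invented for the example and are not taken from tokenize_tests.txt.

```python
import io
import tokenize

def token_pairs(source):
    """Return (type, string) pairs for every token in `source`."""
    readline = io.StringIO(source).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]

# Source text:  x = r'\u1234' + br'\\'
source = "x = r'\\u1234' + br'\\\\'\n"

# Round trip: tokenize, untokenize, tokenize again, then compare.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
rebuilt = tokenize.untokenize(tokens)
round_trip_ok = token_pairs(rebuilt) == token_pairs(source)

# Independent check: the STRING tokens are exactly the literals we wrote.
strings = [s for t, s in token_pairs(source) if t == tokenize.STRING]
print(round_trip_ok, strings)
```

A tokenizer that splits a string in the wrong place can still pass the first check if untokenize reassembles the pieces consistently, which is why the second, explicit comparison is worth having.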
Cheers,
Ron