UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
Tim Peters
tim_one@email.msn.com
Wed, 17 Nov 1999 22:21:16 -0500
[MAL]
> Guido and I have decided to turn \uXXXX into a standard
> escape sequence with no further magic applied. \uXXXX will
> only be expanded in u"" strings.
Does that exclude ur"" strings? Not arguing either way, just don't know
what all this means.
> Here's the new scheme:
>
> With the 'unicode-escape' encoding being defined as:
>
> · all non-escape characters represent themselves as a Unicode ordinal
> (e.g. 'a' -> U+0061).
Same as before (scream if that's wrong).
> · all existing defined Python escape sequences are interpreted as
> Unicode ordinals;
Same as before (ditto).
> note that \xXXXX can represent all Unicode ordinals,
This means that the definition of \xXXXX has changed, then -- as you pointed
out just yesterday <wink>, \xABCDq currently acts like \xCDq. Does the new
\x definition apply only in u"" strings, or in "" strings too? What is the
new \x definition?
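To make the \x question concrete, here is a rough sketch. The helper `greedy_x_escape` is hypothetical and only emulates the behavior described above (every following hex digit is consumed and the low 8 bits are kept); it is not taken from any Python source. The second half shows what present-day Python 3 does with the same literal, purely for contrast.

```python
import re

# Hypothetical emulation of the behavior described above: \x swallows every
# hex digit that follows it and keeps only the low 8 bits of the value.
# (Escaped backslashes are ignored here to keep the sketch short.)
def greedy_x_escape(body):
    return re.sub(r"\\x([0-9a-fA-F]+)",
                  lambda m: chr(int(m.group(1), 16) & 0xFF),
                  body)

greedy = greedy_x_escape(r"\xABCDq")
print(greedy == greedy_x_escape(r"\xCDq"))  # True

# Present-day Python 3, for contrast: \x takes exactly two hex digits.
modern = "\xABCDq"
print([hex(ord(ch)) for ch in modern])  # ['0xab', '0x43', '0x44', '0x71']
```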
> and \OOO (octal) can represent Unicode ordinals up to U+01FF.
Same as before (ditto).
> · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
> error to have fewer than 4 digits after \u.
Same as before (ditto).
IOW, I don't see anything that's changed other than an unspecified new
treatment of \x escapes, and possibly that ur"" strings don't expand \u
escapes.
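The four-digit rule for \u quoted above can be checked in a present-day Python 3 interpreter, assuming its plain string literals stand in for the u"" strings under discussion. The helper name `compiles` is made up for this illustration.

```python
def compiles(source):
    # Return True if `source` is accepted as a Python expression.
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

print(compiles(r"'\u1234'"))  # True: exactly four hex digits
print(compiles(r"'\u123'"))   # False: too few digits is a syntax error
```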
> Examples:
>
> u'abc' -> U+0061 U+0062 U+0063
> u'\u1234' -> U+1234
> u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+05c
The last example is damaged (U+05c isn't legit). Other than that, these
look the same as before.
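For reference, the quoted examples can be reproduced with a small sketch in present-day Python 3, assuming the trailing \n in the last example is meant as the ordinary newline escape. The helper `code_points` is hypothetical.

```python
def code_points(s):
    # Render each character of s as its Unicode ordinal in U+XXXX form.
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

print(code_points('abc'))           # U+0061 U+0062 U+0063
print(code_points('\u1234'))        # U+1234
print(code_points('abc\u1234\n'))   # U+0061 U+0062 U+0063 U+1234 U+000A
```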
> Now how should we define ur"abc\u1234\n" ... ?
If strings carried an encoding tag with them, the obvious answer is that
this acts exactly like r"abc\u1234\n" acts today, except that it gets a
"unicode-escaped" encoding tag instead of a "[whatever the default is
today]" encoding tag.
If strings don't carry an encoding tag with them, you're in a bit of a
pickle: you'll have to convert it to a regular string or a Unicode string,
but in either case have no way to communicate that it may need further
processing; i.e., no way to distinguish it from a regular or Unicode string
produced by any other mechanism. The code I posted yesterday remains my
best answer to that unpleasant puzzle (i.e., produce a Unicode string,
fiddling with backslashes just enough to get the \u escapes expanded, in the
same way Java's (conceptual) preprocessor does it).
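A minimal sketch of that last idea, for illustration only: this is not the code referred to above. It assumes the Java-style rule that a \u counts as an escape only when its backslash is preceded by an even number of backslashes. The name `expand_u_escapes` is hypothetical, and malformed escapes are simply left untouched here rather than reported as errors.

```python
import re

# A \u escape is recognized only when an even number of backslashes
# (possibly zero) comes before its own backslash.
_U_ESCAPE = re.compile(r"(?<!\\)((?:\\\\)*)\\u([0-9a-fA-F]{4})")

def expand_u_escapes(raw):
    # Expand \uXXXX and leave every other backslash sequence exactly as
    # written, which is what "raw" means for the rest of the string.
    def repl(match):
        return match.group(1) + chr(int(match.group(2), 16))
    return _U_ESCAPE.sub(repl, raw)

result = expand_u_escapes(r"abc\u1234\n")
print(len(result))  # 6: a, b, c, U+1234, backslash, n
```

With this rule, r"\\u1234" (two backslashes) is left alone, since the backslash before the u is itself escaped.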