UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Tim Peters tim_one@email.msn.com
Wed, 17 Nov 1999 22:21:16 -0500


[MAL]
> Guido and I have decided to turn \uXXXX into a standard
> escape sequence with no further magic applied. \uXXXX will
> only be expanded in u"" strings.

Does that exclude ur"" strings?  Not arguing either way, just don't know
what all this means.

> Here's the new scheme:
>
> With the 'unicode-escape' encoding being defined as:
>
> · all non-escape characters represent themselves as a Unicode ordinal
>   (e.g. 'a' -> U+0061).

Same as before (scream if that's wrong).

> · all existing defined Python escape sequences are interpreted as
>   Unicode ordinals;

Same as before (ditto).

> note that \xXXXX can represent all Unicode ordinals,

This means that the definition of \xXXXX has changed, then -- as you pointed
out just yesterday <wink>, \xABCDq currently acts like \xCDq.  Does the new
\x definition apply only in u"" strings, or in "" strings too?  What is the
new \x definition?
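
For illustration only, here is a minimal sketch of the \x behavior described above: all the hex digits after \x are consumed and only the low 8 bits are kept. The helper name legacy_x is hypothetical, it works on the literal's source text, and it ignores doubled backslashes for brevity. The last line shows that present-day Python 3 reads exactly two hex digits instead.

```python
import re

# \x followed by one or more hex digits, as found in the literal's source text.
_X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]+)")

def legacy_x(source_text):
    """Hypothetical emulation: swallow every hex digit, keep the low 8 bits."""
    return _X_ESCAPE.sub(lambda m: chr(int(m.group(1), 16) & 0xFF), source_text)

# \xABCDq and \xCDq come out the same under the emulated rule.
assert legacy_x(r"\xABCDq") == legacy_x(r"\xCDq") == "\xcdq"

# Present-day Python 3 reads exactly two hex digits after \x.
assert "\xABCDq" == "\xab" + "CDq"
```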

> and \OOO (octal) can represent Unicode ordinals up to U+01FF.

Same as before (ditto).

> · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
>   error to have fewer than 4 digits after \u.

Same as before (ditto).

IOW, I don't see anything that's changed other than an unspecified new
treatment of \x escapes, and possibly that ur"" strings don't expand \u
escapes.

> Examples:
>
> u'abc'          -> U+0061 U+0062 U+0063
> u'\u1234'       -> U+1234
> u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

The last example is damaged (U+05c isn't legit).  Other than that, these
look the same as before.
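
As a cross-check of the examples, here is a small sketch that lists the code points of a string. It assumes present-day Python 3, where every str literal is Unicode and no u prefix is needed, so it reflects today's rules rather than the proposal under discussion. The helper name code_points is made up for this example. Under those rules the trailing \n comes out as U+000A, which is presumably what the damaged last entry was meant to show.

```python
def code_points(s):
    """Format each character of s as a U+XXXX label."""
    return ["U+%04X" % ord(ch) for ch in s]

print(" ".join(code_points("abc")))
# U+0061 U+0062 U+0063
print(" ".join(code_points("abc\u1234\n")))
# U+0061 U+0062 U+0063 U+1234 U+000A
```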

> Now how should we define ur"abc\u1234\n"  ... ?

If strings carried an encoding tag with them, the obvious answer would be
that this acts exactly like r"abc\u1234\n" acts today, except that it gets a
"unicode-escaped" encoding tag instead of a "[whatever the default is
today]" encoding tag.

If strings don't carry an encoding tag with them, you're in a bit of a
pickle:  you'll have to convert it to a regular string or a Unicode string,
but in either case have no way to communicate that it may need further
processing; i.e., no way to distinguish it from a regular or Unicode string
produced by any other mechanism.  The code I posted yesterday remains my
best answer to that unpleasant puzzle (i.e., produce a Unicode string,
fiddling with backslashes just enough to get the \u escapes expanded, in the
same way Java's (conceptual) preprocessor does it).
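
The code from the earlier post is not reproduced here. The following is only a rough sketch of the idea, assuming Java's rule that a \u counts as an escape when its backslash is preceded by an even number of other backslashes (so the whole run of backslashes has odd length). The function name expand_u_escapes is hypothetical, and raising ValueError for a short escape is a choice made for this sketch, echoing the "fewer than 4 digits" rule quoted above.

```python
import re

# A run of backslashes, optionally followed by 'u' and up to four hex digits.
_RUN = re.compile(r"(\\+)(u[0-9a-fA-F]{0,4})?")

def expand_u_escapes(raw):
    """Expand \\uXXXX in raw text, leaving every other backslash untouched."""
    def fix(match):
        slashes, tail = match.group(1), match.group(2)
        # Not a \u at all, or the backslash before 'u' is itself escaped.
        if tail is None or len(slashes) % 2 == 0:
            return match.group(0)
        if len(tail) != 5:
            raise ValueError("fewer than 4 hex digits after \\u")
        # Drop the one backslash that introduced the escape; keep the rest.
        return slashes[:-1] + chr(int(tail[1:], 16))
    return _RUN.sub(fix, raw)

# \u1234 is expanded, but \n stays as a backslash followed by 'n'.
assert expand_u_escapes(r"abc\u1234\n") == "abc" + chr(0x1234) + "\\n"
# A doubled backslash before 'u' is not an escape.
assert expand_u_escapes(r"\\u1234") == r"\\u1234"
```

The result is a Unicode string in which only the \u escapes have been processed, so it cannot be told apart from a string built any other way, which is the limitation noted above.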