[Python-Dev] should we keep the \xnnnn escape in unicode strings?
M.-A. Lemburg
mal@lemburg.com
Sat, 15 Jul 2000 18:58:45 +0200
Fredrik Lundh wrote:
>
> as tim pointed out in an earlier thread (on SRE), the
> \xnn escape code is something of a kludge.
>
> I just noted that the unicode string type supports \x
> as well as \u, with slightly different semantics:
>
> \u -- exactly four hexadecimal characters are read.
>
> \x -- 1 or more hexadecimal characters are read, and
> the result is cast to a Py_UNICODE character.
\x is supported in Unicode strings for compatibility with the
8-bit string implementation and to stay in sync with ANSI C.
Guido wanted these semantics when I asked him about it during
the implementation phase.
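
As a rough illustration of the two rules quoted above, here is a small
stand-alone Python model. It is not the C code from the Unicode
implementation: the helper names read_u_escape and read_x_escape are
made up for this sketch, and the 16-bit mask assumes a 2-byte
Py_UNICODE.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def read_u_escape(text, pos):
    # Model of the u escape: exactly four hex digits are read.
    # pos is the index of the first character after the "u".
    digits = text[pos:pos + 4]
    if len(digits) != 4 or not all(c in HEX_DIGITS for c in digits):
        raise ValueError("u escape needs exactly four hex digits")
    return int(digits, 16), pos + 4

def read_x_escape(text, pos, char_bits=16):
    # Model of the x escape as described: read every hex digit that
    # follows, then truncate to the character width (assumed 16 bits).
    end = pos
    while end < len(text) and text[end] in HEX_DIGITS:
        end += 1
    if end == pos:
        raise ValueError("x escape needs at least one hex digit")
    value = int(text[pos:end], 16) & ((1 << char_bits) - 1)
    return value, end
```

With the input "0061bcd", the u rule stops after "0061" and leaves
"bcd" as ordinary text, while the x rule swallows all seven digits.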
> I'm pretty sure this is not an optimal design, but I'm not sure
> how it should be changed:
>
> 1. treat \x as a hexadecimal byte, not a hexadecimal
> character. or in other words, make sure that
>
> ord("\xabcd") == ord(u"\xabcd")
>
> fwiw, this is how it's done in SRE's parser (see the
> python-dev archives for more background).
>
> 2. ignore \x. after all, \u is much cleaner.
>
> u"\xabcd" == "\\xabcd"
> u"\u0061" == "\x61" == "\x0061" == "\x00000061"
>
> 3. treat \x as an encoding error.
>
> 4. read no more than 4 characters. (a comment in the
> code says that \x reads 0-4 characters, but the code
> doesn't match that comment)
>
> u"\x0061bcd" == "abcd"
>
> 5. leave it as it is (just fix the comment).
I'd suggest 5 -- it makes converting 8-bit strings that use \x
to Unicode a tad easier.
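
To make the difference between options 1, 4 and 5 concrete, here is
a hedged sketch in Python. The function x_escape and the policy names
"byte", "max4" and "greedy" are invented for this example; the 8-bit
case assumes the low byte is kept, and the other two assume a 2-byte
Py_UNICODE.

```python
import string

def x_escape(tail, policy):
    # tail is the text right after the backslash and "x".
    # Returns (character value, number of hex digits consumed).
    limit = 4 if policy == "max4" else len(tail)
    n = 0
    while n < limit and n < len(tail) and tail[n] in string.hexdigits:
        n += 1
    if n == 0:
        raise ValueError("no hex digits after x escape")
    value = int(tail[:n], 16)
    if policy == "byte":
        value &= 0xFF      # option 1: same result as the 8-bit string
    else:
        value &= 0xFFFF    # options 4 and 5: assumed 16-bit character
    return value, n

for policy in ("byte", "max4", "greedy"):
    value, used = x_escape("0061bcd", policy)
    print(policy, hex(value), used)
```

Only the "max4" policy (option 4) turns "0061bcd" into the character
0x61 followed by the literal text "bcd"; the other two consume all
seven digits and differ only in how many low bits they keep.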
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/