[Python-Dev] should we keep the \xnnnn escape in unicode strings?

Sat, 15 Jul 2000 18:58:45 +0200

Fredrik Lundh wrote:
> 
> as tim pointed out in an earlier thread (on SRE), the
> \xnn escape code is something of a kludge.
> 
> I just noted that the unicode string type supports \x
> as well as \u, with slightly different semantics:
> 
>     \u -- exactly four hexadecimal characters are read.
> 
>     \x -- 1 or more hexadecimal characters are read, and
>     the result is cast to a Py_UNICODE character.

\x is supported in Unicode strings for compatibility with the
8-bit string implementation and to stay in sync with ANSI C.
Guido wanted these semantics when I asked him about it during
the implementation phase.
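
To make the two rules quoted above concrete, here is a rough sketch in
present-day Python. It is not the actual tokenizer code: the names read_u
and read_x are made up for illustration, and the 16-bit width used to mimic
the cast to Py_UNICODE is an assumption.

```python
import string

HEX = set(string.hexdigits)


def read_u(text, pos):
    """Backslash-u rule: read exactly four hex digits starting at pos."""
    digits = text[pos:pos + 4]
    if len(digits) != 4 or any(c not in HEX for c in digits):
        raise ValueError("expected exactly four hex digits")
    return int(digits, 16), pos + 4


def read_x(text, pos, width_bits=16):
    """Backslash-x rule: read one or more hex digits, then truncate.

    width_bits=16 is an assumed width for Py_UNICODE; the mask mimics
    a C cast to an unsigned integer type of that width.
    """
    end = pos
    while end < len(text) and text[end] in HEX:
        end += 1
    if end == pos:
        raise ValueError("expected at least one hex digit")
    value = int(text[pos:end], 16) & ((1 << width_bits) - 1)
    return value, end
```

With these helpers, read_u stops after "0061" in "0061bcd", while read_x
consumes all seven characters, since "b", "c" and "d" are hex digits too.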

> I'm pretty sure this isn't an optimal design, but I'm not sure
> how it should be changed:
> 
>     1. treat \x as a hexadecimal byte, not a hexadecimal
>     character.  or in other words, make sure that
> 
>         ord("\xabcd") == ord(u"\xabcd")
> 
>     fwiw, this is how it's done in SRE's parser (see the
>     python-dev archives for more background).
> 
>     2. ignore \x.  after all, \u is much cleaner.
> 
>         u"\xabcd" == "\\xabcd"
>         u"\u0061" == "\x61" == "\x0061" == "\x00000061"
> 
>     3. treat \x as an encoding error.
> 
>     4. read no more than 4 characters.  (a comment in the
>     code says that \x reads 0-4 characters, but the code
>     doesn't match that comment)
> 
>         u"\x0061bcd" == "abcd"
> 
>     5. leave it as it is (just fix the comment).

I'd suggest 5 -- makes converting 8-bit strings using \x
to Unicode a tad easier.
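
For comparison, the options that differ only in how many digits are read
and how wide the result is (1, 4 and 5) can be sketched as one small
function in present-day Python. This is only an illustration: decode_x and
its parameters are hypothetical, the 16-bit mask for Py_UNICODE is an
assumption, and so is the reading of option 1 as "read all digits, keep the
low byte".

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_x(body, max_digits=None, mask=0xFFFF):
    """Decode the text that follows a backslash-x.

    Returns (code point, leftover text). max_digits=None means read
    greedily; mask models the width of the target character type.
    """
    n = 0
    while (n < len(body) and body[n] in HEX_DIGITS
           and (max_digits is None or n < max_digits)):
        n += 1
    if n == 0:
        raise ValueError("no hex digits after the escape")
    return int(body[:n], 16) & mask, body[n:]


# Option 1 (assumed reading): greedy, but keep only the low byte.
option1 = decode_x("abcd", mask=0xFF)

# Option 4: read no more than four digits; "bcd" stays literal text.
option4 = decode_x("0061bcd", max_digits=4)

# Option 5: leave it as it is, i.e. greedy with a 16-bit result (assumed).
option5 = decode_x("0061bcd")
```

Under these assumptions option 4 yields "a" followed by "bcd", whereas
option 5 swallows all seven digits and yields a single character.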

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/