[I18n-sig] raw-unicode-escape encoding

Martin v. Loewis martin@v.loewis.de
07 Mar 2002 08:38:50 +0100


David Goodger <goodger@users.sourceforge.net> writes:

> Note that although the characters are ordinal > 127, they don't get
> converted into '\\uXXXX' escapes. It seems that the
> 'raw-unicode-escape' codec is assuming latin-1 for output. 

Correct. raw-unicode-escape brings the Unicode string into a form
suitable for usage in Python source code. In Python source code,
bytes in range(128,256) are treated as Latin-1, regardless of your
system encoding.
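
A minimal sketch of this behaviour, assuming a current Python 3
interpreter rather than the Python 2.2 discussed in this thread; the
sample string and the variable names are illustrative only:

```python
# Sample text: U+00E9 (e with acute) lies in range(128, 256),
# U+20AC (euro sign) lies above it.
text = "caf\u00e9 \u20ac"

encoded = text.encode("raw-unicode-escape")

# U+00E9 is written as the single Latin-1 byte 0xE9;
# only U+20AC is turned into a \uXXXX escape.
print(encoded)  # b'caf\xe9 \\u20ac'
```

The output is therefore not pure 7-bit ASCII whenever the input
contains characters in the Latin-1 range.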

> But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?

Your system encoding is (currently) irrelevant to how non-ASCII bytes
are interpreted in Python source code; this will change under PEP 263. So
I think the raw-unicode-escape codec should be changed to use hex
escapes for this range.
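
As an illustration only of what ASCII-only output with hex escapes
looks like, here is a sketch that assumes a current Python 3
interpreter and uses the "backslashreplace" error handler. It is not
the codec change suggested above, and the sample string is made up:

```python
text = "caf\u00e9 \u20ac"

# Encode to ASCII, replacing every unencodable character by an escape:
# \xNN for characters below 256, \uNNNN for the rest of the BMP.
ascii_only = text.encode("ascii", "backslashreplace")

print(ascii_only)  # b'caf\\xe9 \\u20ac'
```

Every byte of the result is below 128, so it survives any
ASCII-compatible channel unchanged.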

> Running the string (now an 8-bit string, not 7-bit ASCII) through the
> codec again crashes::
> 
>     >>> s.encode('raw-unicode-escape')
>     Traceback (most recent call last):
>       File "<pyshell#13>", line 1, in ?
>         s.encode('raw-unicode-escape')
>     UnicodeError: ASCII decoding error: ordinal not in range(128)

That's a pilot error: use .decode to decode from some byte string into
a Unicode object. Better yet, use the unicode() builtin.
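
A minimal sketch of going in the right direction, assuming a current
Python 3 interpreter, where the byte string is a bytes object and
str() plays the role of the unicode() builtin. The sample bytes and
variable names are illustrative; the ASCII failure is reproduced
explicitly only for comparison with the quoted traceback:

```python
# Bytes as produced by the raw-unicode-escape codec.
raw = b"caf\xe9 \\u20ac"

# Decode the byte string back into a text (Unicode) object.
decoded = raw.decode("raw-unicode-escape")

# Equivalent spelling through the constructor.
same = str(raw, "raw-unicode-escape")

# Decoding the same bytes as ASCII fails, because 0xE9 is not in range(128).
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```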

> Is this because ``s`` is being coerced into a Unicode string, and it
> fails because the default encoding is 'ascii' but ``s`` contains 8-bit
> characters? Do I even have my terminology straight? ;-)

Not in this case, no.

> Is this a bug? I'll open a bug report if it is. Any workarounds?

It is not really a bug. Does it cause problems for you?

Regards,
Martin