[I18n-sig] raw-unicode-escape encoding
Martin v. Loewis
martin@v.loewis.de
07 Mar 2002 08:38:50 +0100
David Goodger <goodger@users.sourceforge.net> writes:
> Note that although the characters are ordinal > 127, they don't get
> converted into '\\uXXXX' escapes. It seems that the
> 'raw-unicode-escape' codec is assuming latin-1 for output.
Correct. raw-unicode-escape brings the Unicode string into a form
suitable for usage in Python source code. In Python source code,
bytes in range(128,256) are treated as Latin-1, regardless of your
system encoding.
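A minimal sketch of the behaviour described above, written in present-day Python 3 syntax rather than the Python 2.2 of this thread (an assumption on my part; the codec name is the same). It shows that a character in the Latin-1 range comes out as a single raw byte, while a character above U+00FF becomes a \uXXXX escape. The variable names are illustrative only.

```python
# Characters in the Latin-1 range are emitted as single raw bytes,
# while anything above U+00FF becomes a \uXXXX escape.
latin1_char = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, inside Latin-1
euro_sign = "\u20ac"     # EURO SIGN, outside Latin-1

encoded_latin1 = latin1_char.encode("raw-unicode-escape")
encoded_euro = euro_sign.encode("raw-unicode-escape")

print(repr(encoded_latin1))  # b'\xe9'     (one byte, not an escape)
print(repr(encoded_euro))    # b'\\u20ac'  (six ASCII bytes)
```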
> But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?
Your system encoding is (currently) irrelevant to how non-ASCII bytes
are interpreted in Python source code; this will change under PEP 263.
So I think the raw-unicode-escape codec should be changed to use hex
escapes for this range.
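One possible reading of this proposal, sketched in present-day Python 3: keep the codec's \uXXXX output but also write the bytes in range(128,256) as \u00XX, so that the result is pure ASCII. The helper name raw_unicode_escape_ascii is hypothetical, and the choice of \u00XX rather than \xXX is an assumption; the message does not spell out the escape form.

```python
def raw_unicode_escape_ascii(text):
    """Hypothetical helper: raw-unicode-escape, but ASCII-only output.

    Bytes in range(128, 256), which the codec emits raw, are rewritten
    as \\u00XX escapes. This is an illustration, not the codec's behaviour.
    """
    raw = text.encode("raw-unicode-escape")
    out = bytearray()
    for byte in raw:
        if byte >= 0x80:
            # Replace the raw Latin-1 byte with an ASCII \u00XX escape.
            out += b"\\u%04x" % byte
        else:
            out.append(byte)
    return bytes(out)


escaped = raw_unicode_escape_ascii("\u00e9\u20ac")
print(repr(escaped))
```

For this input the result decodes back to the original string with .decode("raw-unicode-escape"), since the decoder accepts \uXXXX escapes for any code point.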
> Running the string (now an 8-bit string, not 7-bit ASCII) through the
> codec again crashes::
>
> >>> s.encode('raw-unicode-escape')
> Traceback (most recent call last):
> File "<pyshell#13>", line 1, in ?
> s.encode('raw-unicode-escape')
> UnicodeError: ASCII decoding error: ordinal not in range(128)
That's a pilot error: use .decode to decode from some byte string into
a Unicode object. Better yet, use the unicode() builtin.
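A short sketch of the direction I mean, again in present-day Python 3 syntax (an assumption; there the unicode() builtin is spelled str(), and byte strings have no .encode method at all). Going from bytes to text is a decode:

```python
# A byte string as produced by the raw-unicode-escape codec:
# one raw Latin-1 byte followed by a \uXXXX escape.
encoded = b"\xe9\\u20ac"

# Decode with the method ...
text = encoded.decode("raw-unicode-escape")

# ... or with the constructor, which takes the encoding as its
# second argument.
same_text = str(encoded, "raw-unicode-escape")

print(text == same_text)
```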
> Is this because ``s`` is being coerced into a Unicode string, and it
> fails because the default encoding is 'ascii' but ``s`` contains 8-bit
> characters? Do I even have my terminology straight? ;-)
Not in this case, no.
> Is this a bug? I'll open a bug report if it is. Any workarounds?
It is not really a bug. Does it cause problems for you?
Regards,
Martin