[I18n-sig] raw-unicode-escape encoding

M.-A. Lemburg mal@lemburg.com
Thu, 07 Mar 2002 11:52:36 +0100


David Goodger wrote:
>
> If this isn't the correct venue, please let me know. (The right people
> seem to be hanging around.)
>
> I've come across something strange while adding some Unicode
> characters to the output generated by the Docutils projects (see my
> signature for URLs). I want to get 7-bit ASCII output for the test
> suite, but I want to keep newlines, so I'm using the
> 'raw-unicode-escape' codec. I assumed that this codec would convert
> any character whose ord(char) > 127 to "\\uXXXX". This does not seem
> to be the case for ord(char) between 128 and 255 inclusive.
>
> Here's my default encoding::
>
>     >>> import sys
>     >>> sys.getdefaultencoding()
>     'ascii'
>
> Here's a Unicode string that works::
>
>     >>> u = u'\u2020\u2021'
>     >>> s = u.encode('raw-unicode-escape')
>     >>> s
>     '\\u2020\\u2021'
>     >>> print s
>     \u2020\u2021
>
> That's what I want. When I run the string (not Unicode) through the
> codec again, there's no change (which is good)::
>
>     >>> s.encode('raw-unicode-escape')
>     '\\u2020\\u2021'
>
> Here's a Unicode string that doesn't work::
>
>     >>> u = u'\u00A7\u00B6'
>     >>> s = u.encode('raw-unicode-escape')
>     >>> s
>     '\xa7\xb6'
>     >>> print s
>     §¶
>
> (The last line contained the § and ¶ characters, probably
> corrupted.)
>
> Note that although the characters are ordinal > 127, they don't get
> converted into '\\uXXXX' escapes. It seems that the
> 'raw-unicode-escape' codec is assuming latin-1 for output. But my
> default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I
> get 7-bit ascii on \u0080 through \u00FF?

The unicode-escape codecs (raw and normal) both extend the
Latin-1 encoding with a few escaped characters. The difference
between the two is mainly in the way they decode escapes: the
raw codec only unescapes a small subset of the escapes that the
normal codec can handle.

Both codecs are mainly intended to encode/decode Unicode literals
in Python source code, so their functionality may differ a bit
from what you have in mind.
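
A minimal sketch of the difference on the encoding side, written for a
current Python 3 interpreter rather than the 2.2 used in this thread
(an assumption; there, encode() returns bytes instead of an 8-bit
string). The variable names are only illustrative:

```python
# Sample text: two Latin-1 characters, a newline, and one character
# outside Latin-1.
text = "\u00a7\u00b6\n\u2020"

# Raw codec: code points below 256 pass through as single bytes,
# everything else becomes a \uXXXX escape. The newline is kept.
raw = text.encode("raw-unicode-escape")

# Normal codec: non-ASCII Latin-1 characters become \xXX escapes and
# the newline becomes a backslash-n sequence.
normal = text.encode("unicode-escape")

print(repr(raw))
print(repr(normal))
```

Neither result is "7-bit ASCII with newlines kept", which is why the
mapping has to be spelled out separately.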

> The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts
> newlines to '\\n', which I don't want.
>
> Running the string (now an 8-bit string, not 7-bit ASCII) through the
> codec again crashes::
>
>     >>> s.encode('raw-unicode-escape')
>     Traceback (most recent call last):
>       File "<pyshell#13>", line 1, in ?
>         s.encode('raw-unicode-escape')
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> Is this because ``s`` is being coerced into a Unicode string, and it
> fails because the default encoding is 'ascii' but ``s`` contains 8-bit
> characters? Do I even have my terminology straight? ;-)
>
> Is this a bug? I'll open a bug report if it is. Any workarounds?

You should first get a feeling for what kind of mapping
you expect, i.e. which characters should be escaped or not.
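
As one possible mapping, here is a rough sketch that keeps everything
in the 7-bit ASCII range (newlines included) and escapes the rest.
It is written for Python 3, and escape_non_ascii is a hypothetical
helper, not something the codecs module provides:

```python
def escape_non_ascii(text):
    """Hypothetical helper: keep ASCII as-is, escape all other characters."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # 7-bit ASCII, including newlines, is passed through unchanged.
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)
        else:
            # Characters outside the Basic Multilingual Plane.
            parts.append("\\U%08x" % code)
    return "".join(parts)


sample = "\u00a7\u00b6\n\u2020\u2021"
escaped = escape_non_ascii(sample)

# The result now encodes cleanly as 7-bit ASCII.
ascii_bytes = escaped.encode("ascii")
print(escaped)
```

Note that this sketch does not escape backslashes already present in
the input, so the output cannot always be decoded back unambiguously;
for test-suite output that may be acceptable.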

> I get these results with Python 2.2, on US versions of both Win2K and
> MacOS 8.6. On Win2K I tried this from IDLE and from a Python session
> within GNU Emacs 20.7.1, and on MacOS the test was done using the
> PythonInterpreter app.; identical results all around.

That's intended :-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/