[I18n-sig] raw-unicode-escape encoding

Wed, 06 Mar 2002 21:02:56 -0500

If this isn't the correct venue, please let me know. (The right people
seem to be hanging around.)

I've come across something strange while adding some Unicode
characters to the output generated by the Docutils projects (see my
signature for URLs). I want to get 7-bit ASCII output for the test
suite, but I want to keep newlines, so I'm using the
'raw-unicode-escape' codec. I assumed that this codec would convert
any character whose ord(char) > 127 to "\\uXXXX". This does not seem
to be the case for ord(char) between 128 and 255 inclusive.

Here's my default encoding::

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Here's a Unicode string that works::

    >>> u = u'\u2020\u2021'
    >>> s = u.encode('raw-unicode-escape')
    >>> s
    '\\u2020\\u2021'
    >>> print s
    \u2020\u2021

That's what I want. When I run the string (not Unicode) through the
codec again, there's no change (which is good)::

    >>> s.encode('raw-unicode-escape')
    '\\u2020\\u2021'

Here's a Unicode string that doesn't work::

    >>> u = u'\u00A7\u00B6'
    >>> s = u.encode('raw-unicode-escape')
    >>> s
    '\xa7\xb6'
    >>> print s
    §¶

(The last line contains the § and ¶ characters, which may have been
corrupted in transit.)

Note that although these characters have ordinals > 127, they are not
converted into '\\uXXXX' escapes. It seems that the
'raw-unicode-escape' codec assumes Latin-1 for output. But my
default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I
get 7-bit ASCII output for \u0080 through \u00FF?

The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts
newlines to '\\n', which I don't want.
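
For comparison, here is a small sketch that puts the two codecs side by side on one string containing a Latin-1 character, a newline, and a character above U+00FF. It is written in Python 3 syntax (an assumption; the sessions in this message are Python 2.2), where ``encode`` returns ``bytes``. The variable names are illustrative only.

```python
# One Latin-1 character (U+00A7), a newline, and one character above U+00FF.
text = '\u00a7\n\u2020'

# raw-unicode-escape: U+00A7 becomes the single byte 0xA7 (not 7-bit),
# the newline is kept, and U+2020 becomes a \uXXXX escape.
raw = text.encode('raw-unicode-escape')

# unicode-escape: everything is 7-bit, but the newline is escaped too.
esc = text.encode('unicode-escape')

print(repr(raw))   # b'\xa7\n\\u2020'
print(repr(esc))   # b'\\xa7\\n\\u2020'
```

Neither result is both 7-bit clean and newline-preserving, which is the gap described above.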

Running the string (now an 8-bit string, not 7-bit ASCII) through the
codec again crashes::

    >>> s.encode('raw-unicode-escape')
    Traceback (most recent call last):
      File "<pyshell#13>", line 1, in ?
        s.encode('raw-unicode-escape')
    UnicodeError: ASCII decoding error: ordinal not in range(128)

Is this because ``s`` is being coerced into a Unicode string, and it
fails because the default encoding is 'ascii' but ``s`` contains 8-bit
characters? Do I even have my terminology straight? ;-)
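
If that guess is right, the failure can be reproduced by making the two steps explicit. The sketch below is in Python 3 syntax (an assumption; Python 2 performs the decode step implicitly when ``encode`` is called on an 8-bit string), and it only illustrates the suspected mechanism, not the interpreter's actual code path.

```python
# The 8-bit result of the earlier raw-unicode-escape call.
s = b'\xa7\xb6'

# Step 1 (implicit in Python 2): decode the byte string with the default
# 'ascii' encoding. Bytes above 127 cannot be decoded, so this fails.
try:
    s.decode('ascii')
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(reason)   # ordinal not in range(128)

# Naming the real encoding explicitly avoids the error; step 2 (the encode)
# then succeeds, although the result is still the 8-bit b'\xa7\xb6'.
again = s.decode('latin-1').encode('raw-unicode-escape')
```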

Is this a bug? I'll open a bug report if it is. Any workarounds?
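
One possible workaround, sketched here in Python 3 syntax (an assumption; the sessions in this message are Python 2.2), is to do the escaping by hand. ``escape_non_ascii`` is a hypothetical helper, not part of any library: it leaves every character below 128 untouched, newlines included, and writes everything else as a ``\uXXXX`` (or ``\UXXXXXXXX``) escape.

```python
def escape_non_ascii(text):
    """Hypothetical helper: escape every character above 127, keep the rest."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # 7-bit ASCII, including newlines, passes through unchanged.
            pieces.append(ch)
        elif code <= 0xFFFF:
            pieces.append('\\u%04x' % code)
        else:
            # Characters outside the Basic Multilingual Plane.
            pieces.append('\\U%08x' % code)
    return ''.join(pieces)


original = '\u00a7\u00b6\n\u2020\u2021'
escaped = escape_non_ascii(original)
print(escaped)
# \u00a7\u00b6
# \u2020\u2021

# The output is pure ASCII and can be decoded again with raw-unicode-escape.
restored = escaped.encode('ascii').decode('raw-unicode-escape')
```

Literal backslashes in the input are not escaped, so text that already contains something like ``\u00a7`` as plain characters would not round-trip; that is acceptable for test-suite output but worth keeping in mind.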

I get these results with Python 2.2, on US versions of both Win2K and
MacOS 8.6. On Win2K I tried this from IDLE and from a Python session
within GNU Emacs 20.7.1; on MacOS I used the PythonInterpreter app.
The results were identical all around.

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net