[I18n-sig] raw-unicode-escape encoding

David Goodger goodger@users.sourceforge.net
Thu, 07 Mar 2002 21:27:09 -0500


[David Goodger]
> > Note that although the characters are ordinal > 127, they don't
> > get converted into '\\uXXXX' escapes. It seems that the
> > 'raw-unicode-escape' codec is assuming latin-1 for output.

[Martin v. Loewis]
> Correct. raw-unicode-escape brings the Unicode string into a form
> suitable for usage in Python source code. In Python source code,
> bytes in range(128,256) are treated as Latin-1, regardless of your
> system encoding.

That seems contrary to the Python Reference Manual, chapter 2,
`Lexical analysis`__:

    Future compatibility note: It may be tempting to assume that the
    character set for 8-bit characters is ISO Latin-1 ...
    ... it is unwise to assume either Latin-1 or UTF-8, even though
    the current implementation appears to favor Latin-1. This applies
    both to the source character set and the run-time character set.

    __ http://www.python.org/doc/current/ref/lexical.html

"a form suitable for usage in Python source code": that's exactly what
I want. Cross-platform compatibility requires 7-bit ASCII source code.
The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
survive the trip to MacOS.
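
To make the behaviour concrete, here is a minimal sketch. It assumes a current Python 3 interpreter (the session in this thread is Python 2, where the literal would be spelled ``u'...'``), and the sample string is made up for the illustration.

```python
# Assumes Python 3; the sample text is illustrative only.
text = '\u2020\xa7'          # DAGGER followed by SECTION SIGN

encoded = text.encode('raw-unicode-escape')

# Code points above 255 come out as backslash-u escapes, but U+00A7 is
# emitted as the single Latin-1 byte 0xA7, so the result is not 7-bit clean.
print(encoded)
is_seven_bit = all(byte < 128 for byte in encoded)
print(is_seven_bit)
```

The second line printed is ``False``: the output still contains one byte above 127.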

> > But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?

> I think the raw-unicode-escape codec should be changed to use hex
> escapes for this range.

+1. But '\xa7' or '\u00a7' escapes? Using the former (which the
unicode-escape codec currently does) assumes Latin-1 as the native
encoding. Hex escapes ('\x##') know nothing about the encoding; they
just produce raw bytes. Shouldn't unicode escapes always be of the
'\u####' variety?

For that matter, shouldn't the internal representation distinguish? ::

    >>> u'\u2020\u00a7'
    u'\u2020\xa7'

If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'.
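
A hedged sketch of where the two spellings do and do not differ, assuming Python 3, where text strings and byte strings are separate types (an assumption; it is not the interpreter used above). Inside a text string both escapes name the same code point; only in a byte string is ``\xa7`` a raw byte with no encoding attached.

```python
# Assumes Python 3 (str is Unicode text, bytes is raw bytes).
as_hex = '\xa7'
as_u = '\u00a7'

# In a text string both escapes denote code point U+00A7.
same = (as_hex == as_u)
print(same, hex(ord(as_hex)))

# ascii() chooses the shortest escape, which is why the echo shows \xa7.
shown = ascii('\u2020\u00a7')
print(shown)

# In a byte string, \xa7 is just the byte 0xA7; no character is implied.
raw = b'\xa7'
print(raw[0])
```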

> > Is this a bug? I'll open a bug report if it is. Any workarounds?
>
> It is not really a bug. Does it cause problems for you?

Yes. In the Docutils test suite, most of the tests are data-driven
from (input, expected output) pairs. Here's an example::

    # input:
    ["""\
    [#autolabel]_

    .. [#autolabel] text
    """,
    # expected output (indented pseudo-xml for readability):
    """\
    <document>
        <paragraph>
            <footnote_reference auto="1" refname="autolabel">
                1
        <footnote auto="1" id="id1" name="autolabel">
            <label>
                1
            <paragraph>
                text
    """],

The test takes the input, runs it through the system, and compares the
result to the expected output. If there is any difference, the actual
and expected output are run through difflib.Differ().compare() and
printed out.
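
The comparison step can be sketched roughly as follows. This is not the actual Docutils test code: ``compare_output`` is a hypothetical helper and the sample strings are invented, but ``difflib.Differ().compare()`` is the standard-library call named above.

```python
import difflib

def compare_output(expected, actual):
    # Hypothetical helper: return '' when the strings match, otherwise a
    # line-by-line report built with difflib.Differ().
    if expected == actual:
        return ''
    diff = difflib.Differ().compare(expected.splitlines(True),
                                    actual.splitlines(True))
    return ''.join(diff)

expected = '<label>\n    1\n'
actual = '<label>\n    2\n'

report = compare_output(expected, actual)
print(report)
```

In the report, lines starting with ``- `` come from the expected output and lines starting with ``+ `` come from the actual output.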

This works fine for 7-bit strings. If the expected output contains any
Unicode, I have to escape it. Fine. There's no problem for ord(char)
>= 256, but it breaks for 128 <= ord(char) < 256. Look at the label of
footnote 4::

    ["""\
    A sequence of symbol footnotes:

    .. [*] Auto-symbol footnote 1.
    .. [*] Auto-symbol footnote 2.
    .. [*] Auto-symbol footnote 3.
    .. [*] Auto-symbol footnote 4.
    """,
    """\
    <document>
        <paragraph>
            A sequence of symbol footnotes:
        <footnote auto="*" id="id1">
            <label>
                *
            <paragraph>
                Auto-symbol footnote 1.
        <footnote auto="*" id="id2">
            <label>
                \\u2020
            <paragraph>
                Auto-symbol footnote 2.
        <footnote auto="*" id="id3">
            <label>
                \\u2021
            <paragraph>
                Auto-symbol footnote 3.
        <footnote auto="*" id="id4">
            <label>
                ß
            <paragraph>
                Auto-symbol footnote 4.
    """],

The \xa7 breaks going cross-platform. It doesn't produce a &sect; on
my Mac.
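
A small sketch of why the byte does not travel, assuming Python 3 and assuming the Mac in question uses the classic Mac OS Roman character set (exposed in Python as the ``mac_roman`` codec). The same byte is read as two different characters.

```python
import unicodedata

# Assumes Python 3 and its bundled 'latin-1' and 'mac_roman' codecs.
raw = b'\xa7'

as_latin1 = raw.decode('latin-1')      # the byte read as ISO Latin-1
as_macroman = raw.decode('mac_roman')  # the byte read as Mac OS Roman

# Print the Unicode names so the output itself stays 7-bit ASCII.
print(unicodedata.name(as_latin1))
print(unicodedata.name(as_macroman))
```

Under Latin-1 the byte is SECTION SIGN; under Mac OS Roman it is LATIN SMALL LETTER SHARP S.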

[Marc-Andre Lemburg]
> The unicode-escape codecs (raw and normal) both extend the
> Latin-1 encoding with a few escaped characters.

Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?

> You should first get a feeling for what kind of mapping
> you expect, i.e. which characters should be escaped or not.

I would like to use a codec which escapes every character with
ord(char) >= 128 but leaves all 7-bit ASCII alone. Is there any such beast?
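
For illustration, one way such a mapping could look. This is a sketch assuming Python 3, not an existing codec: ``escape_non_ascii`` is a hypothetical helper. The ``backslashreplace`` error handler shown at the end is a real standard-library feature of later Python versions, and it falls back to ``\x`` escapes for code points below 256.

```python
def escape_non_ascii(text):
    # Hypothetical helper: write every character from U+0080 up as a
    # backslash-u (or backslash-U) escape and leave 7-bit ASCII untouched.
    # Backslashes already in the text are not doubled, so text that itself
    # contains a literal escape sequence will not round-trip.
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            pieces.append(char)
        elif code <= 0xFFFF:
            pieces.append('\\u%04x' % code)
        else:
            pieces.append('\\U%08x' % code)
    return ''.join(pieces)

sample = '* \u2020 \u2021 \xa7'
escaped = escape_non_ascii(sample)
print(escaped)

# The escaped form is pure ASCII and decodes back to the original text.
restored = escaped.encode('ascii').decode('raw-unicode-escape')
print(restored == sample)

# Built-in alternative: also 7-bit, but uses \xa7 rather than \u00a7.
builtin = sample.encode('ascii', 'backslashreplace')
print(builtin)
```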

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net