[I18n-sig] raw-unicode-escape encoding

Martin v. Loewis martin@v.loewis.de
08 Mar 2002 10:03:34 +0100

David Goodger <goodger@users.sourceforge.net> writes:

> [Martin v. Loewis]
> > Correct. raw-unicode-escape brings the Unicode string into a form
> > suitable for usage in Python source code. In Python source code,
> > bytes in range(128,256) are treated as Latin-1, regardless of your
> > system encoding.
> That seems contrary to the Python Reference Manual, chapter 2,
> `Lexical analysis`__:
>     Future compatibility note: It may be tempting to assume that the
>     character set for 8-bit characters is ISO Latin-1 ...
>     ... it is unwise to assume either Latin-1 or UTF-8, even though
>     the current implementation appears to favor Latin-1. This applies
>     both to the source character set and the run-time character set.

What contradiction do you see? The documentation says it is unwise to
assume anything non-ASCII, and that is certainly the case: it is not
wise to assume that.

I said that bytes above 128 are treated as Latin-1 in the current
implementation, and that is also a fact. Even though this is a fact,
it is not wise to make use of this fact - for example, PEP 263 will
change that; Python 2.3 likely will not assume that bytes above 128
are Latin-1, but will give a warning instead.
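
For readers on a current Python, here is a minimal sketch of the mechanism PEP 263 describes: an explicit coding declaration tells the compiler how to read bytes in range(128,256). It assumes Python 3, where compile() accepts a byte string and honors the declaration. The names source, namespace and s are illustrative only.

```python
# Source bytes containing the single byte 0xA7 inside a string literal.
# The first line is a PEP 263 coding declaration.
source = b"# -*- coding: latin-1 -*-\ns = '\xa7'\n"

namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

# Because of the declaration, byte 0xA7 is read as Latin-1, i.e. U+00A7.
print(ascii(namespace["s"]))
```

The point is that the interpretation comes from the declaration in the file, not from the system encoding or from whatever a text editor happens to display.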

> "a form suitable for usage in Python source code": that's exactly what
> I want. Cross-platform compatibility requires 7-bit ASCII source code.
> The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
> survive the trip to MacOS.

If you put it into Python source, it sure does survive the trip to
MacOS - assuming you manage not to convert the Python source file when
putting it on a Mac disk.

Your Mac text editor will not display it in the same way as Python
interprets it, but that can't change how the current Python
implementation interprets it.
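
To make the 8-bit versus 7-bit point concrete, the following sketch compares the two escape codecs on a short string. It assumes Python 3 behaviour, which as far as I know matches what is described in this thread; the variable names are illustrative.

```python
text = "a\u00a7\u2020"  # 'a', SECTION SIGN, DAGGER

raw = text.encode("raw-unicode-escape")
esc = text.encode("unicode-escape")

print(raw)  # U+00A7 is emitted as the single byte 0xA7, U+2020 as \u2020
print(esc)  # everything outside ASCII is escaped, so the result is 7-bit

# Both forms decode back to the original string with their own codec.
assert raw.decode("raw-unicode-escape") == text
assert esc.decode("unicode-escape") == text
```

Only the raw variant leaves bytes in range(128,256) in the output, which is why it is the one that can be damaged by a charset conversion in transit.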

> +1. But '\xa7' or '\u00a7' escapes? Using the former (which the
> unicode-escape codec currently does) assumes Latin-1 as the native
> encoding. Hex escapes ('\x##') know nothing about the encoding; they
> just produce raw bytes. Shouldn't unicode escapes always be of the
> '\u####' variety?

No. That makes absolutely no difference. \xXY, in a Unicode literal,
means "Unicode character with the numeric value 16*X + Y". \uVWXY means
"Unicode character with the numeric value 4096*V+256*W+16*X+Y" (I
leave defining \UKLMNOPQR as an exercise :-). So if V and W are both
0, then \u00XY is precisely the same as \xXY. No assumption of Latin-1
here, anywhere.
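
The arithmetic above can be checked directly. This is a small sketch assuming Python 3, where every str literal is a Unicode literal; the variable names are illustrative.

```python
# \xXY: numeric value 16*X + Y
section_x = "\xa7"
# \u00XY: numeric value 4096*0 + 256*0 + 16*X + Y
section_u = "\u00a7"
# \U takes eight hex digits and follows the same positional rule.
section_U = "\U000000a7"

assert section_x == section_u == section_U
assert ord(section_x) == 16 * 0xA + 0x7

dagger = "\u2020"
assert ord(dagger) == 4096 * 0x2 + 256 * 0x0 + 16 * 0x2 + 0x0
```

All three spellings produce the same one-character string, so the choice of escape says nothing about any byte encoding.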

> For that matter, shouldn't the internal representation distinguish? ::
>     >>> u'\u2020\u00a7'
>     u'\u2020\xa7'

No, for the same reason.
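
A short sketch of the display question, assuming Python 3: repr() there shows printable non-ASCII characters as they are, so ascii() is used here as the closest equivalent of the output quoted above.

```python
s = "\u2020\u00a7"
shown = ascii(s)
print(shown)

# The shorter \x form is only a display choice; evaluating the displayed
# text as a literal gives back the same two-character string.
assert eval(shown) == s
```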

> If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'.

You are mistaken.

> Yes. In the Docutils test suite, most of the tests are data-driven
> from (input, expected output) pairs. Here's an example::
> This works fine for 7-bit strings. If the expected output contains any
> unicode, I have to escape it. 

It appears then that the raw-unicode-escape codec is not suited for
your application; you will probably need to write your own. This has
the advantage that you'll know exactly what it is doing.
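
As an illustration of what "write your own" could look like, here is a minimal sketch, not an existing codec. The function name escape_non_ascii is hypothetical, and the choice to double backslashes is an assumption made so that the output stays unambiguous.

```python
def escape_non_ascii(text):
    # Leave 7-bit ASCII alone, except the backslash, which is doubled so
    # that a literal backslash cannot be confused with an escape.
    parts = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            parts.append("\\\\")
        elif code < 128:
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)   # always the four-digit form
        else:
            parts.append("\\U%08x" % code)   # characters beyond U+FFFF
    return "".join(parts)


print(escape_non_ascii("a\u2020\u00a7"))
```

Because every escape it emits is also a valid Python escape, the result can be turned back into the original with the unicode-escape codec, for example `escape_non_ascii(t).encode("ascii").decode("unicode-escape")`.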

This is a general principle: the Python "pretty printer" algorithms
have specific semantics, suitable for a specific application
(normally: giving some kind of output to the user of the interactive
prompt). People try to use those algorithms for different things all
the time, and complain if they don't do what they expect. In general,
the bug is not in Python, but in the application: if you use the
algorithm, you have to accept what it does in border cases.

> [Marc-Andre Lemburg]
> > The unicode-escape codecs (raw and normal) both extend the
> > Latin-1 encoding with a few escaped characters.
> Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?

The unicode-escape codec is not documented at all. How do you know it

> I would like to use a codec which escapes all char for ord(char)
> from 128 up, but leaves all 7-bit ASCII alone. Is there any such
> beast?

"utf-7" fits that description, except that it will escape
'+'. Something that produces ASCII and does not escape anything does
not exist - you normally have to escape the escape character.
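
A quick sketch of the utf-7 behaviour mentioned above, assuming the codec as shipped with Python 3; the variable names are illustrative.

```python
text = "a+b \u00a7"
encoded = text.encode("utf-7")
print(encoded)

# The result is 7-bit clean and decodes back to the original.
assert all(byte < 128 for byte in encoded)
assert encoded.decode("utf-7") == text

# '+' is the escape character, so a literal '+' is itself escaped.
plus_only = "a+b".encode("utf-7")
assert plus_only != b"a+b"
```

This shows the trade-off: the output is plain ASCII, but the escape character cannot pass through unchanged.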