[I18n-sig] raw-unicode-escape encoding
David Goodger
goodger@users.sourceforge.net
Thu, 07 Mar 2002 21:27:09 -0500
[David Goodger]
> > Note that although the characters are ordinal > 127, they don't
> > get converted into '\\uXXXX' escapes. It seems that the
> > 'raw-unicode-escape' codec is assuming latin-1 for output.
[Martin v. Loewis]
> Correct. raw-unicode-escape brings the Unicode string into a form
> suitable for usage in Python source code. In Python source code,
> bytes in range(128,256) are treated as Latin-1, regardless of your
> system encoding.
That seems contrary to the Python Reference Manual, chapter 2,
`Lexical analysis`__:
Future compatibility note: It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 ...
... it is unwise to assume either Latin-1 or UTF-8, even though
the current implementation appears to favor Latin-1. This applies
both to the source character set and the run-time character set.
__ http://www.python.org/doc/current/ref/lexical.html
"a form suitable for usage in Python source code": that's exactly what
I want. Cross-platform compatibility requires 7-bit ASCII source code.
The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
survive the trip to MacOS.
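To make the behaviour concrete, here is a minimal sketch of what the two codecs emit. It assumes Python 3 syntax (where every str is Unicode and encode() returns bytes), which is not what this thread was written against; `sample` and `is_seven_bit` are names made up for illustration.

```python
# Illustrative string: DAGGER (U+2020) followed by SECTION SIGN (U+00A7).
sample = "\u2020\u00a7"

raw = sample.encode("raw-unicode-escape")
cooked = sample.encode("unicode-escape")


def is_seven_bit(data):
    """Return True if every byte of `data` is in the 7-bit ASCII range."""
    return all(byte < 128 for byte in data)


# raw keeps the section sign as the single Latin-1 byte 0xA7.
print(raw, is_seven_bit(raw))
# cooked spells the section sign with the four ASCII characters \xa7.
print(cooked, is_seven_bit(cooked))
```

Only the second result is safe to paste into a 7-bit source file; the first still carries one 8-bit byte.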
> > But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?
> I think the raw-unicode-escape codec should be changed to use hex
> escapes for this range.
+1. But '\xa7' or '\u00a7' escapes? Using the former (which the
unicode-escape codec currently does) assumes Latin-1 as the native
encoding. Hex escapes ('\x##') know nothing about the encoding; they
just produce raw bytes. Shouldn't unicode escapes always be of the
'\u####' variety?
For that matter, shouldn't the internal representation distinguish? ::
>>> u'\u2020\u00a7'
u'\u2020\xa7'
If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'.
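A quick check of that last point, as a hedged sketch in Python 3 syntax (an assumption; the session above is from an older interpreter). Inside a Unicode string the two spellings appear to name the same code point, so the difference would only matter once the text is written out as bytes. The variable names are illustrative.

```python
short_form = "\xa7"    # hex escape inside a Unicode (text) string
long_form = "\u00a7"   # 16-bit Unicode escape for the same code point

print(short_form == long_form)      # compares code points, not spellings
print(hex(ord(short_form)))

# ascii() mirrors the repr shown above: it picks the shortest escape,
# so U+00A7 is displayed as \xa7 even though it was typed as \u00a7.
shown = ascii("\u2020\u00a7")
print(shown)
```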
> > Is this a bug? I'll open a bug report if it is. Any workarounds?
>
> It is not really a bug. Does it cause problems for you?
Yes. In the Docutils test suite, most of the tests are data-driven
from (input, expected output) pairs. Here's an example::
# input:
["""\
[#autolabel]_

.. [#autolabel] text
""",
# expected output (indented pseudo-xml for readability):
"""\
<document>
<paragraph>
<footnote_reference auto="1" refname="autolabel">
1
<footnote auto="1" id="id1" name="autolabel">
<label>
1
<paragraph>
text
"""],
The test takes the input, runs it through the system, and compares it
to the expected output. If there is any difference, the actual &
expected output are run through difflib.Differ().compare() and printed
out.
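The comparison step can be sketched roughly as follows. This is not the Docutils test code: `run_system` is a hypothetical stand-in for the real processing, and `check` is an illustrative name. Only difflib.Differ().compare() comes from the description above.

```python
import difflib


def run_system(text):
    """Hypothetical stand-in for the real system; it just echoes its input."""
    return text


def check(input_text, expected, process=run_system):
    """Return '' if the output matches `expected`, else a Differ report."""
    actual = process(input_text)
    if actual == expected:
        return ""
    # Differ works on lists of lines; keep the line endings for readable output.
    diff = difflib.Differ().compare(expected.splitlines(True),
                                    actual.splitlines(True))
    return "".join(diff)


# One matching line and one differing line.
report = check("one\ntwo\n", "one\n2\n")
print(report)
```

Lines prefixed with "- " come from the expected output and lines prefixed with "+ " from the actual output.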
This works fine for 7-bit strings. If the expected output contains any
unicode, I have to escape it. Fine. There's no problem for ord(char)
>= 256, but it breaks for 128 <= ord(char) < 256. Look at the label of
footnote 4::
["""\
A sequence of symbol footnotes:

.. [*] Auto-symbol footnote 1.
.. [*] Auto-symbol footnote 2.
.. [*] Auto-symbol footnote 3.
.. [*] Auto-symbol footnote 4.
""",
"""\
<document>
<paragraph>
A sequence of symbol footnotes:
<footnote auto="*" id="id1">
<label>
*
<paragraph>
Auto-symbol footnote 1.
<footnote auto="*" id="id2">
<label>
\\u2020
<paragraph>
Auto-symbol footnote 2.
<footnote auto="*" id="id3">
<label>
\\u2021
<paragraph>
Auto-symbol footnote 3.
<footnote auto="*" id="id4">
<label>
ß
<paragraph>
Auto-symbol footnote 4.
"""],
The \xa7 breaks going cross-platform. It doesn't produce a § on
my Mac.
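A small sketch of why the byte changes meaning, assuming the Mac side reads 8-bit text as Mac OS Roman (an assumption; the thread does not name the encoding). It uses Python 3 syntax, and the variable names are illustrative.

```python
# The section sign as raw-unicode-escape would write it: one Latin-1 byte.
section_byte = "\u00a7".encode("latin-1")

as_latin1 = section_byte.decode("latin-1")      # how the byte was meant
as_macroman = section_byte.decode("mac_roman")  # how a Mac OS Roman reader sees it

print(section_byte, as_latin1, as_macroman)

# An escaped, pure-ASCII spelling decodes identically under both encodings.
escaped = b"\\u00a7"
print(escaped.decode("latin-1") == escaped.decode("mac_roman"))
```

Under that assumption the byte 0xA7 decodes as ß (U+00DF) rather than §, while the ASCII escape is unaffected.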
[Marc-Andre Lemburg]
> The unicode-escape codecs (raw and normal) both extend the
> Latin-1 encoding with a few escaped characters.
Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?
> You should first get a feeling for what kind of mapping
> you expect, i.e. which characters should be escaped or not.
I would like to use a codec which escapes every character with
ord(char) >= 128, but leaves all 7-bit ASCII alone. Is there any such beast?
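For illustration, such a mapping could be sketched like this. It is written in Python 3 syntax (an assumption), and `escape_non_ascii` is a made-up helper, not an existing codec. The "backslashreplace" error handler used for comparison exists in later Python versions and spells code points below 256 as \xNN rather than \u00NN.

```python
def escape_non_ascii(text):
    """Escape every character with ord(char) >= 128; leave 7-bit ASCII alone.

    Like the raw codec, existing backslashes are not doubled.
    """
    parts = []
    for char in text:
        code = ord(char)
        if code < 128:
            parts.append(char)
        elif code <= 0xFFFF:
            parts.append("\\u%04x" % code)
        else:
            parts.append("\\U%08x" % code)
    return "".join(parts)


original = "a\u2020\u00a7"
escaped = escape_non_ascii(original)
builtin = original.encode("ascii", "backslashreplace")

print(escaped)   # always the \uXXXX form
print(builtin)   # \xa7 for the section sign

# The escaped text is pure ASCII and decodes back with raw-unicode-escape.
restored = escaped.encode("ascii").decode("raw-unicode-escape")
```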
-- 
David Goodger goodger@users.sourceforge.net Open-source projects:
- Python Docstring Processing System: http://docstring.sourceforge.net
- reStructuredText: http://structuredtext.sourceforge.net
- The Go Tools Project: http://gotools.sourceforge.net