[Python-Dev] PEP 263 - default encoding
Stephen J. Turnbull
stephen@xemacs.org
18 Mar 2002 15:05:35 +0900
Taken out of order.
>>>>> "Guido" == Guido van Rossum <guido@python.org> writes:
Guido> Same here. If you still think it's necessary, maybe you
Guido> can try to express exactly when you would want a program to
Guido> be declared illegal because of expected problems in phase
Guido> 2?
I guess my point is that I don't want to try to do that, because I'm
pretty sure I'd get it wrong for some common natural language or
platform encoding I have no specific knowledge of. Even the small
amount of detail in the PEP seems too much, to me. I think it's much
better to say "The parser accepts programs encoded in Unicode. We
provide some hooks to help you get from encodings convenient for your
environment to Unicode, and some sample implementations of things to
hang on the hooks. But if there are problems with non-Unicode files,
they're your problems."
I remain unconvinced that this PEP is as good as it could be, but I
don't have time to provide a full counter-proposal. It will provide
the benefits claimed for the people it's targeted to. However,
 o There may be some audiences who are poorly served (Mr. Suzuki).
 o I think it will definitely tend to encourage use of national/
   platform encodings rather than UTF-8 in source. It may be hard to
   get this sun to set.
 o I think it makes it hard to implement helper tools (e.g. python-mode).
 o I think it discourages a clean separation of the parser from the
   codecs (see below for examples).
That said, it's clearly better than the current situation. Since the
people who will be implementing it seem to be unconvinced by my
arguments, it's probably best to go ahead with it. I'll try to follow
implementation discussions and certainly would respond if asked.
>> Mr. Suzuki's friends. People who use UTF-16 strings in other
>> applications (eg Java), but otherwise are happy with English.
Guido> I think even Mr. Suzuki isn't thinking of using UTF-16 in
Guido> his Unicode literals. He currently sets UTF-16 as the
Guido> default encoding for data that he presumably reads from a
Guido> file.
Well, I'm not a native Japanese. But I have often edited English
strings that occur in swaths of unrecognizable octets that would be
Japanese if I had the terminal encoding set correctly. I have also
cut and pasted encoded Japanese into "binary" buffers.
And how is he going to use regexps or formatting sugar without literal
UTF-16 strings?
Guido> The other interpretation (that they would use UTF-16 inside
Guido> u"" and ASCII elsewhere) is just as insane, since no person
Guido> implementing a text editor with any form of encoding
Guido> support would be insane enough to support such a mixed-mode
Guido> encoding.
"I resemble that remark."
Seriously, that is _exactly_ what X?Emacs/Mule does as implementation
of multilingual buffers, since it's basically modeless ISO 2022.
Currently it does not get display right for the interpretation I'm
suggesting for Python strings, but it wouldn't be hard. However, that
would require that Emacs _ignore_ the python coding cookie, and then
turn around and have python-mode do the work. (This isn't a big deal,
but the Python interpreter will implicitly be doing something
similar---you won't be able to apply a standard codec and get what you
want.)
>> Are you going to deprecate the practice of putting KOI8-R into
>> ordinary strings?
[example of how it works if you just let it work snipped]
Guido> I think this will actually work.
Right, as long as by "work" you mean "it's formally undefined but
8-bit clean stuff just passes through." The problem is that people
often do unclean things, like type ALT 1 8 5 to insert an 0xB9 octet,
which the editor assumes is intended to be š in a Latin-2 locale.
However, if that file (which the user knows contains no Latin-2 at
all) is read in a Latin-2 locale, and translated to Unicode, the byte
value changes (in fact, it's no longer a byte value). What's a parser
to do?<wink>
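A minimal sketch of the value change described above, using Python's standard codecs. The variable names are illustrative only; the point is that decoding an octet as Latin-2 yields a code point different from the byte value, while a byte-for-code-point mapping (Latin-1) does not.

```python
# One octet typed as raw binary data (ALT 1 8 5 inserts 0xB9).
raw = b"\xb9"

# Decoding the file as Latin-2 turns the octet into a character whose
# code point is no longer the original byte value.
as_latin2 = raw.decode("iso8859_2")
print(hex(ord(as_latin2)))

# Only a decode that maps bytes 0-255 onto code points 0-255 keeps the value.
as_latin1 = raw.decode("latin-1")
print(hex(ord(as_latin1)))
```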
This can be made safe by not decoding the contents of ordinary string
literals, but that requires the parser itself to do the lexing; you
can't delegate it to a general-purpose codec.
Guido> But the treatment of k under phase 2 will be, um,
Guido> interesting, and I'm not sure what it should do!!!
Bingo. And files which until that point embedded arbitrary binary
(ie, not representing characters) stop working, quite possibly. (This
is a natural hack to anybody familiar with Emacs/Mule.)
Guido> Since in phase 2 the entire file will be decoded from
Guido> KOI8-R to Unicode before it's parsed, maybe the best thing
Guido> would be to encode 8-bit string literals back using KOI8-R
Guido> (in general, the encoding given in the encoding cookie).
This probably mostly works, based on Mule experience. But it requires
the parser to have carnal knowledge of coding systems. Isn't it
preferable to insist on UTF-8 here, since it's simply changing the
representation from one or two bytes back to a constant-width single
byte, without changing values?
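The two approaches can be compared in a rough sketch. The sample text and the cookie_encoding name are assumptions for illustration, and the second half is only one reading of the UTF-8 suggestion: octets are treated as values 0-255, which UTF-8 carries in one or two bytes each.

```python
cookie_encoding = "koi8-r"  # assumed value of the coding cookie

# The octets that sit on disk inside an ordinary "..." literal.
literal_bytes = "Привет".encode(cookie_encoding)

# Phase 2 as described: the whole file is decoded to Unicode first.
decoded = literal_bytes.decode(cookie_encoding)

# Suggested fix: encode 8-bit literals back with the cookie encoding.
# The parser must know which codec to call for this to work.
restored = decoded.encode(cookie_encoding)
print(restored == literal_bytes)

# Alternative reading: octets are plain values 0-255 carried in UTF-8.
as_values = literal_bytes.decode("latin-1")   # byte value == code point
utf8_form = as_values.encode("utf-8")         # one or two bytes per value
back = bytes(ord(ch) for ch in utf8_form.decode("utf-8"))
print(back == literal_bytes)
```

In the second path nothing depends on the national encoding: getting the octets back is a change of width, not a table lookup.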
Also, you'd have to prohibit encodings using ISO 2022 control sequences,
as there are always many legal ways to encode the same text (there is
no requirement that a mode-switching sequence actually be followed by
any text before switching to a different mode), and there's no way to
distinguish except to record the original input.
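A small sketch of that ambiguity, assuming Python's iso2022_jp codec as a stand-in for any ISO 2022 encoding. The second byte string switches to JIS X 0208 (ESC $ B) and straight back to ASCII (ESC ( B) before any text, which is legal but redundant.

```python
# Same text, two different byte sequences.
plain = b"abc"
redundant = b"\x1b$B\x1b(Babc"

text_plain = plain.decode("iso2022_jp")
text_redundant = redundant.decode("iso2022_jp")
print(text_plain == text_redundant)

# Re-encoding the decoded text cannot know which form the file used,
# so the original octets are not recovered for the redundant form.
reencoded = text_redundant.encode("iso2022_jp")
print(reencoded == redundant)
```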
You'd also have to document this use for the codecs, because otherwise
somebody might do something really cool like make the codecs produce
canonical Unicode (ie, either maximally decomposed or maximally
composed representations). This would also make reversal ambiguous
for any encoding that provides both composed and decomposed forms of
characters.
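To illustrate, here is a sketch with a hypothetical codec that normalizes its output to NFC. The normalizing_decode helper is invented for this example, and cp1258 is chosen only because it has both a precomposed e-acute and a combining acute accent.

```python
import unicodedata

ENC = "cp1258"  # assumed example encoding with composed and decomposed forms

composed = "\u00e9".encode(ENC)      # one precomposed character
decomposed = "e\u0301".encode(ENC)   # base letter plus combining accent


def normalizing_decode(data: bytes) -> str:
    """Hypothetical codec behaviour: decode, then normalize to NFC."""
    return unicodedata.normalize("NFC", data.decode(ENC))


# Both byte sequences decode to the same canonical text...
same_text = normalizing_decode(composed) == normalizing_decode(decomposed)
print(same_text)

# ...so encoding back can reproduce at most one of the two originals.
roundtrip = normalizing_decode(decomposed).encode(ENC)
print(roundtrip == decomposed)
```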
-- 
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.