[Python-Dev] PEP 263 considered faulty (for some Japanese)
Stephen J. Turnbull
stephen@xemacs.org
12 Mar 2002 21:18:29 +0900
>>>>> "Guido" == Guido van Rossum <guido@python.org> writes:
Guido> [Not having one-octet ASCII as a subset] sets UTF-16 apart
Guido> from most other encodings, in particular UTF-8, but also (I
Guido> believe) the common Japanese 8-bit encodings like Shift-JIS
Guido> and EUC-JP.
This is correct; all of the encodings commonly used in Japan have the
property that one-octet ASCII is a subset (depending on how you define
"subset" for modal encodings like JUNET). I've never seen UTF-16 "in
the wild", but it's possible some groups do use it internally. But I
would expect that Python (with its well-organized codec interface)
would present a small problem compared to ordinary text editors
(including both Emacsen) and other commonly used applications. As far
as I know, none of the freely available recoding utilities (except GNU
recode and GNU iconv, which are not tuned to the Japanese environment)
support UTF-16. So it would be a very special environment.
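To make the "ASCII is a subset" property concrete, here is a minimal
sketch. It assumes a modern Python 3 with the standard CJK codecs, which
postdates this message, and the names SAMPLE, CODECS and ascii_is_subset
are made up for illustration. It checks only one sample string, so it
illustrates the property rather than proving it.

```python
# Check whether plain ASCII text encodes to the same single octets
# under each codec. "iso2022_jp" stands in for the modal JUNET encoding.
SAMPLE = "abcd"
CODECS = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8", "utf-16-le"]


def ascii_is_subset(codec):
    """True if SAMPLE encodes to the same bytes as it does in ASCII."""
    return SAMPLE.encode(codec) == SAMPLE.encode("ascii")


results = {codec: ascii_is_subset(codec) for codec in CODECS}
for codec, ok in results.items():
    print(codec, ok)
```

Only the UTF-16 entry should come out False, since it spends two octets
on every character.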
Guido> "abcd" interpreted as UTF-16 is a two-character Unicode
Guido> string (and I wouldn't be surprised if it contained invalid
Guido> code points).
Fear not, they're in the middle of the CJK block. The second is
invalid in Japanese, though.
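For anyone who wants to see where the octets land, a small sketch,
again assuming modern Python 3. RAW, decode_report and the block bounds
are illustrative names. The range U+4E00..U+9FFF is the CJK Unified
Ideographs block. The sketch does not check whether either character is
in a Japanese character set.

```python
# Decode the four ASCII octets of "abcd" as UTF-16 in both byte orders
# and report each resulting code point.
RAW = b"abcd"
CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block


def decode_report(codec):
    """Return (code point label, is-in-CJK-block) for each character."""
    text = RAW.decode(codec)
    return [(f"U+{ord(ch):04X}", CJK_START <= ord(ch) <= CJK_END)
            for ch in text]


report = {codec: decode_report(codec) for codec in ("utf-16-be", "utf-16-le")}
for codec, entries in report.items():
    print(codec, entries)
```

Either byte order gives two characters, and all four code points fall
inside the block.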
Guido> I think I can propose a compromise though: there may be two
Guido> default encodings, one used for Python source code, and one
Guido> for data.
Why go in this direction? It's better to allow each individual stream
to specify a codec to be implicitly applied, I think. Consider Emacs,
for example, which allows specification of default codecs for (1) file
contents (2) names of file system objects (3) process I/O (but not I
and O and E separately, which has caused problems!) (4) console input
and (5) console output. All of those are plausible candidates for
having separate defaults in Python as well.
For example, in Japan it's easy to imagine a program with local file
contents defaulting to UTF-8 (for cross-system portability) needing to
access the Windows 9x console and file system in Shift JIS, while
process (e.g., network) I/O might be EUC-JP if the server were Unix.
(Yes, I'm straining, but not much.)
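A rough sketch of that scenario, assuming modern Python 3's
io.TextIOWrapper as the per-stream codec mechanism (it did not exist
when this was written). The in-memory BytesIO objects are stand-ins for
a real file, console and socket, and the stream names are hypothetical.

```python
import io

# One codec per stream; each BytesIO stands in for a real byte sink.
streams = {
    "file": io.TextIOWrapper(io.BytesIO(), encoding="utf-8"),
    "console": io.TextIOWrapper(io.BytesIO(), encoding="shift_jis"),
    "network": io.TextIOWrapper(io.BytesIO(), encoding="euc_jp"),
}

text = "\u65e5\u672c\u8a9e"  # the three characters of "nihongo" (Japanese)

encoded = {}
for name, stream in streams.items():
    stream.write(text)   # the codec is applied implicitly on write
    stream.flush()       # push the encoded bytes to the underlying buffer
    encoded[name] = stream.buffer.getvalue()

for name, data in encoded.items():
    print(name, streams[name].encoding, data)
```

The same text goes to all three streams, and each one produces its own
byte sequence without the caller naming a codec at write time.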
But if you allow codecs for each stream, people who want to have
different defaults for certain classes of stream can just derive
classes which initialize the default codec appropriately.
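One way the derived-class idea might look, as a sketch only. It assumes
modern Python 3 and subclasses io.TextIOWrapper. The class names
DefaultCodecStream, ShiftJISConsole and EUCJPNetwork, the
default_encoding attribute and the encode_via helper are all invented
here and are not part of any proposal in this thread.

```python
import io


class DefaultCodecStream(io.TextIOWrapper):
    """Text stream whose codec defaults to a per-class setting."""

    default_encoding = "utf-8"

    def __init__(self, buffer, encoding=None, **kwargs):
        # An explicit encoding still wins over the class default.
        super().__init__(buffer,
                         encoding=encoding or self.default_encoding,
                         **kwargs)


class ShiftJISConsole(DefaultCodecStream):
    default_encoding = "shift_jis"


class EUCJPNetwork(DefaultCodecStream):
    default_encoding = "euc_jp"


def encode_via(stream_cls, text, **kwargs):
    """Write text through stream_cls and return the bytes it produced."""
    raw = io.BytesIO()
    stream = stream_cls(raw, **kwargs)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


sample = "\u65e5\u672c\u8a9e"
print(encode_via(ShiftJISConsole, sample))
print(encode_via(EUCJPNetwork, sample))
```

Each subclass only changes one class attribute, so adding a default for
another class of stream is a two-line change.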
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.