[I18n-sig] Japanese commentary on the Pre-PEP (2 of 4)

Brian Takashi Hooper brian@tomigaya.shibuya.tokyo.jp
Tue, 20 Feb 2001 19:16:09 +0900

Here's the second message, from Tamito Kajiyama, contributor of the SJIS
and EUC-JP codecs:


  On Sun, 11 Feb 2001 20:18:51 +0900
  Brian Takashi Hooper <brian@tomigaya.shibuya.tokyo.jp> wrote:

  > Hi there,
  > What does everyone think of the Proposed Character Model?

  I was also one of the people that Andy asked to contribute an opinion,
  so after reviewing the thread, here's what I have to say:

  I understand Paul's Pre-PEP as raising the following three points:

  1. Deprecate the usage of the present string type as containing a
  sequence of bytes, and instead interpret string literals as containing
  Unicode characters.  (Unify the present character strings and Unicode
  strings.)

  2. Introduce a new data type (byte strings) for expressing an 
  uninterpreted byte sequence.

  3. Add a convention for specifying the encoding of a source file.

  In Python 2.0, there are separate data types for non-Unicode
  strings and Unicode character strings.  Proposals 1 and 2 are
  essentially to replace these data types with the (Unicode)
  character sequence and byte sequence data types.

  Personally, I am opposed to proposals 1 and 2 for the
  following two reasons:

  (1) The string types in Python 2.0 and the new string types
  proposed in the pre-PEP have a relationship something like this:

       Python 2.0                      Pre-PEP
       string "" (byte sequence)       byte string b""
       Unicode string u""              string ""
         (Unicode character sequence)

  In general, the before- and after-PEP Pythons above have essentially no
  difference in expressiveness, and therefore it's hard to see what merit
  there might be in swapping the data types.

  On the other hand, I believe that swapping the byte sequence and
  character sequence data types as described above has several serious
  drawbacks for Japanese Python developers.

  Japanese programmers regularly need to handle legacy encodings such
  as EUC-JP and Shift JIS in their programs.  Converting back and forth
  between Unicode and legacy encodings introduces a significant cost in
  terms of resource usage and performance.  Moreover, there is the
  problem of incompatibilities between different Unicode conversion
  tables.  Furthermore, Japanese programmers are accustomed to dealing
  with Japanese strings as byte sequences, and Japanese users have a
  real motivation to manipulate Japanese character strings as sequences
  of bytes.  Regardless of whether Unicode is supported or not, a byte
  sequence data type is necessary in order to represent Japanese
  characters.

  The present implementation of strings in Python, where a string
  represents a sequence of bytes, is one feature that makes Python easy
  for Japanese developers to use.  Changing strings to contain Unicode
  character data would impose a heavy development and maintenance burden
  on Japanese Python programmers.  Therefore, I'm against swapping the
  byte string and character (Unicode) string types.

  (2) It is not always possible to unambiguously interpret string
  literals as Unicode character data.

  As you know, in Japanese-encoded byte strings, two bytes often
  represent one character, so positions within a string are expressed
  in terms of bytes, not characters.  Because of this, if a
  Japanese-encoded byte string is interpreted as-is as a Unicode
  character string, indexes into the string are no longer interpreted
  the same way.  For example, in the code snippet below the substring
  that is output differs depending on whether the string literal is
  interpreted as a byte sequence or a Unicode character sequence:

    s = "これは日本語の文字列です。"
    print s[6:12]

  Hard-coding slices as above is, I believe, a common practice.
  Paul has asserted that no serious problems will occur if existing
  byte sequences are interpreted as Unicode, but I disagree with
  him on this.
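To make the difference concrete, here is a small sketch in modern Python 3 syntax (where str is a Unicode character sequence and bytes is a separate byte sequence type, rather than the Python 2.0 semantics discussed in this thread).  The same slice [6:12] selects different text depending on whether it counts characters or bytes:

```python
# Sketch in Python 3: str slicing counts characters, bytes slicing counts bytes.
s = "これは日本語の文字列です。"   # "This is a Japanese string."
b = s.encode("euc-jp")            # the same text as an EUC-JP byte sequence

print(s[6:12])                    # character slice -> の文字列です
print(b[6:12].decode("euc-jp"))   # byte slice -> 日本語 (a different substring)
```

In EUC-JP each of these characters occupies two bytes, so byte offset 6 lands on the fourth character rather than the seventh.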

  For the above two reasons, I cannot agree with the pre-PEP's first
  two proposals (1 and 2).

 However, I believe the 3rd proposal to explicitly specify source file
 encoding is a necessary improvement, leaving aside for the moment the
 question of implementation.

  In Python 2.0, a program containing Shift-JIS-encoded Japanese strings
  may cause Python to raise parser errors.  As many of you may know, in
  Shift-JIS encoded strings the second byte of some Japanese characters
  may be a backslash (ASCII 0x5c), and this conflicts with backslash
  escaping in string literals.  As far as I know, this is also the case
  with the Chinese encoding Big 5.
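A concrete sketch (again in modern Python 3, purely for illustration): the common character 表 encodes in Shift JIS to the two bytes 0x95 0x5C, and that second byte is exactly the ASCII backslash a byte-oriented parser treats as the start of an escape sequence:

```python
# The second byte of some Shift JIS characters is 0x5C (ASCII backslash),
# which a byte-oriented parser mistakes for a string-literal escape.
ch = "表"                      # a well-known example of this problem
b = ch.encode("shift_jis")
print(b)                       # b'\x95\\' -- the trailing byte is a backslash
assert b[1:2] == b"\\"         # second byte is 0x5c
```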

  One way to solve this problem is to apply Ishimoto-san's Shift-JIS
  patch [1] to Python, but I feel that a more desirable solution is
  to allow Python itself to handle files with different source encodings.
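For what it's worth, this is essentially the convention Python itself later standardized in PEP 263: a magic comment on the first or second line names the source encoding, and the parser decodes the whole file with that codec before tokenizing, so multi-byte literals survive intact.  A minimal sketch:

```python
# -*- coding: utf-8 -*-
# A source-encoding declaration of the kind being discussed here; this
# magic-comment form is what Python later standardized in PEP 263.
# The parser decodes the file with the named codec before tokenizing,
# so the literal below is read as characters, not raw bytes.
s = "これは日本語の文字列です。"
print(len(s))                  # 13 characters in Python 3
```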

  However, the intent of Paul's 3rd suggestion seems directed at solving
  a different problem than that of allowing specification of an encoding
  for byte strings.  On the other hand, Marc-Andre's proposal [2] is to
  use the source file encoding only for decoding non-Unicode characters
  in character strings, without touching the contents of byte strings.
  While I prefer Marc-Andre's proposal, since it seems to be a
  straightforward extension of Python 2.0's current Unicode support, it
  doesn't address the aforementioned problem with the usage of Shift-JIS
  and Big 5 in Python programs.  Concerning this point, I think there is
  a need to start another discussion aside from Paul's proposal.

 [1] http://www.gembook.org/python/

 [2] http://mail.python.org/pipermail/i18n-sig/2001-February/000756.html


KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>