[I18n-sig] Japanese commentary on the Pre-PEP (2 of 4)

Brian Takashi Hooper brian@tomigaya.shibuya.tokyo.jp
Tue, 20 Feb 2001 19:16:09 +0900


Here's the second message, from Tamito Kajiyama, contributor of the SJIS
and EUC-JP codecs:

----

  On Sun, 11 Feb 2001 20:18:51 +0900
  Brian Takashi Hooper <brian@tomigaya.shibuya.tokyo.jp> wrote:

  > Hi there,
  > 
  > What does everyone think of the Proposed Character Model?

  I was also one of the people whom Andy asked to contribute an opinion,
  so after reviewing the thread, here's what I have to say:

  I understand Paul's Pre-PEP as raising the following three points:

  1. Deprecate the usage of the present string type as containing a
  sequence of bytes, and instead interpret string literals as containing
  Unicode characters.  (Unify the present character strings and Unicode
  strings.)

  2. Introduce a new data type (byte strings) for expressing an 
  uninterpreted byte sequence.

  3. Add a convention for specifying the encoding of a source file.

 In Python 2.0, there are separate data types for non-Unicode
 strings and Unicode character strings.  The proposals 1. and 2.
 are essentially to replace these data types with the (Unicode)
 character sequence and byte sequence data types.

 Personally, I am opposed to the proposals 1. and 2. for the
 following two reasons:

 (1) The string types in Python 2.0 and the new string types
 proposed in the pre-PEP have a relationship something like this:

      Python 2.0                                    Pre-PEP
      string "" (byte sequence)                     byte string b""
      Unicode string u"" (character sequence)       string ""

  In general, the before- and after-PEP Pythons above have essentially no
  difference in expressiveness, and therefore it's hard to see what merit
  there might be in swapping the data types.
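
  To make the swap concrete, here is a rough sketch of how the same data
  would be spelled before and after the pre-PEP (the b"..." literal is the
  pre-PEP's proposed syntax and does not exist in Python 2.0):

      # Python 2.0 today
      s = "abc"        # type 'str'      -- a sequence of bytes
      u = u"abc"       # type 'unicode'  -- a sequence of characters

      # Under the pre-PEP (hypothetical)
      # b = b"abc"     # byte string     -- the uninterpreted byte sequence
      # s = "abc"      # plain literal   -- now a sequence of Unicode characters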

 On the other hand, I believe that swapping byte sequence and character
 sequence data types as described above has several serious demerits for
 Japanese Python developers.

  Japanese programmers regularly need to handle legacy encodings such as
  EUC-JP and Shift-JIS in their programs.  Converting back and forth
  between Unicode and these legacy encodings introduces a significant cost
  in terms of resource usage and performance.  Moreover, there is the
  problem of incompatibilities between different Unicode conversion tables.
  Furthermore, Japanese programmers are accustomed to dealing with Japanese
  strings as byte sequences, and Japanese users have a real motivation to
  manipulate Japanese character strings that way.  Regardless of whether
  Unicode is supported or not, a byte sequence data type is necessary in
  order to represent Japanese characters.
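
  To illustrate the kind of round trip this implies, here is a minimal
  Python 2.0-style sketch.  It assumes a Japanese codec package (such as
  JapaneseCodecs) is installed; the codec name "euc-jp" and the file names
  are illustrative assumptions:

      # Read EUC-JP encoded bytes from disk
      data = open("report.euc", "rb").read()

      # If all text must be Unicode, every piece of Japanese text has to
      # be decoded on the way in ...
      text = unicode(data, "euc-jp")

      # ... and encoded again on the way out.
      open("report.out", "wb").write(text.encode("euc-jp"))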

  The present implementation of strings in Python, where a string represents
  a sequence of bytes, is one feature that makes Python easy for Japanese
 developers to use.  Changing strings to contain Unicode character data
 would impose a heavy development and maintenance burden on Japanese
 Python programmers.  Therefore, I am against swapping the byte string and
 character (Unicode) string types.

 (2) It is not always possible to unambiguously interpret string literals
 as Unicode character data.

  As you know, in Japanese-encoded byte strings, 2 bytes often represent
  1 character.  Positions within such strings are therefore expressed in
  terms of bytes, not characters.  Because of this, if a Japanese-encoded
  byte string is interpreted as-is as a Unicode character string, indexes
  into the string no longer mean the same thing.  For example, the code
  snippet below outputs a different substring depending on whether the
  string literal is interpreted as a byte sequence or as a Unicode
  character sequence:

    s = "これは日本語の文字列です。"
    print s[6:12]

  Hard-coding slices like the above is, I believe, a common practice.
  Paul has asserted that no serious problems will occur if existing byte
  sequences are interpreted as Unicode, but I disagree with him on this
  point.
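
  To make the difference concrete, here is a small sketch showing both
  interpretations of that slice.  It assumes the source text is saved in
  EUC-JP and that an EUC-JP codec (e.g. from the JapaneseCodecs package)
  is available; the codec name "euc-jp" is an assumption:

      s = "これは日本語の文字列です。"     # 13 characters, 26 bytes in EUC-JP
      print s[6:12]                        # byte slice: prints "日本語"
      u = unicode(s, "euc-jp")             # decode the bytes to characters
      print u[6:12].encode("euc-jp")       # character slice: prints "の文字列です"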

 Due to the above two reasons, I cannot agree with the pre-PEP's first
  two proposals (1. and 2.).

 However, I believe the 3rd proposal to explicitly specify source file
 encoding is a necessary improvement, leaving aside for the moment the
 question of implementation.

  In Python 2.0, if a program containing Japanese strings is written in
  Shift-JIS, Python may raise parser errors.  As many of you may know,
  in Shift-JIS the second byte of some Japanese characters is 0x5c, the
  ASCII code for the backslash, and this conflicts with backslash escaping
  in string literals.  As far as I know, this is also the case with the
  Chinese encoding Big5.
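
  A minimal demonstration of the problem, written with explicit byte
  escapes so that it runs regardless of the source file's own encoding
  (0x95 0x5c are the Shift-JIS bytes for the character "表"):

      sjis_hyou = "\x95\x5c"      # Shift-JIS bytes for "表"
      print repr(sjis_hyou)       # -> '\x95\\' -- the second byte is "\"

  If those same two bytes appear literally inside a quoted string in a
  Shift-JIS source file, the tokenizer treats the 0x5c as the start of an
  escape sequence, which is how the parse errors arise.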

  One way to solve this problem is to apply Ishimoto-san's Shift-JIS
 patch [1] to Python, but I feel that a more desirable solution is
 to allow Python itself to handle files with different source encodings.
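
  What such support might look like is sketched below; the declaration
  syntax (an Emacs-style file variable) and the codec name "shift_jis"
  are purely illustrative assumptions, not something proposed in the
  pre-PEP:

      # -*- coding: shift_jis -*-
      # With a declared source encoding, the parser could decode the file
      # (or at least its string literals) before tokenizing, so a 0x5c
      # second byte would no longer be mistaken for a backslash escape.
      s = "これは日本語の文字列です。"
      print s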

  However, Paul's 3rd suggestion seems to be directed at solving a
  different problem from that of specifying an encoding for byte strings.
  Marc-Andre's proposal [2], on the other hand, is to use the source file
  encoding only for decoding the non-Unicode characters in character
  strings, without touching the contents of byte strings.  While I prefer
  Marc-Andre's proposal since it seems to be a straightforward extension
  of Python 2.0's current Unicode support, it does not address the
  aforementioned problem with the use of Shift-JIS and Big5 in Python
  programs.  Concerning this point, I think a separate discussion is
  needed, apart from Paul's pre-PEP.

 [1] http://www.gembook.org/python/
     http://www.gembook.org/python/python20-sjis-20001202.zip

 [2] http://mail.python.org/pipermail/i18n-sig/2001-February/000756.html

----------------------------------------------------------------------

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>