[I18n-sig] PEP 263 and Japanese native encodings

Thu, 7 Mar 2002 14:32:52 +0900

martin@v.loewis.de (Martin v. Loewis) writes:
| 
|   An ASCII string compatible encoding (character set) is a superset of
|   the ASCII encoding (character set) in which octets from set AS are
|   only used to represent ASCII characters and not used in a series of
|   bytes that represent a multibyte character (such as Kanji and
|   Hiragana). The set AS is defined as
| 
|   AS = [\r\n\\'"] (carriage return, linefeed, backslash, single/double quote)
| 
| The rationale here is that, under the PEP, non-ASCII text may only
| appear in comments and strings. The lexer needs the ASCII-compatible
| property to determine the end-of-line and end-of-string markers,
| at least in the phase-1 implementation.
|
| > o Are three Japanese native encodings EUC-JP, Shift_JIS and
| >   ISO-2022-JP "ASCII compatible"?
| 
| EUC-JP certainly is;

Absolutely.

| ISO-2022-JP probably isn't.

Right, ISO-2022-JP is not ASCII compatible in the sense of your
definition.  It uses " and ' to represent both ASCII and JIS X
0208-1983 (Kanji, Hiragana, and so on).  For example, an
ISO-2022-JP representation of u"\u3042" (the first character of
Hiragana) contains a double quote mark:

  >>> u"\u3042".encode("japanese.iso-2022-jp")
  '\033$B$"\033(B'

(FYI: the first escape sequence \033$B is the mark that says the
following bytes represent a series of JIS X 0208-1983 characters.
The second \033(B has a similar meaning for ASCII.)

| I cannot see the problem with Shift_JIS;

Shift_JIS is not ASCII compatible in a similar way.  It uses the
backslash octet (0x5C) as the second byte of some two-byte
characters.  Here is another example:

  >>> u"\u8868".encode("japanese.sjis")
  '\225\\'

This is a well-known and highly annoying problem with Python in
Japanese Windows environments, where Shift_JIS is the system's
default encoding.  There is a patch for Python that specifically
fixes this problem.
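
To illustrate the failure mode, here is a minimal sketch of a
byte-wise scan for the end of a string literal.  It is not the
actual Python lexer and not the patch mentioned above;
find_string_end is a hypothetical helper, and the codec names
"shift_jis" and "euc_jp" are the ones shipped with current Python 3
rather than the "japanese.*" codecs used in the sessions here.

```python
def find_string_end(data, start):
    """Byte-wise scan for the closing double quote of a string literal.

    data[start] is assumed to be the opening quote.  Returns the index
    of the closing quote, or -1 if the scan runs off the end.
    """
    i = start + 1
    while i < len(data):
        if data[i] == 0x5C:      # backslash: skip the "escaped" byte
            i += 2
        elif data[i] == 0x22:    # unescaped double quote ends the string
            return i
        else:
            i += 1
    return -1


# The same one-character literal "<U+8868>" in two encodings.
sjis_literal = b'"' + "\u8868".encode("shift_jis") + b'"'
eucjp_literal = b'"' + "\u8868".encode("euc_jp") + b'"'

print(find_string_end(sjis_literal, 0))    # -1: trailing 0x5C hides the quote
print(find_string_end(eucjp_literal, 0))   # 3: closing quote found
```

In the Shift_JIS case the second byte of the character is taken for
an escape, so the closing quote is skipped; in the EUC-JP case every
byte of the character is 0x80 or above and the scan is unaffected.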

So the exact definition of an ASCII compatible encoding is very
important, since it determines whether Shift_JIS and ISO-2022-JP
are accepted.  I believe other Asian native encodings are in a
situation similar to that of the two Japanese encodings.
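
The quoted definition can be checked mechanically for individual
characters.  The following is only a rough sketch: as_octets is a
hypothetical helper, it tests the two sample characters from this
message rather than a whole character set, and it assumes the
Python 3 standard codec names "euc_jp", "shift_jis" and
"iso2022_jp".

```python
# Octets in the set AS from the quoted definition:
# CR, LF, backslash, single quote, double quote.
AS = frozenset(b"\r\n\\'\"")


def as_octets(text, encoding):
    """Return the AS octets that occur in the encoded form of `text`.

    `text` is assumed to contain only non-ASCII characters, so any
    AS octet found is part of a multibyte sequence.
    """
    return bytes(b for b in text.encode(encoding) if b in AS)


sample = "\u3042\u8868"   # the two characters used in the examples here
for name in ("euc_jp", "shift_jis", "iso2022_jp"):
    print(name, as_octets(sample, name))
```

For this sample the result is empty for EUC-JP, a backslash for
Shift_JIS and a double quote for ISO-2022-JP, which matches the
three answers given above.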

I don't want the PEP to exclude the two widely used Japanese
encodings, especially Shift_JIS.  I think the only acceptable
requirement for an ASCII compatible encoding is the property
that it can represent the first two lines of comments using only
ASCII characters.  Any stricter requirement would make the two
Japanese encodings not ASCII compatible.
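
As a sketch of what that weaker requirement could look like in
practice, the check below accepts a source file if its first two
lines are plain ASCII.  header_is_plain_ascii is a hypothetical
name, and treating the ESC octet as non-ASCII (so that ISO-2022-JP
shift sequences are rejected in the header) is my own assumption,
not something the PEP states.

```python
def header_is_plain_ascii(source, max_lines=2):
    """Return True if the first `max_lines` lines hold only plain ASCII.

    A byte counts as plain ASCII if it is below 0x80 and is not ESC
    (0x1B), which ISO-2022-JP uses to switch character sets.
    """
    header = source.split(b"\n", max_lines)[:max_lines]
    return all(b < 0x80 and b != 0x1B for line in header for b in line)


ok_source = (b"# -*- coding: shift_jis -*-\n# demo\ns = '"
             + "\u8868".encode("shift_jis") + b"'\n")
sjis_header = "# \u8868\n".encode("shift_jis") + b"x = 1\n"
jis_header = "# \u3042\n".encode("iso2022_jp") + b"x = 1\n"

print(header_is_plain_ascii(ok_source))     # True
print(header_is_plain_ascii(sjis_header))   # False
print(header_is_plain_ascii(jis_header))    # False
```

Splitting on the linefeed octet is safe for the three encodings
discussed here, because none of them uses 0x0A inside a multibyte
character.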

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>