[I18n-sig] PEP 263 and Japanese native encodings

Tamito KAJIYAMA kajiyama@grad.sccs.chukyo-u.ac.jp
Thu, 7 Mar 2002 19:15:22 +0900

martin@v.loewis.de (Martin v. Loewis) writes:
| > Shift_JIS is not ASCII compatible in a similar way.  It uses
| > backslash as a second byte.  Here is another example:
| > 
| >   >>> u"\u8868".encode("japanese.sjis")
| >   '\225\\'
| I see. I missed the part that the second byte can be in the range
| 0x40-0xFC. If I understand the problem correctly, the quotation
| characters (", ') can *not* appear as the second byte, right?


| Also, there is a total of 60 characters that end in byte \x5C;

Not right.  In JIS X 0208-1983 (6877 characters) there are 37
characters that end in byte \x5C.
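
For anyone who wants to check the count, here is a rough sketch.
It assumes a modern Python 3 and its standard "shift_jis" codec,
not the "japanese.sjis" codec quoted above, and the helper name
chars_ending_in_backslash is only illustrative.  It tries every
possible lead byte with 0x5C as the trail byte and keeps the pairs
that decode to a single character.

```python
# Sketch: list the double-byte Shift_JIS characters whose second
# (trail) byte is 0x5C, i.e. the ASCII backslash.
# Assumption: Python 3 with the standard library "shift_jis" codec.
TRAIL = 0x5C


def chars_ending_in_backslash(codec="shift_jis"):
    """Return the characters encoded as <lead byte> 0x5C in `codec`."""
    found = []
    for lead in range(0x80, 0x100):
        try:
            text = bytes([lead, TRAIL]).decode(codec)
        except UnicodeDecodeError:
            continue  # not a valid lead byte, or an unassigned cell
        # Single-byte katakana (0xA1-0xDF) followed by a backslash
        # decodes to two characters, so keep only true double-byte hits.
        if len(text) == 1:
            found.append(text)
    return found


affected = chars_ending_in_backslash()
# ascii() keeps the output printable on consoles without Japanese fonts.
print(len(affected), ascii("".join(affected)))
```

The resulting list includes U+8868 ("table"), the character from
the earlier example.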

| and those will only cause a problem if immediately followed by
| a quoting character.

You've described only the condition for a syntax error; a backslash
as the second byte also causes run-time problems when it is
followed by other characters.  Let's consider the following
example.  The byte sequence shown below represents the content
of a string literal in a Shift_JIS encoded source file.  Its
Unicode representation is u"\u88681\u53C2\u7167" ("See Table 1"
in Japanese).

  95 5C 31 8E 51 8F C6

Now, the second byte is a backslash, so Python reads it together
with the third byte ("1") as the octal escape "\1" and replaces
both bytes with the single byte 01.  The string literal thus gets
the following wrong value:

  95 01 8E 51 8F C6
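
The corruption can be reproduced with a short sketch.  It assumes
Python 3 and the standard "shift_jis" codec, and it uses the
"unicode_escape" codec only as a stand-in for a parser that
processes backslash escapes byte by byte without knowing the
source encoding; it is not the actual tokenizer code.

```python
import codecs

# "See Table 1" in Japanese, as in the example above.
text = "\u88681\u53C2\u7167"

# Bytes as they would appear in a Shift_JIS encoded source file.
raw = text.encode("shift_jis")
print(" ".join(f"{b:02X}" for b in raw))      # 95 5C 31 8E 51 8F C6

# Stand-in for byte-level escape processing: "unicode_escape"
# treats the input as Latin-1 text and interprets backslash
# escapes, so 5C 31 ("\1") collapses into the single byte 01.
mangled = codecs.decode(raw, "unicode_escape").encode("latin-1")
print(" ".join(f"{b:02X}" for b in mangled))  # 95 01 8E 51 8F C6
```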

| Do you think those 60 characters would cause a problem in real life?

Yes, absolutely.

| Or is that a problem that only exists on paper?

No.  Suppose that you could not put common English words like
"table", "reserve", "ten" and "paste" in string literals; such
a restriction would not be acceptable at all, right? :-)

| > This is a well-known and highly annoying problem of Python in
| > Japanese Windows environment in which Shift_JIS is the system's
| > default encoding.  There is a patch for Python specifically
| > fixing this problem.
| A patch specifically designed for Shift_JIS probably is not acceptable
| to Python. A patch solving the general problem (in some way) may be.

Yes, I think so too.  The patch I mentioned is a localization
patch, not intended to be merged into the Python core.

| > I don't want the PEP to exclude the two widely used Japanese
| > encodings, especially Shift_JIS.
| Then you need to propose an implementation strategy, and that strategy
| should *not* be "special-case Shift_JIS", and it also should not be
| "use the C library's multibyte functions".

I think that Marc-Andre's intent for ASCII compatibility
(i.e., ASCII compatible encodings should be able to represent
the first two lines of comments only by ASCII characters) is
good enough.  It appears that his requirement has no problem
with regard to the implementation strategy described in the PEP
(revision 1.9) *and* Japanese encodings.  IMHO, the ASCII
compatibility simply should not impose other requirements.


KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>