[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki@acm.org
Sat, 16 Mar 2002 16:25:05 JST


In message <m34rjj12l7.fsf@mira.informatik.hu-berlin.de>
> > And almost every operating system in Japan is on the way to
> > adopt Unicode to save us from the chaos.  I am afraid the
> > mileage of the PEP will be fairly short and just results in
> > loading a lot of burden onto the language, 
> 
> That is a mis-perception. The PEP does not add a lot of burden onto
> the language; the stage-1 implementation is fairly trivial. The
> biggest change in the interpreter will be to have the parser operate
> on Unicode all the time; that change will be necessary stage 2,
> whether UTF-8 will be the default encoding or not.

I see.  After all, the Java compiler already performs such a task.

But I wonder about the codecs for the various encodings of the
various countries.  Mainland China, Taiwan, Korea, and Japan each
have their own encoding(s), and each needs its own large table(s)
covering truly many characters.  Will this not make the interpreter
huge?  And once UTF becomes the norm, will they not become a huge
legacy?

Maybe the local codecs should each be packed into a so-called
Country-Specific Package, which could be optional in the Python
distribution.  I believe you have considered such a thing already.
In addition, I see that this problem does not relate to PEP 263
itself in the strict sense.  The PEP just makes use of whatever
codecs happen to be there, requiring only that their names match
those used by Emacs, doesn't it?
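As an aside, in a modern Python (where the CJK codecs are bundled in
the standard distribution) the codec registry already normalizes
names, so Emacs-style spellings resolve naturally.  A minimal sketch:

```python
import codecs

# Name lookup is case-insensitive and treats '-' like '_', so
# Emacs-style spellings map onto Python's canonical codec names.
print(codecs.lookup('EUC-JP').name)      # euc_jp
print(codecs.lookup('Shift_JIS').name)   # shift_jis
```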

> Also, the warning framework added will help people migrating - whether
> towards UTF-8 or custom locally-declared encodings is their choice.

As for declared encodings, I have one thing to say.  And this is
another point where THE CURRENT PEP IS FAULTY FOR SOME JAPANESE.
(It relates to UTF-16 again. *sigh*)

In short:

If the current PEP recognizes the UTF-8 BOM, why does it not allow
UTF-16 _with_ a BOM?  The implementation would be very trivial.
UTF-16 with a BOM is becoming somewhat popular among casual users in
Japan.

In long:

It is true that many Japanese developers do not use UTF-16 at all
(and may even be suspicious of anyone who talks about using it ;-).
However, the rest of us certainly do use UTF-16 sometimes.  You can
edit UTF-16 files with, say, jEdit (www.jedit.org) on many
platforms, including Unix and Windows.  In particular, you can use
TextEdit on the Mac: TextEdit on Mac OS X is the counterpart of
Notepad and WordPad on Windows.

UTF-16 is typically 2/3 the size of UTF-8 when many CJK characters
are used (each of them takes 3 bytes in UTF-8 but 2 bytes in
UTF-16).  And on Japanese Macs in particular, it is better supported
than other plain-text encodings.  In the default setting, TextEdit
saves a plain-text file in Shift_JIS or UTF-16.  Shift_JIS lacks
several important characters that are used in real life (most
notably, several kanji used in some surnames... yes, there is a fair
motivation among casual Japanese users to adopt Unicode!).  Once a
non-Shift_JIS character appears in the file, it will be saved in
UTF-16 by default (not UTF-8, regrettably; perhaps partly because of
the "mojibake" problem).
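The 2/3 ratio is easy to check in a modern Python.  A minimal sketch
(the figures assume BMP kanji, which take 3 bytes in UTF-8 and 2 in
UTF-16, and ignore the 2-byte BOM):

```python
# Three kanji ("Nihongo"), repeated to make the ratio visible.
text = u'\u65e5\u672c\u8a9e' * 100
utf8_size = len(text.encode('utf-8'))       # 3 bytes per character
utf16_size = len(text.encode('utf-16-le'))  # 2 bytes per character
print(utf8_size, utf16_size)                # 900 600 -- the 2/3 ratio
```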

Now the iBook, iMac, and PowerMac are fairly popular among casual
users in Japan; they are almost always within the top 10 in PC
sales.  Since Mac OS X has recently become the default, casual users
are quite likely to end up writing their scripts in UTF-16 with
TextEdit.

# Since TextEdit has key bindings similar to Emacs, even power users
# may want to use it to edit their scripts.  Indeed, I do so.

By the way, I previously reported another problem that may make
PEP 263 faulty.  There was a project in which one had to operate on
certain fixed-length texts in utf-16-be; in the Java world such data
are not so rare.  That was where that encoding was used as the
default.  But I now see it would be reasonable not to depend on a
default in such cases.  Anyway, one could say that was a special
case...

But this is not so.  UTF-16 files are becoming popular among a
not-so-small number of Mac users in Japan.  The easy availability of
various characters that are found not in classic JIS but in Unicode
does attract some Japanese users.  (Look into several Mac magazines
in Japan, if you can.)

As the programming language for everyone, it would be very nice for
Python to accept such scripts.  I believe orthogonality has also
been one of the most important virtues of Python.

The implementation would be fairly straightforward: if the file
begins with either 0xFE 0xFF or 0xFF 0xFE, it must be UTF-16.
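A minimal sketch of such a check (modern Python; the function name
is hypothetical, not from the PEP):

```python
# Hypothetical sketch of the proposed check: peek at the first two
# bytes of a source file; a byte order mark implies UTF-16.
def detect_utf16_bom(first_bytes):
    """Return the UTF-16 variant implied by a leading BOM, or None."""
    if first_bytes.startswith(b'\xfe\xff'):
        return 'utf-16-be'          # big-endian BOM
    if first_bytes.startswith(b'\xff\xfe'):
        return 'utf-16-le'          # little-endian BOM
    return None                     # no BOM: fall back to the PEP's rules

print(detect_utf16_bom(b'\xff\xfe#\x00'))   # utf-16-le
```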


> > I know it is not the best practice either.  However, you cannot
> > safely write Shift_JIS into Python source file anyway, unless
> > you hack up the interpreter parser itself for now.  
> 
> In stage 2 of the PEP, this will be possible (assuming Python has
> access to a Shift_JIS codec).

Yes, I have appreciated the PEP for this very possibility.  We will
even be able to use ISO-2022-JP.
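A minimal sketch of what stage 2 would do for a declared Shift_JIS
source (modern Python; the explicit decode here stands in for the
parser's behavior):

```python
# 's = "<Nihongo>"' with the string literal encoded in Shift_JIS
# (0x93FA 0x967B 0x8CEA are the Shift_JIS codes for those kanji).
raw = b'# -*- coding: shift_jis -*-\ns = "\x93\xfa\x96\x7b\x8c\xea"\n'
decoded = raw.decode('shift_jis')   # stage 2: the parser sees Unicode text
print(u'\u65e5\u672c\u8a9e' in decoded)    # True
```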

If stage 2 comes soon, within a year, and is highly stable, it may
be EXTREMELY useful in Japan.  Otherwise, I am afraid it might turn
out to be a well-meant disservice...
(Maybe only the UTF/Unicode codecs will be relevant by then...)

--
SUZUKI Hisao <suzuki@acm.org> "Bye bye, Be!"