[Python-Dev] PEP 263 considered faulty (for some Japanese)

Stephen J. Turnbull stephen@xemacs.org
18 Mar 2002 11:02:34 +0900


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    Martin> "SUZUKI Hisao" <suzuki@acm.org> writes:

    >> The PEP just makes use of codecs which happen to be there, only
    >> requiring that each name of them must match with that of Emacs,
    >> doesn't it?

    Martin> Correct. I think the IANA "preferred MIME name" for the
    Martin> encoding should be used everywhere; this reduces the need
    Martin> for aliases.

Emacs naming compatibility is of ambiguous value in the current form
of the PEP, since the cookie only applies to Unicode string literals.
The Emacs coding cookie, by contrast, applies to the whole file.  This
means that to implement a Python mode that allows (e.g.) hexl mode on
ordinary string literals but regular text mode on Unicode string
literals, Emacs must _ignore_ Python coding cookies!
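
To make the mismatch concrete, consider a hypothetical module saved in
EUC-JP and carrying the cookie below (the strings are only
illustrative).  Under the PEP as drafted, Python decodes only the
u"..." literal per the declared encoding; Emacs, honoring the same
cookie, decodes the entire buffer:

    # -*- coding: euc-jp -*-
    s = "日本語"     # ordinary literal: Python keeps the raw EUC-JP bytes
    u = u"日本語"    # Unicode literal: Python decodes it per the cookie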

True, in the usual case programmers will find it convenient to have
both ordinary string literals and Unicode string literals decoded to
text in Emacs buffers.  But that very convenience means this PEP
serves to perpetuate the use of ordinary string literals in localized
applications.

Probably more so than it encourages use of Unicode literals, IMO.  :-(

    Martin> Also, I'm in favour of exposing the system codecs (on
    Martin> Linux, Windows, and the Mac); if that is done, there may
    Martin> be no need to incorporate any additional codecs in the
    Martin> Python distribution.

XEmacs just did this on Windows; it was several man-months of work,
and required a new API.  If by "expose" you mean their APIs, then
there will need to be a set of Python codec wrappers for these, at
least.
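
For concreteness, here is a minimal sketch (written against today's
codecs registry) of the shape such a wrapper could take; the
_system_encode()/_system_decode() functions are purely hypothetical
stand-ins for whatever the platform API turns out to be:

    import codecs

    def _system_encode(text, codepage):
        # Hypothetical binding to the platform converter; not a real API.
        raise NotImplementedError

    def _system_decode(data, codepage):
        # Hypothetical binding to the platform converter; not a real API.
        raise NotImplementedError

    def _search(name):
        # Route names like "system.cp932" to the platform converter.
        if not name.startswith("system."):
            return None
        codepage = name[len("system."):]
        def encode(text, errors='strict'):
            return _system_encode(text, codepage), len(text)
        def decode(data, errors='strict'):
            return _system_decode(bytes(data), codepage), len(data)
        return codecs.CodecInfo(encode, decode, name=name)

    codecs.register(_search)
    # u"...".encode("system.cp932") would now go through the wrapper.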

    >> UTF-16 is typically 2/3 the size of UTF-8 when many CJK
    >> characters are used (each of them is 3 bytes in UTF-8 and 2
    >> bytes in UTF-16).

    Martin> While I see that this is a problem for arbitrary Japanese
    Martin> text,

Yes, but ordinary Japanese text is already like English: maybe three
bits of content per byte.  There are big savings to be had from
explicit compression or compressing file systems.  Or simply from
abolishing .doc files.<wink>

    Martin> I doubt you will find the 2/3 ratio for Python source code
    Martin> containing Japanese text in string literals and comments.

No, in fact it's more likely to be 3/2.
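
A quick back-of-the-envelope check (written against a current Python
just for the arithmetic; the strings are made up):

    kanji = u"日本語"                       # pure CJK text, 3 characters
    print(len(kanji.encode("utf-8")))       # 9 bytes: 3 per character
    print(len(kanji.encode("utf-16-le")))   # 6 bytes: 2 per character

    source = u'print(u"日本語")'            # a mostly-ASCII source line
    print(len(source.encode("utf-8")))      # 19 bytes: ASCII stays 1 byte
    print(len(source.encode("utf-16-le")))  # 26 bytes: every character is 2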

    Martin> For example, the parser currently uses fgets to get the
    Martin> next line of input.

Well, fgets should go away anyway.  Experience in XEmacs shows that
except for large files (10^6 bytes or more), multiple layers of codecs
are not perceptible to users.  So if we implement phase 2 as "the
parser speaks UTF-8", you glue a UTF-16 codec on the front that reads
from the file, and the parser reads from a buffer containing UTF-8.
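
A minimal sketch of that layering (written against a current Python;
the filename and parse_utf8_line() are hypothetical stand-ins):

    import codecs

    def parse_utf8_line(line_bytes):
        # Stand-in for a parser that "speaks UTF-8".
        return line_bytes.decode("utf-8")

    with open("module_utf16.py", "rb") as raw:
        # Glue a UTF-16 decoder onto the raw file; the parser only
        # ever sees UTF-8.
        reader = codecs.getreader("utf-16")(raw)
        for line in reader:
            parse_utf8_line(line.encode("utf-8"))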

Applications where this overhead matters can use UTF-8 in their source
files, and the parser can use fgets to read from them.


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.