[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki611@oki.com
Tue, 12 Mar 2002 19:57:35 +0900


Thank you for reading my message.

> Is your objection specifically focused on UTF-16?  As far as I
> understand, UTF-16 is (mostly) a two-byte encoding, that is not a
> superset of ASCII (i.e. the 8-bit string "abcd", when interpreted
> using UTF-16, does not mean the same thing as the Unicode string
> u"abcd").  This sets UTF-16 apart from most other encodings, in
> particular UTF-8, but also (I believe) the common Japanese 8-bit
> encodings like Shift-JIS and EUC-JP.

Yes, UTF-16 is (mostly) a two-byte encoding and is not a
superset of ASCII.  We use ISO-2022-JP, EUC-JP, UTF-8 and
Shift_JIS _where_ some degree of ASCII compatibility is needed
(for example, in e-mail messages and program source code).
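
To make the ASCII-compatibility point concrete, here is a minimal
sketch.  It assumes a present-day Python 3 interpreter (not the
Python 2.2 used elsewhere in this message), and the helper name
is_ascii_compatible is made up for illustration only.

```python
# Sketch (Python 3 assumed): which encodings leave plain ASCII text
# byte-for-byte unchanged?  The codec names are standard-library ones.
ASCII_TEXT = "abcd"


def is_ascii_compatible(encoding, text=ASCII_TEXT):
    """Return True if `text` encodes to the same bytes as in ASCII.

    Hypothetical helper; it only tests the given sample text.
    """
    return text.encode(encoding) == text.encode("ascii")


results = {
    name: is_ascii_compatible(name)
    for name in ("iso2022_jp", "euc_jp", "utf_8", "shift_jis", "utf_16_be")
}

for name, compatible in results.items():
    print(name, compatible)
```

Only the UTF-16 codec changes the bytes of "abcd", because it
writes every character as a two-byte unit.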

In addition, we _often_ have to handle Japanese documents
written in UTF-16.  They are produced sometimes by Java
programs and sometimes by text editors.  Some of us currently
use Unicode mainly to handle such documents.

> You write "set the default encoding".  There are many ways to set a
> default encoding.  Python has a very specific way to set its default
> encoding: the only way is to edit the site.py library module.  Is this
> what you are referring to?

Yes.

> I would think that setting Python's default encoding to UTF-16 in this
> way is a bad idea, because it breaks the main purpose of the default
> encoding: to allow an automatic coercion from the 8-bit strings that
> are used in many places in Python programs to Unicode strings.  
[...]
> For this reason, I find it hard to believe that people really set the
> Python default encoding in site.py to "utf-16".  Maybe I'm wrong -- or
> maybe you're talking about a different default encoding?

What we handle in Unicode with Python is often a document file
in UTF-16.  The default encoding is mainly applied to data read
from such documents.  Certainly we use EUC-JP etc. in Python
scripts, but mostly in comments and the like.

For the time being, setting the default to UTF-16 is often a
handy way to handle Unicode.

> It sounds like these people never rely on the implicit conversion
> between 8-bit strings and Unicode as I showed above, but instead use
> explicit conversions from data read from Unicode files, omitting the
> encoding.  So maybe you really do mean what I fear (setting Python's
> default encoding to UTF-16 in site.py).

Yes, that is what I mean.  Please note that
u'<whatever-in-ascii>' is interpreted literally, whatever the
default encoding is, and that for now we cannot legally put
Japanese characters in string literals anyway.

Python 2.2 (#1, Jan 16 2002, 12:05:05) 
[GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf_16_be'
>>> u'abc'
u'abc'
>>> unicode("\x00a\x00b\x00c")
u'abc'
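
For comparison, the same conversion can be written with an
explicit codec name, so that no default encoding is consulted at
all.  This is only a sketch and assumes a present-day Python 3
interpreter rather than the 2.2 session above; the sample bytes
and variable names are illustrative.

```python
# Sketch (Python 3 assumed): decode UTF-16 document data explicitly
# instead of relying on the interpreter's default encoding.

# Bytes as they would appear in a big-endian UTF-16 document (no BOM).
raw = b"\x00a\x00b\x00c"
text = raw.decode("utf_16_be")

# The same works for Japanese text: U+65E5 U+672C.
japanese_raw = b"\x65\xe5\x67\x2c"
japanese_text = japanese_raw.decode("utf_16_be")

# Encoding again with the same codec restores the original bytes.
round_trip = japanese_text.encode("utf_16_be")
```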


> >    I would propose that Python should default to ASCII as
> > standard encoding if no other encoding hints are given, as the
> > bottom line.  The interpreter's default encoding should not be
> > referred for source code.
> Unfortunately, this doesn't work for people in Europe, who set Latin-1
> as the default encoding, and want to use Latin-1 in their source
> files.

Nor does it work for some others of us in Japan, who set EUC-JP
as the default encoding and want to use EUC-JP in their source
files.

> I think I can propose a compromise though: there may be two default
> encodings, one used for Python source code, and one for data.
> Normally, a single default encoding is used for both cases, but it is
> possible to specify two different defaults, and then persons who like
> UTF-16 can set ASCII as the default source encoding but UTF-16 as the
> default data encoding.

That sounds very nice.  I understand that the default data
encoding will be applied to what is read from file objects.  It
must be the only(?) satisfactory solution if the default source
encoding is to be set in site.py.
# Or else we should give up the default encoding for data...

> >    And I hope that Python defaults to UTF-8 as standard encoding
> > if no other encoding hints are given.  It is ASCII-compatible
> > perfectly and language-neutral.  If you once commit yourself to
> > Unicode, I think, UTF-8 is an obvious choice anyway.
> I'm not sure I understand.  (I understand UTF-8 perfectly well. :-)
> In the previous paragraph you propose to default to ASCII.  In this
> paragraph you propose to default to UTF-8.  Now which do you want?  Or
> do you want to propose these two for different situations?

I'm sorry for the ambiguity.
I proposed ASCII as the _minimum_ request.  What I _hope_ for is
UTF-8.

> Note that I originally wanted to use UTF-8 as the default encoding,
> but was convinced otherwise by the Europeans who rarely use UTF-8 but
> often Latin-1: but rather than giving anyone preferential treatment
> (except for the Americans, whose keyboards don't have keys that
> generate non-ASCII characters :-), I decided that the only fair
> solution was to default to ASCII, which has the property that any
> non-ASCII characters are considered an error.  But of course, the
> option to edit site.py sort of defeats this purpose. :-)

ASCII can express, I believe, _only_ English and classical Latin
well.  It would be safe to say that it is unfair to everyone in
the world except English speakers.

Once we are committed to Unicode, and if ASCII compatibility is
mandatory, UTF-8, which is language-neutral, seems to be the
only solution that is fair to everyone.
# Of course, it might not be so if committed to ISO-2022...

> >    From my experiences, inserting the '-*- coding: <coding name>
> > -*-' line into an existing file and converting such a file into
> > UTF-8 are almost the same amount of work.
> Yes, for those people who have a UTF-8 toolchain set up.  I expect
> that many Europeans don't have one handy, because their needs are met
> by Latin-1.

Writing a converter from Latin-1 to UTF-8 is an easy exercise in
Python programming.  For a UTF-8 editor, IDLE on Tcl/Tk 8.3 may
be handy.
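
Such a converter might look roughly like the sketch below.  It
assumes a present-day Python 3 interpreter, and the function
names latin1_to_utf8 and convert_file are made up for
illustration only.

```python
# Sketch (Python 3 assumed): re-encode a Latin-1 file as UTF-8.


def latin1_to_utf8(data):
    """Return the UTF-8 encoding of Latin-1 encoded bytes."""
    # Every byte value is valid Latin-1, so decoding cannot fail.
    return data.decode("latin-1").encode("utf-8")


def convert_file(src_path, dst_path):
    """Read src_path as Latin-1 bytes and write dst_path as UTF-8."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(data))
```

If the file already carries a coding declaration, that line would
have to be updated by hand as well; the sketch does not do so.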

Those who want to use Latin-1 in the source code can always
specify '-*- coding: latin-1 -*-'.
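
As a rough illustration of how such a declaration behaves, here
is a sketch for a present-day Python 3 interpreter, in which the
declaration proposed by PEP 263 is honoured.  The sample source
bytes and variable names are made up for the example.

```python
# Sketch (Python 3 assumed): a source file that declares Latin-1.
import io
import tokenize

# One non-ASCII byte (0xE9) appears inside the string literal.
source = b"# -*- coding: latin-1 -*-\nname = 'Andr\xe9'\n"

# tokenize.detect_encoding() reads the declaration from the first
# lines and reports a normalized codec name.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# compile() given bytes decodes them according to the declaration.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
name = namespace["name"]
```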

> > We will be glad if Python understands Japanese (and other)
> > characters by default (by adopting, say, UTF-8 as default).

> I think that in the future, we will be able to change the default to
> UTF-8.  Picking ASCII as the "official" default has the advantage that
> it will let us switch to UTF-8 in the future, when we feel that there
> is enough support for UTF-8 in the average computer system.

If one lacks adequate support for UTF-8 but has an 8-bit clean
editor (which is mandatory for Latin-1 anyway), then I think
UTF-8 is effectively the same as ASCII -- one can program
entirely in ASCII and simply cannot enter national characters
directly.

Others may distribute programs in UTF-8, but they will keep the
use of national characters within reason (say, to their
signatures) if they want their programs to be usable all over
the world.  National characters may be displayed as some kind
of meta characters in such an editor.

# Of course, this may not be the case if the program depends
# deeply upon the local culture... (One example comes to my mind:
# I Ching -- it is pan-East-Asian)

--
SUZUKI Hisao <suzuki@acm.org> <suzuki611@okisoft.co.jp>