[Python-Dev] PEP 263 considered faulty (for some Japanese)

Mon, 11 Mar 2002 23:32:57 -0500

>    I am a Japanese fan/developer/user of Python for years.  I
> have recently read the PEP 263 --- Defining Python Source Code
> Encodings.  I have been discussing about it on the Japanese
> mailing list of Python last week, and I and others found a
> severe fault in it.
>    I have also read the Parade of the PEPs and know that it is
> very close to being checked in, so I am writing this message to
> you in English in a hurry.  The PEP 263, as is, will damage the
> usability of Python in Japan.

Thanks for writing!  I promise you that we won't hurry to check in the
PEP until we have thoroughly examined your objection.  Since encodings
are much more important for your country than for most western
countries, it would be a mistake if we added a feature that had the
opposite of its intended effect for you!

>    The PEP says, "Just as in coercion of strings to Unicode,
> Python will default to the interpreter's default encoding (which
> is ASCII in standard Python installations) as standard encoding
> if no other encoding hints are given."  This will let many
> English people free from writing the magic comment to their
> scripts explicitly.  However, many Japanese set the default
> encoding other than ASCII (we use multi-byte encodings for daily
> use, not as luxury), and some Japanese set it, say, "utf-16".

Is your objection specifically focused on UTF-16?  As far as I
understand, UTF-16 is (mostly) a two-byte encoding that is not a
superset of ASCII (i.e. the 8-bit string "abcd", when interpreted
using UTF-16, does not mean the same thing as the Unicode string
u"abcd").  This sets UTF-16 apart from most other encodings, in
particular UTF-8, but also (I believe) the common Japanese 8-bit
encodings like Shift-JIS and EUC-JP.
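
The superset claim can be checked for one sample string.  The sketch
below is Python 3, where the 8-bit string is spelled b"abcd" and must be
decoded explicitly.  The codec names are the ones Python ships;
"utf-16-le" is picked instead of "utf-16" only so the result does not
depend on the machine's byte order.  The helper name is made up for
illustration.  One sample string says nothing about the rest of the
ASCII range, so treat this as a spot check, not a proof.

```python
# Spot check: which encodings decode plain ASCII bytes to the same text?
ASCII_BYTES = b"abcd"

# "utf-16-le" fixes the byte order so the result is reproducible.
CANDIDATES = ["ascii", "utf-8", "latin-1", "shift_jis", "euc_jp", "utf-16-le"]


def decodes_ascii_unchanged(encoding):
    """Return True if ASCII_BYTES decodes to the text it spells in ASCII."""
    return ASCII_BYTES.decode(encoding) == ASCII_BYTES.decode("ascii")


results = {name: decodes_ascii_unchanged(name) for name in CANDIDATES}

for name, unchanged in results.items():
    verdict = "same as ASCII" if unchanged else "differs from ASCII"
    print(name.ljust(10), verdict)
```

For this input only the UTF-16 entry differs.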

You write "set the default encoding".  There are many ways to set a
default encoding.  Python has a very specific way to set its default
encoding: the only way is to edit the site.py library module.  Is this
what you are referring to?

I would think that setting Python's default encoding to UTF-16 in this
way is a bad idea, because it breaks the main purpose of the default
encoding: to allow an automatic coercion from the 8-bit strings that
are used in many places in Python programs to Unicode strings.  It
would mean that a program that writes

    if "abcd" == u"abcd":
       print "OK"
    else:
       print "Booh"

would print "Booh", because "abcd" interpreted as UTF-16 is a
two-character Unicode string (and I wouldn't be surprised if it
contained invalid code points).
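
The same comparison can be reproduced in Python 3, which has no
implicit coercion, by decoding explicitly with the encoding that would
have been the default.  This is a sketch of the behaviour described
above, not the interpreter's own code path; coerce() and compare() are
hypothetical helpers, and little-endian byte order is assumed for
UTF-16.

```python
import unicodedata


def coerce(byte_string, default_encoding):
    """Emulate the implicit 8-bit-to-Unicode coercion with an explicit decode."""
    return byte_string.decode(default_encoding)


def compare(default_encoding):
    """Mirror the if/else above for a given default encoding."""
    return "OK" if coerce(b"abcd", default_encoding) == "abcd" else "Booh"


ascii_result = compare("ascii")
utf16_result = compare("utf-16-le")  # byte order fixed for reproducibility

# Look at what the four bytes turn into under UTF-16.
decoded = coerce(b"abcd", "utf-16-le")
code_points = ["U+%04X" % ord(ch) for ch in decoded]
categories = [unicodedata.category(ch) for ch in decoded]

print(ascii_result, utf16_result, code_points, categories)
```

With little-endian byte order the four bytes become the two code
points U+6261 and U+6463.  Both fall in the CJK ideograph range, so
this particular input decodes to valid characters, just not to "abcd".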

For this reason, I find it hard to believe that people really set the
Python default encoding in site.py to "utf-16".  Maybe I'm wrong -- or
maybe you're talking about a different default encoding?

>    By the PEP as is, persons who use "utf-16" etc. will not be
> able to use many Python scripts any more.  Certainly you can
> tell them not to use "utf-16" as the default encoding.  But some
> of them have been writing their scripts in ASCII just as
> specified in the Language Reference, just omitting the encoding
> specification from their scripts to handle their Unicode
> documents easily.  Thus it would be safe to say that it is
> simply unfair.

It sounds like these people never rely on the implicit conversion
between 8-bit strings and Unicode as I showed above, but instead use
explicit conversions from data read from Unicode files, omitting the
encoding.  So maybe you really do mean what I fear (setting Python's
default encoding to UTF-16 in site.py).

>    I would propose that Python should default to ASCII as
> standard encoding if no other encoding hints are given, as the
> bottom line.  The interpreter's default encoding should not be
> referred for source code.

Unfortunately, this doesn't work for people in Europe, who set Latin-1
as the default encoding, and want to use Latin-1 in their source
files.

I think I can propose a compromise though: there may be two default
encodings, one used for Python source code, and one for data.
Normally, a single default encoding is used for both cases, but it is
possible to specify two different defaults, and then persons who like
UTF-16 can set ASCII as the default source encoding but UTF-16 as the
default data encoding.
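
To make the compromise concrete, here is a minimal sketch of how two
defaults with a shared fallback could behave.  The EncodingDefaults
class and its field names are hypothetical; nothing like it exists in
Python or in the PEP, and the real mechanism might look quite
different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncodingDefaults:
    """Hypothetical settings object; not an existing Python API."""

    default: str = "ascii"        # used for both roles unless overridden
    source: Optional[str] = None  # override for Python source code
    data: Optional[str] = None    # override for 8-bit string data

    def source_encoding(self):
        """Encoding used to read source files."""
        return self.source if self.source is not None else self.default

    def data_encoding(self):
        """Encoding used to coerce 8-bit data to Unicode."""
        return self.data if self.data is not None else self.default


# The normal case: one default serves both purposes.
single = EncodingDefaults(default="latin-1")

# The split case: ASCII for source code, UTF-16 for data.
split = EncodingDefaults(source="ascii", data="utf-16")

print(single.source_encoding(), single.data_encoding())
print(split.source_encoding(), split.data_encoding())
```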

>    And I hope that Python defaults to UTF-8 as standard encoding
> if no other encoding hints are given.  It is ASCII-compatible
> perfectly and language-neutral.  If you once commit yourself to
> Unicode, I think, UTF-8 is an obvious choice anyway.

I'm not sure I understand.  (I understand UTF-8 perfectly well. :-)
In the previous paragraph you propose to default to ASCII.  In this
paragraph you propose to default to UTF-8.  Now which do you want?  Or
do you want to propose these two for different situations?

Note that I originally wanted to use UTF-8 as the default encoding,
but was convinced otherwise by the Europeans, who rarely use UTF-8 but
often use Latin-1.  Rather than giving anyone preferential treatment
(except for the Americans, whose keyboards don't have keys that
generate non-ASCII characters :-), I decided that the only fair
solution was to default to ASCII, which has the property that any
non-ASCII characters are considered an error.  But of course, the
option to edit site.py sort of defeats this purpose. :-)
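
The "any non-ASCII character is an error" property is easy to see with
a strict decode.  The sketch below is Python 3; decode_source() is a
hypothetical helper, and the sample line is invented for illustration.

```python
def decode_source(raw, encoding="ascii"):
    """Decode source bytes strictly; undecodable bytes raise UnicodeDecodeError."""
    return raw.decode(encoding, errors="strict")


# One line of source containing an e-acute, stored as Latin-1 (byte 0xE9).
latin1_source = 'name = "caf\xe9"\n'.encode("latin-1")

try:
    decode_source(latin1_source)
    ascii_outcome = "accepted"
except UnicodeDecodeError as exc:
    ascii_outcome = "rejected at byte offset %d" % exc.start

# The same bytes are fine once Latin-1 is named explicitly.
latin1_text = decode_source(latin1_source, "latin-1")

print(ascii_outcome)
print(latin1_text, end="")
```

Under the ASCII default the file is refused outright instead of being
silently misread, which is the fairness argument made above.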

>    From my experiences, inserting the '-*- coding: <coding name>
> -*-' line into an existing file and converting such a file into
> UTF-8 are almost the same amount of work.

Yes, for those people who have a UTF-8 toolchain set up.  I expect
that many Europeans don't have one handy, because their needs are met
by Latin-1.
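
For reference, finding the coding line mentioned in the quoted
paragraph takes only a short scan of the first two lines of a file.
The pattern below is modeled on the one given in PEP 263 as I recall
it; declared_encoding() and its "ascii" fallback are assumptions made
for this sketch, and details such as BOM handling are left out.

```python
import re

# Modeled on the PEP 263 pattern; treat the exact regex as an assumption.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source, fallback="ascii"):
    """Return the encoding named on line 1 or 2 of source, else fallback."""
    for line in source.splitlines()[:2]:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        match = CODING_RE.match(line.decode("latin-1"))
        if match:
            return match.group(1)
    return fallback


with_hint = b"#!/usr/bin/python\n# -*- coding: euc-jp -*-\nprint('x')\n"
without_hint = b"print('x')\n"

print(declared_encoding(with_hint))
print(declared_encoding(without_hint))
```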

> We will be glad if Python understands Japanese (and other)
> characters by default (by adopting, say, UTF-8 as default).

I think that in the future, we will be able to change the default to
UTF-8.  Picking ASCII as the "official" default has the advantage that
it will let us switch to UTF-8 in the future, when we feel that there
is enough support for UTF-8 in the average computer system.
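
The reason ASCII leaves that door open is that every valid ASCII file
is also valid UTF-8 with the same meaning, which is not true of
Latin-1.  A small Python 3 check, with a hypothetical helper name:

```python
def same_under_utf8(raw):
    """True if raw is valid ASCII and decodes to identical text as UTF-8."""
    return raw.decode("ascii") == raw.decode("utf-8")


# Every one of the 128 ASCII byte values, in order.
every_ascii_byte = bytes(range(128))
ascii_is_forward_compatible = same_under_utf8(every_ascii_byte)

# A Latin-1 byte such as 0xE9 (e-acute) is not valid UTF-8 on its own.
try:
    b"caf\xe9".decode("utf-8")
    latin1_is_forward_compatible = True
except UnicodeDecodeError:
    latin1_is_forward_compatible = False

print(ascii_is_forward_compatible, latin1_is_forward_compatible)
```

So files accepted under an ASCII default keep working after a switch
to UTF-8, while files that relied on a Latin-1 default would need
converting or an explicit coding line.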

--Guido van Rossum (home page: http://www.python.org/~guido/)