[Python-ideas] Fix default encodings on Windows

Sat Aug 13 04:12:32 EDT 2016

Steve Dower writes:

 > ISTM that changing sys.getfilesystemencoding() on Windows to
 > "utf-8" and updating path_converter() (Python/posixmodule.c;

I think this proposal requires the assumption that strings intended to
be interpreted as file names invariably come from the Windows APIs.  I
don't think that is true: Makefiles and the like, configuration files,
and zipfiles (see below) all typically contain file names.  Python is
frequently used as a glue language, so presumably receives such file
name information as (more or less opaque) bytes objects over IPC
channels.  These just aren't under OS control, so the assumption will
fail.

Supporting Windows users in Japan means dealing with lots of crap
produced by standards-oblivious software.  E.g., Shift JIS file names in
zipfiles.  AFAICT Windows itself never does that, but the majority of
zipfiles I get from colleagues have Shift JIS in the directory (and
it's the great majority if you assume that people who use ASCII
transliterations are doing so because they know that non-Windows-users
can't handle Shift JIS file names in zipfiles).
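
To make the zipfile case concrete, here is a minimal sketch of how a
Python program might recover such names.  It relies on the zipfile
module's default of decoding member names as cp437 when the archive
does not set the UTF-8 flag.  The helper recover_name() and the guess
of cp932 as the producer's encoding are my assumptions for
illustration, not anything the archive itself tells you.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, legacy: str = "cp932") -> str:
    """Best-effort recovery of a legacy-encoded member name (hypothetical helper)."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # archive declared UTF-8; trust it
    try:
        raw = info.filename.encode("cp437")  # undo zipfile's default decoding
        return raw.decode(legacy)  # assumption: producer used Shift JIS
    except UnicodeError:
        return info.filename  # not what we guessed; leave it alone


# Simulate what zipfile reports for a Shift JIS name without the UTF-8 flag.
original = "日本語.txt"
seen = zipfile.ZipInfo(original.encode("cp932").decode("cp437"))
recovered = recover_name(seen)
```

The point is that the encoding has to be guessed from outside
knowledge; nothing the OS reports helps here.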

So I believe bytes-oriented software must expect non-UTF-8 file names
in Japan.  UTF-8 may have penetration in the rest of the world, but
the great majority of my Windows-using colleagues in Japan still
habitually and by preference use Shift JIS in text files.  I suppose
that includes files that are used by programs, and thus file names,
and probably extends to most Windows users here.

I suspect a similar situation holds in China, where AIUI "GB is not
just a good idea, it's the law,"[1] and possibly Taiwan (Big 5) and Korea
(KSC) as those standards have always provided the benefits of (nearly)
universal repertoires[2].

 > and add the requirement that [bytes file names] *must* be encoded
 > with sys.getfilesystemencoding().

To the extent that this *can* work, it *already* works.  Trying to
enforce a particular encoding will simply break working code that
depends on sys.getfilesystemencoding() matching the encoding that
other programs use.
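
A small sketch of the failure mode I have in mind, assuming a bytes
file name that arrives from a configuration file or pipe written on a
cp932 system.  decode_name() is a hypothetical stand-in for whatever
decoding the file system layer would apply, not an actual CPython
function.

```python
from typing import Optional


def decode_name(raw: bytes, fs_encoding: str) -> Optional[str]:
    """Return the text name, or None if raw is not valid in fs_encoding."""
    try:
        return raw.decode(fs_encoding)
    except UnicodeDecodeError:
        return None


# A name as it might arrive from data written by another program (assumed cp932).
external = "日本語.txt".encode("cp932")

agreed = decode_name(external, "cp932")    # producer and consumer agree
enforced = decode_name(external, "utf-8")  # consumer now insists on UTF-8
```

With matching encodings the name round-trips; with an enforced UTF-8
the same bytes are simply invalid.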

You have no carrot.  These changes enforce an encoding on bytes for
Windows APIs but can't do so for data, and so will make file-names-
are-just-bytes programmers less happy with Python, not more happy.

The exception is the proposed console changes, because there you *do*
perform all I/O with OS APIs.  But I don't know anything about the
Windows console except that nobody seems happy with it.

 > Similarly, locale.getpreferredencoding() on Windows returns a
 > legacy value - the user's active code page - which should generally
 > not be used for any reason.

This is even less supportable, because it breaks much code that used
to work without specifying an encoding.

Refusing to respect the locale preferred encoding would force most
Japanese scripters to specify encodings where they currently accept
the system default, I suspect.  On those occasions when my Windows-using
colleagues deliver text files, they are *always* encoded in Shift JIS.
University databases that deliver CSV files allow selecting Shift JIS
or UTF-8, and most people choose Shift JIS.  And so on.  In Japan,
Shift JIS remains pervasive on Windows.
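
Roughly what that looks like for a scripter, as a sketch only.
read_text() is a hypothetical wrapper, and the CSV content is made up;
the assumption is a file written in cp932 by a Windows tool.

```python
import locale
import os
import tempfile


def read_text(path: str, encoding: str = "") -> str:
    """Read a text file; an empty encoding means 'use the locale default'."""
    chosen = encoding or locale.getpreferredencoding(False)
    with open(path, encoding=chosen) as f:
        return f.read()


# A CSV header as a colleague's Windows tool might write it (assumed cp932).
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write("氏名,所属\n".encode("cp932"))

explicit = read_text(path, "cp932")  # works on any machine
try:
    read_text(path, "utf-8")  # what a forced UTF-8 default would do
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
os.remove(path)
```

Today a script on a Japanese Windows machine can call read_text(path)
with no encoding and get the right answer; with a UTF-8 default it
would have to pass "cp932" explicitly.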

I don't think Japan is special in this, except in the pervasiveness of
Shift JIS.  I think the change would cost everybody more than it
benefits them.

 > BOMs are very common on Windows, since the default assumption is
 > nearly always a bad idea.

I agree (since 1990!) that Shift JIS by default is a bad idea, but
there's no question that it is still overwhelmingly popular.  I
suspect UTF-8 signatures are uncommon, too, as most UTF-8 originates
on Mac or *nix platforms.

 > This would match the behavior that the .NET Framework has used for
 > many years - effectively, utf_8_sig on read and utf_8 on write.

But .NET is a framework.  It expects to be the world in which programs
exist, no?  Python is very frequently used as a glue language, and I
suspect the analogy fails due to that distinction.
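
For reference, the quoted behavior (utf_8_sig on read, utf_8 on write)
can be sketched with Python's existing codecs.  The sample string is
arbitrary.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

# Reading: utf-8-sig strips a leading signature if present,
# and otherwise behaves like plain utf-8.
read_a = with_bom.decode("utf-8-sig")
read_b = without_bom.decode("utf-8-sig")

# Reading with plain utf-8 keeps the signature as U+FEFF in the text.
kept = with_bom.decode("utf-8")

# Writing: plain utf-8 never emits a signature.
written = read_a.encode("utf-8")
```

This is fine for files that really are UTF-8; it does nothing for the
Shift JIS files discussed above.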

Footnotes: 
[1]  Strictly speaking, certain programs must support GB 18030.  I
don't think it's legally required to be the default encoding.

[2]  For example, the most restricted Japanese standard, JIS X 0208,
includes not only "full-width" versions of ASCII characters, but the
full Greek and Cyrillic alphabets, many math symbols, a full line
drawing set, and much more besides the native syllabary and Han
ideographs.  The elderly Chinese GB 2312 not only includes Greek and
Cyrillic, and the various symbols, but also the Japanese syllabaries.
(And the more recent GB 18030 swallowed Unicode whole.)