Steve Dower writes:
 > ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c;
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and similar, configuration files, all typically contain file names. Zipfiles (see below). Python is frequently used as a glue language, so presumably receives such file name information as (more or less opaque) bytes objects over IPC channels. These just aren't under OS control, so the assumption will fail.

Supporting Windows users in Japan means dealing with lots of crap produced by standards-oblivious software. E.g., Shift JIS file names in zipfiles. AFAICT Windows itself never does that, but the majority of zipfiles I get from colleagues have Shift JIS in the directory (and it's the great majority if you assume that people who use ASCII transliterations are doing so because they know that non-Windows-users can't handle Shift JIS file names in zipfiles). So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.

UTF-8 may have penetration in the rest of the world, but the great majority of my Windows-using colleagues in Japan still habitually and by preference use Shift JIS in text files. I suppose that includes files that are used by programs, and thus file names, and probably extends to most Windows users here. I suspect a similar situation holds in China, where AIUI "GB is not just a good idea, it's the law,"[1] and possibly Taiwan (Big 5) and Korea (KSC), as those standards have always provided the benefits of (nearly) universal repertoires[2].
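To make the zipfile case concrete, here is a minimal sketch of what a script has to do with such names. It relies on the fact that the zipfile module decodes member names that lack the UTF-8 flag as cp437. The helper name recover_legacy_name and the choice of cp932 as the legacy encoding are assumptions for illustration only.

```python
def recover_legacy_name(name: str, legacy_encoding: str = "cp932") -> str:
    """Redecode a zip member name that zipfile decoded as cp437.

    Python's cp437 codec maps all 256 byte values, so encoding the
    mis-decoded name back to cp437 recovers the raw bytes stored in
    the archive directory. Those bytes are then decoded with the
    encoding the archive's creator (hypothetically) used.
    """
    raw = name.encode("cp437")
    return raw.decode(legacy_encoding)


# Simulate what zipfile would hand back for a Shift JIS member name.
original = "日本語.txt"
mojibake = original.encode("cp932").decode("cp437")

recovered = recover_legacy_name(mojibake)
print(recovered == original)
```

Python 3.11 and later also accept a metadata_encoding argument to zipfile.ZipFile when reading, which avoids the manual round trip. Either way the caller has to know or guess the encoding, because the archive does not record it.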
 > and add the requirement that [bytes file names] *must* be encoded with sys.getfilesystemencoding().
To the extent that this *can* work, it *already* works. Trying to enforce a particular encoding will simply break working code that depends on sys.getfilesystemencoding() matching the encoding that other programs use. You have no carrot. These changes enforce an encoding on bytes for Windows APIs but can't do so for data, and so will make file-names-are-just-bytes programmers less happy with Python, not more happy.

The exception is the proposed console changes, because there you *do* perform all I/O with OS APIs. But I don't know anything about the Windows console except that nobody seems happy with it.
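A small sketch of the failure mode, assuming a bytes file name that arrived from a Shift JIS configuration file or IPC peer rather than from the OS. The helper can_decode is hypothetical and exists only for this illustration.

```python
def can_decode(raw: bytes, encoding: str) -> bool:
    """Return True if raw is valid in the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A file name as another program might have written it (assumed cp932).
sjis_name = "日本語.txt".encode("cp932")

ok_as_sjis = can_decode(sjis_name, "cp932")
ok_as_utf8 = can_decode(sjis_name, "utf-8")
print(ok_as_sjis, ok_as_utf8)
```

The same bytes are acceptable under one filesystem encoding and rejected under the other, which is why code that passes such bytes straight through depends on the two encodings matching.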
 > Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason.
This is even less supportable, because it breaks much code that used to work without specifying an encoding. Refusing to respect the locale preferred encoding would force most Japanese scripters to specify encodings where they currently accept the system default, I suspect.

On those occasions when my Windows-using colleagues deliver text files, they are *always* encoded in Shift JIS. University databases that deliver CSV files allow selecting Shift JIS or UTF-8, and most people choose Shift JIS. And so on. In Japan, Shift JIS remains pervasive on Windows.

I don't think Japan is special in this, except in the pervasiveness of Shift JIS. I think the change would impose more loss than benefit on everybody.
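As a rough sketch of what "specify encodings" would mean for such a scripter: the file below stands in for one delivered by a colleague, and is assumed to be cp932. The helper read_text and the file name greeting.txt are made up for the example.

```python
import locale
import os
import tempfile


def read_text(path, encoding=None):
    # With encoding=None, open() falls back to its default. On the
    # Japanese Windows setups described here that default has
    # traditionally been the locale's preferred encoding (cp932).
    with open(path, encoding=encoding) as f:
        return f.read()


text = "こんにちは\n"
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "greeting.txt")
    # Write the bytes exactly as a Shift JIS editor would.
    with open(path, "wb") as f:
        f.write(text.encode("cp932"))

    as_sjis = read_text(path, "cp932")
    try:
        read_text(path, "utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

print("locale default:", locale.getpreferredencoding(False))
print(as_sjis == text, utf8_ok)
```

A script that calls read_text(path) with no encoding works today wherever the locale default matches the file, and would need an explicit encoding="cp932" if the default stopped following the locale.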
 > BOMs are very common on Windows, since the default assumption is nearly always a bad idea.
I agree (since 1990!) that Shift JIS by default is a bad idea, but there's no question that it is still overwhelmingly popular. I suspect UTF-8 signatures are uncommon, too, as most UTF-8 originates on Mac or *nix platforms.
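For reference, this is how Python's existing codecs treat the UTF-8 signature. It is a minimal sketch using only the standard utf-8 and utf-8-sig codecs, and it says nothing about how common signatures are in practice.

```python
import codecs

with_sig = codecs.BOM_UTF8 + "text".encode("utf-8")
without_sig = "text".encode("utf-8")

# utf-8-sig strips a leading signature if present and otherwise
# behaves like plain utf-8 when decoding.
decoded_with = with_sig.decode("utf-8-sig")
decoded_without = without_sig.decode("utf-8-sig")

# Plain utf-8 keeps the signature as a U+FEFF character.
decoded_plain = with_sig.decode("utf-8")

# On encoding, utf-8 never writes a signature; utf-8-sig always does.
encoded_plain = "text".encode("utf-8")
encoded_sig = "text".encode("utf-8-sig")

print(decoded_with, decoded_without, repr(decoded_plain))
```

Reading with utf-8-sig is therefore harmless for signature-free UTF-8, but it does nothing for files that are not UTF-8 at all.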
 > This would match the behavior that the .NET Framework has used for many years - effectively, utf_8_sig on read and utf_8 on write.
But .NET is a framework. It expects to be the world in which programs exist, no? Python is very frequently used as a glue language, and I suspect the analogy fails due to that distinction.

Footnotes:

[1] Strictly speaking, certain programs must support GB 18030. I don't think it's legally required to be the default encoding.

[2] For example, the most restricted Japanese standard, JIS X 0208, includes not only "full-width" versions of ASCII characters, but the full Greek and Cyrillic alphabets, many math symbols, a full line drawing set, and much more besides the native syllabary and Han ideographs. The elderly Chinese GB 2312 not only includes Greek and Cyrillic, and the various symbols, but also the Japanese syllabaries. (And the more recent GB 18030 swallowed Unicode whole.)