[Python-ideas] Fix default encodings on Windows

Sat Aug 13 08:23:35 EDT 2016

On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
> Steve Dower writes:
>  > ISTM that changing sys.getfilesystemencoding() on Windows to
>  > "utf-8" and updating path_converter() (Python/posixmodule.c;
> 
> I think this proposal requires the assumption that strings intended to
> be interpreted as file names invariably come from the Windows APIs.  I
> don't think that is true: Makefiles and similar, configuration files,
> all typically contain filenames.  Zipfiles (see below). 

And what's going to happen if you shovel those bytes into the
filesystem without conversion on Linux, or worse, OS X? This problem
isn't unique to Windows.

> Python is frequently used as a glue language, so presumably receives
> such file name information as (more or less opaque) bytes objects over
> IPC channels.

They *can't* be opaque. Someone has to decide what they mean, and you as
the application developer might well have to step up and *be that
someone*. If you don't, someone else will decide for you.

> These just aren't under OS control, so the assumption will
> fail.
> 
> So I believe bytes-oriented software must expect non-UTF-8 file names
> in Japan. 

The only way to deal with data representing filenames and destined for
the filesystem on Windows is to convert it, somehow, ultimately to
UTF-16-LE. Not doing so is impossible; it's only a question of which
layer it happens in. If you convert it using the wrong encoding, you
lose. The only way to deal with it on Mac OS X is to convert it to
UTF-8. If you don't, you lose. If you convert it using the wrong
encoding, you lose.
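
To make "you lose" concrete, here is a small sketch in plain Python.
The file name is made up for illustration and nothing here touches a
real filesystem; it only shows what happens to the bytes when the
decoding layer guesses wrong.

```python
# Illustrative only: the file name below is invented for this example.
name = "日本語.txt"

sjis_bytes = name.encode("shift_jis")  # what a Japanese-locale tool might emit
utf8_bytes = name.encode("utf-8")      # what a UTF-8 tool would emit

# Windows ultimately wants UTF-16-LE, so some layer has to decode first.
# With the right encoding, the conversion is lossless.
utf16_name = sjis_bytes.decode("shift_jis").encode("utf-16-le")

# Guessing UTF-8 for Shift-JIS bytes fails outright for this name.
try:
    sjis_bytes.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# Guessing Shift-JIS for UTF-8 bytes yields a different (garbage) name;
# errors="replace" is used here only so the result can be inspected.
mojibake = utf8_bytes.decode("shift_jis", errors="replace")
```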

This proposal embodies an assumption that bytes from unknown sources
used as filenames are more likely to be UTF-8 than to be in the locale's
ANSI code page (ACP, i.e. "mbcs" in pythonspeak, and Shift-JIS in Japan).
Personally, I
think the whole edifice is rotten, and choosing one encoding over
another isn't a solution; the only solution is to require the
application to make a considered decision about what the bytes mean and
pass its best effort at converting to a Unicode string to the API. This
is true on Windows, it's true on OS X, and I would argue it's pretty
close to being true on Linux except in a few very niche cases. So I
think for the filesystem encoding we should stay the course, continuing
to print a DeprecationWarning and maybe, just maybe, eventually actually
deprecating it.
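
A minimal sketch of what I mean by a considered decision. The helper
to_path() and the sample file name are hypothetical, invented for this
illustration; the point is only that the application names the
encoding explicitly and hands a str to the filesystem API, rather than
passing bytes and hoping.

```python
def to_path(raw: bytes, encoding: str) -> str:
    """Hypothetical helper: decode file-name bytes with an encoding
    the caller chose deliberately."""
    # Strict error handling: a wrong decision raises UnicodeDecodeError
    # instead of quietly producing a garbage file name.
    return raw.decode(encoding, errors="strict")


# Assumption for the example: the application knows from the protocol,
# config, or file format that these bytes are Shift-JIS.
raw_name = "報告書.txt".encode("shift_jis")
path = to_path(raw_name, "shift_jis")

# 'path' is now a str, which is what should be passed to open(),
# os.rename(), and friends on every platform.
```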

On Windows and OS X, this "glue language" business of shoveling bytes
from one place to another without caring what they mean can only last as
long as they don't touch the filesystem.

> You have no carrot.  These changes enforce an encoding on bytes for
> Windows APIs but can't do so for data, and so will make file-names-
> are-just-bytes programmers less happy with Python, not more happy.

I think the use case that the proposal has in mind is a
file-names-are-just-bytes program (or set of programs) that reads from
the filesystem, converts to bytes for a file/network, and then
eventually does the reverse - either end may be on Windows. Using UTF-8
will allow those to make the round trip (strictly speaking, you may need
surrogatepass, and OS X does its weird normalization thing); using any
other encoding (except perhaps GB18030) will not.
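
For the curious, a rough sketch of the round trip and its two caveats,
using only the standard codecs and unicodedata. The file names are
made up, cp1252 merely stands in for "some legacy code page", and the
NFD step only approximates what OS X does to names on disk.

```python
import unicodedata

# A Windows file name is a sequence of UTF-16 code units, so it can
# contain a lone surrogate that is not valid Unicode text.
name = "report\ud800.txt"

# Strict UTF-8 rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The surrogatepass error handler lets it through in both directions.
wire = name.encode("utf-8", errors="surrogatepass")
round_tripped = wire.decode("utf-8", errors="surrogatepass")

# A legacy code page cannot represent every name, so no round trip.
try:
    "日本語.txt".encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# Normalization caveat: a decomposed name compares unequal to the
# composed one you wrote until you normalize both to the same form.
composed = "caf\u00e9.txt"
decomposed = unicodedata.normalize("NFD", composed)
```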