
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c;
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and similar, configuration files, all typically contain filenames. Zipfiles (see below).
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OSX? This problem isn't unique to Windows.
Python is frequently used as a glue language, so presumably receives such file name information as (more or less opaque) bytes objects over IPC channels.
They *can't* be opaque. Someone has to decide what they mean, and you as the application developer might well have to step up and *be that someone*. If you don't, someone else will decide for you.
These just aren't under OS control, so the assumption will fail.
So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.
The only way to deal with data representing filenames and destined for the filesystem on Windows is to convert it, somehow, ultimately to UTF-16-LE. Not doing so is impossible; it's only a question of which layer it happens in. If you convert it using the wrong encoding, you lose. The only way to deal with it on Mac OS X is to convert it to UTF-8. If you don't, you lose. If you convert it using the wrong encoding, you lose.
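To make the "wrong encoding, you lose" point concrete, here is a rough sketch in Python. The file name and the choice of Shift-JIS are assumptions for illustration only; the point is that the bytes carry no label saying which decoding is right.

```python
# Illustrative only: the name "日本語.txt" (written with escapes to keep the
# source ASCII) as a Shift-JIS tool might store it in a config file.
expected = "\u65e5\u672c\u8a9e.txt"
raw = expected.encode("shift_jis")

# Guess 1: treat the bytes as UTF-8. Here that fails loudly.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Guess 2: treat the bytes as Latin-1. This never raises, so it fails
# silently and yields a different (garbled) name.
mojibake = raw.decode("latin-1")

# Correct interpretation, followed by the conversion Windows ultimately
# needs somewhere in the stack: UTF-16-LE code units.
name = raw.decode("shift_jis")
wire = name.encode("utf-16-le")
```

Only the third path produces the file the user meant; the second one happily creates a file with the wrong name.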
This proposal embodies an assumption that bytes from unknown sources used as filenames are more likely to be UTF-8 than in the locale ACP (i.e. "mbcs" in pythonspeak, and Shift-JIS in Japan). Personally, I think the whole edifice is rotten, and choosing one encoding over another isn't a solution; the only solution is to require the application to make a considered decision about what the bytes mean and pass its best effort at converting to a Unicode string to the API. This is true on Windows, it's true on OSX, and I would argue it's pretty close to being true on Linux except in a few very niche cases. So I think for the filesystem encoding we should stay the course, continuing to print a DeprecationWarning and maybe, just maybe, eventually actually deprecating it.
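As a sketch of what "a considered decision" might look like in application code: the helper below is hypothetical (`decode_filename` is not a standard library function). It takes the encoding the application has decided on, decodes strictly, and refuses to guess when the bytes don't fit.

```python
def decode_filename(raw, declared_encoding):
    """Hypothetical helper: turn file-name bytes into str using an encoding
    the application has explicitly chosen, failing instead of guessing."""
    try:
        return raw.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"file name bytes {raw!r} are not valid {declared_encoding}"
        ) from exc


# Usage: the caller states where the bytes came from and what that implies.
sjis_bytes = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")
path = decode_filename(sjis_bytes, "shift_jis")  # str, safe to hand to open()
```

Where the declared encoding comes from (a protocol header, a config option, the user) is exactly the decision the application has to own.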
On Windows and OSX, this "glue language" business of shoveling bytes from one place to another without caring what they mean can only last as long as they don't touch the filesystem.
You have no carrot. These changes enforce an encoding on bytes for Windows APIs but can't do so for data, and so will make file-names-are-just-bytes programmers less happy with Python, not more happy.
I think the use case that the proposal has in mind is a file-names-are-just-bytes program (or set of programs) that reads from the filesystem, converts to bytes for a file/network, and then eventually does the reverse - either end may be on Windows. Using UTF-8 will allow those to make the round trip (strictly speaking, you may need surrogatepass, and OSX does its weird normalization thing); using any other encoding (except for perhaps GB18030) will not.
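A small sketch of that round trip, for what it's worth. The names are made up; the lone surrogate stands in for the kind of ill-formed UTF-16 that Windows file names are allowed to contain, and cp1252 is just one example of a legacy code page.

```python
import unicodedata

# A name containing an unpaired surrogate (illustrative, not from a real disk).
name = "report_\ud800.txt"

# Strict UTF-8 refuses to encode a lone surrogate...
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the "surrogatepass" error handler lets it through in both directions.
data = name.encode("utf-8", "surrogatepass")
back = data.decode("utf-8", "surrogatepass")

# A legacy code page cannot represent every name, so it cannot round-trip.
try:
    "\u65e5\u672c\u8a9e.txt".encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# The OS X wrinkle: the filesystem may hand back a decomposed form of the
# name it was given. NFD is used here as an approximation of that behavior.
composed = "caf\u00e9.txt"
decomposed = unicodedata.normalize("NFD", composed)
same_text = unicodedata.normalize("NFC", decomposed) == composed
```

So the bytes survive the trip under UTF-8 with surrogatepass, but a byte-for-byte comparison of names can still fail after a visit to an OS X filesystem unless both sides normalize first.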