On 13 Aug 2016 05:23, Random832 wrote:
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c; ...)
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and the like, configuration files, and so on all typically contain filenames. Zipfiles too (see below).
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OS X? This problem isn't unique to Windows.
Yeah, this is basically my view too. If your path bytes don't come from the filesystem, you need to know the encoding regardless. But it's very reasonable to be able to round-trip. Currently, the following two lines of code can have different behaviour on Windows (i.e. the latter fails to open the file):
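[The original snippet did not survive in this excerpt. As a sketch of the kind of pair that diverges: the str form goes straight to the Unicode filesystem API, while the bytes form must first survive an encode through the active code page. U+AB00 is an arbitrary out-of-code-page character, and cp1252 stands in for Windows' 'mbcs', since the mbcs codec exists only on Windows.]

```python
# A filename containing a character with no mapping in typical legacy
# ("ANSI") code pages; U+AB00 is an arbitrary choice for illustration.
name = 'test\uab00.txt'

# open(name) hands the str straight to the Unicode filesystem API, so
# the character survives.

# open(name.encode(...)) must first squeeze the name through a bytes
# encoding; characters without a mapping in the code page raise.
try:
    name.encode('cp1252')          # stand-in for Windows 'mbcs'
    print('encodable')
except UnicodeEncodeError:
    print('not encodable: this is the bytes-based open() that fails')

# With utf-8 as the bytes encoding, the round-trip is lossless:
assert name.encode('utf-8').decode('utf-8') == name
```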
On Windows, the filesystem encoding is inherently Unicode, which means you can't reliably round-trip filenames through the current code page. Changing all of Python to use the Unicode APIs internally and making the bytes encoding utf-8 (or utf-16-le, which would save a conversion) resolves this and doesn't really affect
These just aren't under OS control, so the assumption will fail.
So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.
Even on Japanese Windows, non-UTF-8 file names must be encodable as UTF-16 or they cannot exist on the file system. This moves the encoding boundary into the application, which is where it needed to be anyway for robust software: "correct" path handling still requires decoding to text, and if you know that your source is encoded with the active code page, then byte_path.decode('mbcs', 'surrogateescape') is still valid.