On 13 Aug 2016 05:23, Random832 wrote:
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
Steve Dower writes:
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c; ...)
I think this proposal requires the assumption that strings intended to be interpreted as file names invariably come from the Windows APIs. I don't think that is true: Makefiles and the like, configuration files, and so on all typically contain filenames. Zipfiles too (see below).
And what's going to happen if you shovel those bytes into the filesystem without conversion on Linux, or worse, OS X? This problem isn't unique to Windows.
Yeah, this is basically my view too. If your path bytes don't come from the filesystem, you need to know the encoding regardless. But it's very reasonable to be able to round-trip. Currently, the following two lines of code can have different behaviour on Windows (i.e. the latter fails to open the file):
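[The original snippet did not survive in this excerpt. As a sketch of the kind of pair that diverges: the str form goes straight to the Unicode filesystem API, while the bytes form must first survive an encode through the active code page. U+AB00 is an arbitrary out-of-code-page character, and cp1252 stands in for Windows' 'mbcs', since the mbcs codec exists only on Windows.]

```python
# A filename containing a character with no mapping in typical legacy
# ("ANSI") code pages; U+AB00 is an arbitrary choice for illustration.
name = 'test\uab00.txt'

# open(name) hands the str straight to the Unicode filesystem API, so
# the character survives.

# open(name.encode(...)) must first squeeze the name through a bytes
# encoding; characters without a mapping in the code page raise.
try:
    name.encode('cp1252')          # stand-in for Windows 'mbcs'
    print('encodable')
except UnicodeEncodeError:
    print('not encodable: this is the bytes-based open() that fails')

# With utf-8 as the bytes encoding, the round-trip is lossless:
assert name.encode('utf-8').decode('utf-8') == name
```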
On Windows, the filesystem encoding is inherently Unicode, which means you can't reliably round-trip filenames through the current code page. Changing all of Python to use the Unicode APIs internally and making the bytes encoding utf-8 (or utf-16-le, which would save a conversion) resolves this and doesn't really affect
These just aren't under OS control, so the assumption will fail.
So I believe bytes-oriented software must expect non-UTF-8 file names in Japan.
Even on Japanese Windows, non-UTF-8 file names must be encodable as UTF-16 or they cannot exist on the file system. This moves the encoding boundary into the application, which is where it needed to be anyway for robust software: "correct" path handling still requires decoding to text, and if you know that your source is encoded with the active code page, then byte_path.decode('mbcs', 'surrogateescape') is still valid.