[Python-ideas] Fix default encodings on Windows

Mon Aug 15 14:26:34 EDT 2016

On 15Aug2016 0954, Random832 wrote:
> On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
>> I'm still not sure we're talking about the same thing right now.
>>
>> For `open(path_as_bytes).read()`, are we talking about the way
>> path_as_bytes is passed to the file system? Or the codec used to decide
>> the returned string?
>
> We are talking about the way path_as_bytes is passed to the filesystem,
> and in particular what encoding path_as_bytes is *actually* in, when it
> was obtained from a file or other stream opened in binary mode.

Okay good, we are talking about the same thing.

Passing path_as_bytes in that location on Windows has been deprecated
since 3.3, so we are well within our rights (and probably overdue) to
make it a TypeError in 3.6. While it's obviously an invalid assumption,
for the purposes of changing the language we can assume that no existing
code is passing bytes into any functions where it has been deprecated.

As far as I'm concerned, there are currently no filesystem APIs on 
Windows that accept paths as bytes.

Given that, I'm proposing adding support for using byte strings encoded
with UTF-8 in file system functions on Windows. This allows Python users
to drop platform-switching code like:

if os.name == 'nt':
    f = os.stat(os.listdir('.')[-1])
else:
    f = os.stat(os.listdir(b'.')[-1])

Or simply using the bytes variant unconditionally because they heard it 
was faster (sacrificing cross-platform correctness, since it may not 
correctly round-trip on Windows).
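
To make the round-trip problem concrete, here is a minimal sketch. It assumes cp1252 as a stand-in for the active code page (the real one varies per machine) and uses the 'replace' error handler to approximate what happens to characters the code page cannot represent; the helper name is hypothetical.

```python
# Assumption: cp1252 stands in for whatever the active code page is.
ASSUMED_CODE_PAGE = "cp1252"


def to_code_page_bytes(name: str) -> bytes:
    # Characters outside the code page become '?', so information is lost.
    return name.encode(ASSUMED_CODE_PAGE, errors="replace")


original = "report_\u6587\u4ef6.txt"  # contains two CJK characters

# Through the code page: the CJK characters cannot be recovered.
lossy_bytes = to_code_page_bytes(original)
lossy_name = lossy_bytes.decode(ASSUMED_CODE_PAGE)

# Through UTF-8: the name survives unchanged.
utf8_name = original.encode("utf-8").decode("utf-8")

print(lossy_name == original)  # False
print(utf8_name == original)   # True
```

A name that comes back changed no longer refers to the file it was listed from, which is why the bytes variant is not portable today.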

My proposal is to remove all use of the *A APIs and only use the *W
APIs. That completely removes the (already deprecated) use of bytes as
paths. I then propose to change the value returned by the (unused on
Windows) sys.getfilesystemencoding() to 'utf-8' and handle bytes being
passed into filesystem functions by transcoding into UTF-16 and calling
the *W APIs.
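
The transcoding step can be sketched in pure Python. This is an illustration only: the function names are hypothetical, and the 'surrogatepass' error handler is an assumption (the proposal above does not say how unpaired surrogates should be handled).

```python
def utf8_path_to_utf16(path: bytes) -> bytes:
    # Decode the caller's UTF-8 bytes, then re-encode as the UTF-16-LE
    # code units a *W API would expect.
    # Assumption: 'surrogatepass' lets unpaired surrogates through.
    text = path.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


def utf16_path_to_utf8(raw: bytes) -> bytes:
    # Reverse direction: names returned by the OS go back to UTF-8 bytes.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


name = "caf\u00e9_\u6587.txt".encode("utf-8")
wide = utf8_path_to_utf16(name)
print(utf16_path_to_utf8(wide) == name)  # True
```

The active code page never appears in either direction, which is the point of the change.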

This completely removes the active codepage from the chain, allows paths
returned from the filesystem to correctly round-trip via bytes in Python,
and allows those bytes paths to be manipulated at '\' characters.
(Frankly I don't mind what encoding we use, and I'd be quite happy to 
force bytes paths to be UTF-16-LE encoded, which would also round-trip 
invalid surrogate pairs. But that would prevent basic manipulation which 
seems to be a higher priority.)
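
A small sketch of why UTF-8 keeps "basic manipulation" working where UTF-16-LE would not. The example path is made up; U+5C5C is chosen because both bytes of its UTF-16-LE code unit equal the byte for '\'.

```python
# Hypothetical path: a directory, a separator, and a name starting with U+5C5C.
path = "dir\\\u5c5c.txt"

utf8 = path.encode("utf-8")
utf16 = path.encode("utf-16-le")

# In UTF-8, the byte 0x5C only ever means '\', so a byte-level split is safe.
utf8_parts = utf8.split(b"\\")

# In UTF-16-LE, 0x5C also occurs inside other code units, so the same split
# cuts through the middle of a character.
utf16_parts = utf16.split(b"\\")

print([p.decode("utf-8") for p in utf8_parts])
print(len(utf16_parts))
```

The UTF-8 split yields exactly the two path components, while the UTF-16-LE split produces extra fragments that are no longer valid UTF-16.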

This does not allow you to take bytes from an arbitrary source and
assume that they are correctly encoded for the file system. Python 3.3,
3.4 and 3.5 have been warning that doing that is deprecated and that the
path needs to be decoded using a known encoding first. At this stage,
it's time for us to either make byte paths an error, or to specify a
suitable encoding that can correctly round-trip paths.
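
As an illustration of decoding with a known encoding first, here is a minimal sketch. The helper name and the choice of Latin-1 as the source encoding are assumptions made for the example.

```python
def path_from_external_bytes(raw: bytes, source_encoding: str) -> str:
    # Decode strictly, so bytes in the wrong encoding raise
    # UnicodeDecodeError instead of silently naming the wrong file.
    return raw.decode(source_encoding)


# Bytes as they might arrive from a file opened in binary mode
# (assumed here to have been written as Latin-1).
raw = "r\u00e9sum\u00e9.txt".encode("latin-1")

name = path_from_external_bytes(raw, "latin-1")

# Guessing the encoding instead of knowing it fails loudly here.
try:
    path_from_external_bytes(raw, "utf-8")
    guessed_ok = True
except UnicodeDecodeError:
    guessed_ok = False

print(name, guessed_ok)
```

The resulting str can then be passed to filesystem functions on any platform.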

If this does not answer the question, I'm going to need the question to 
be explained more clearly for me.

Cheers,
Steve