
On 15Aug2016 0954, Random832 wrote:
On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
I'm still not sure we're talking about the same thing right now.
For `open(path_as_bytes).read()`, are we talking about the way path_as_bytes is passed to the file system? Or the codec used to decide the returned string?
We are talking about the way path_as_bytes is passed to the filesystem, and in particular what encoding path_as_bytes is *actually* in, when it was obtained from a file or other stream opened in binary mode.
Okay good, we are talking about the same thing. Passing path_as_bytes in that location has been deprecated since 3.3, so we are well within our rights (and probably overdue) to make it a TypeError in 3.6. While it's obviously an invalid assumption, for the purposes of changing the language we can assume that no existing code is passing bytes into any functions where it has been deprecated. As far as I'm concerned, there are currently no filesystem APIs on Windows that accept paths as bytes. Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like: if os.name == 'nt': f = os.stat(os.listdir('.')[-1]) else: f = os.stat(os.listdir(b'.')[-1]) Or simply using the bytes variant unconditionally because they heard it was faster (sacrificing cross-platform correctness, since it may not correctly round-trip on Windows). My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths. I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs. This completely removes the active codepage from the chain, allows paths returned from the filesystem to correctly roundtrip via bytes in Python, and allows those bytes paths to be manipulated at '\' characters. (Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation which seems to be a higher priority.) This does not allow you to take bytes from an arbitrary source and assume that they are correctly encoded for the file system. Python 3.3, 3.4 and 3.5 have been warning that doing that is deprecated and the path needs to be decoded to a known encoding first. At this stage, it's time for us to either make byte paths an error, or to specify a suitable encoding that can correctly round-trip paths. If this does not answer the question, I'm going to need the question to be explained more clearly for me. Cheers, Steve