[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Mon Aug 15 14:26:34 EDT 2016
On 15Aug2016 0954, Random832 wrote:
> On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
>> I'm still not sure we're talking about the same thing right now.
>>
>> For `open(path_as_bytes).read()`, are we talking about the way
>> path_as_bytes is passed to the file system? Or the codec used to decide
>> the returned string?
>
> We are talking about the way path_as_bytes is passed to the filesystem,
> and in particular what encoding path_as_bytes is *actually* in, when it
> was obtained from a file or other stream opened in binary mode.
Okay good, we are talking about the same thing.
Passing path_as_bytes in that location has been deprecated since 3.3, so
we are well within our rights (and probably overdue) to make it a
TypeError in 3.6. While it's obviously an invalid assumption, for the
purposes of changing the language we can assume that no existing code is
passing bytes into any functions where it has been deprecated.
As far as I'm concerned, there are currently no filesystem APIs on
Windows that accept paths as bytes.
Given that, I'm proposing adding support for using byte strings encoded
with UTF-8 in file system functions on Windows. This allows Python users
to omit switching code like:
    if os.name == 'nt':
        f = os.stat(os.listdir('.')[-1])
    else:
        f = os.stat(os.listdir(b'.')[-1])
Or simply using the bytes variant unconditionally because they heard it
was faster (sacrificing cross-platform correctness, since it may not
correctly round-trip on Windows).
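To make the round-trip concern concrete, here is a minimal sketch. It is only an approximation: 'cp1252' stands in for the active code page (an assumption, since the real code page depends on the machine), and errors='replace' is used to imitate a lossy conversion rather than to reproduce exactly what the *A APIs do.

```python
# Assumption: 'cp1252' stands in for the active code page, and
# errors='replace' imitates a lossy conversion to that code page.
name = '\u65e5\u672c.txt'  # a file name with characters outside cp1252

# Going through the code page loses the characters that it cannot represent.
via_codepage = name.encode('cp1252', errors='replace')
round_tripped = via_codepage.decode('cp1252')

# Going through UTF-8 preserves every character.
via_utf8 = name.encode('utf-8').decode('utf-8')

print(via_codepage)
print(round_tripped == name)
print(via_utf8 == name)
```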
My proposal is to remove all use of the *A APIs and only use the *W
APIs. That completely removes the (already deprecated) use of bytes as
paths. I then propose to change the (unused on Windows)
sys.getfilesystemencoding() to 'utf-8' and handle bytes being passed into
filesystem functions by transcoding into UTF-16 and calling the *W APIs.
This completely removes the active codepage from the chain, allows paths
returned from the filesystem to correctly roundtrip via bytes in Python,
and allows those bytes paths to be manipulated at '\' characters.
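As a rough sketch of the transcoding step described above (not the actual implementation), the helpers to_wide() and from_wide() below are hypothetical names that model what would happen on either side of a *W call: bytes are decoded as UTF-8 and re-encoded as UTF-16 on the way in, and the reverse on the way out.

```python
def to_wide(path):
    """Hypothetical helper: the UTF-16-LE bytes a *W API would be given."""
    if isinstance(path, bytes):
        # Bytes paths are assumed to be UTF-8, per the proposal.
        path = path.decode('utf-8')
    return path.encode('utf-16-le')


def from_wide(wide):
    """Hypothetical helper: turn a UTF-16-LE result back into a bytes path."""
    return wide.decode('utf-16-le').encode('utf-8')


name = 'C:\\data\\caf\u00e9.txt'
as_bytes = name.encode('utf-8')

# str and bytes forms of the same path reach the API identically.
same_call = to_wide(as_bytes) == to_wide(name)

# The bytes path survives a trip through the wide form.
round_trip = from_wide(to_wide(as_bytes)) == as_bytes

# The bytes path can still be split at '\' characters.
parts = as_bytes.split(b'\\')
```

Splitting works here because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the byte for '\' can only ever mean '\'.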
(Frankly I don't mind what encoding we use, and I'd be quite happy to
force bytes paths to be UTF-16-LE encoded, which would also round-trip
invalid surrogate pairs. But that would prevent basic manipulation which
seems to be a higher priority.)
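The trade-off in that parenthetical can be sketched in a few lines. This is an illustration only; the sample name is made up, and 'surrogatepass' is used here simply to show that UTF-16-LE can carry a lone surrogate while strict UTF-8 cannot.

```python
# U+5C41 encodes in UTF-16-LE as the bytes 0x41 0x5C, so it contains
# the same byte value as an ASCII backslash.
name = 'a\\\u5c41'

# UTF-8: splitting at the backslash byte gives the two expected components.
utf8_parts = name.encode('utf-8').split(b'\\')

# UTF-16-LE: the same split cuts through code units and yields garbage.
utf16_parts = name.encode('utf-16-le').split(b'\\')

# A lone (invalid) surrogate round-trips through UTF-16-LE...
lone = '\ud800'
wide = lone.encode('utf-16-le', errors='surrogatepass')
restored = wide.decode('utf-16-le', errors='surrogatepass')

# ...but strict UTF-8 refuses to encode it.
try:
    lone.encode('utf-8')
    utf8_strict_ok = True
except UnicodeEncodeError:
    utf8_strict_ok = False
```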
This does not allow you to take bytes from an arbitrary source and
assume that they are correctly encoded for the file system. Python 3.3,
3.4 and 3.5 have been warning that doing that is deprecated and the path
needs to be decoded to a known encoding first. At this stage, it's time
for us to either make byte paths an error, or to specify a suitable
encoding that can correctly round-trip paths.
If this does not answer the question, I'm going to need the question to
be explained more clearly for me.
Cheers,
Steve