[Python-ideas] Fix default encodings on Windows

Thu Aug 18 11:25:33 EDT 2016

Summary for python-dev.

This is the email I'm proposing to take over to the main mailing list to 
get some actual decisions made. As I don't agree with some of the 
possible recommendations, I want to make sure that they're represented 
fairly.

I also want to summarise the background leading to why we should 
consider making a change here at all, rather than simply leaving it 
alone. There's a chance this will all make its way into a PEP, depending 
on how controversial the core team thinks this is.

Please let me know if you think I've misrepresented (or unfairly 
represented) any of the positions, or if you think I can 
simplify/clarify anything in here. Please don't treat this like a PEP 
review - it's just going to be an email to python-dev - but the more we 
can avoid having the discussions there we've already had here the better.

Cheers,
Steve

---

Background
==========

File system paths are almost universally represented as text in some 
encoding determined by the file system. In Python, we expose these paths 
via a number of interfaces, such as the os and io modules. Paths may be 
passed in either direction across these interfaces, that is, from the 
filesystem to the application (for example, os.listdir()), or from the 
application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they 
are either passed through as a bytes blob or converted to/from str using 
sys.getfilesystemencoding(). The result of encoding a string with 
sys.getfilesystemencoding() is a blob of bytes in the native format for 
the default file system.
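
As a rough illustration of that conversion (a sketch only, not part of 
the proposal): os.fsencode() and os.fsdecode() convert between str and 
bytes using sys.getfilesystemencoding(). The file name below is made up 
for the example.

```python
import os
import sys

# Hypothetical file name, used only for illustration.
name = 'example.txt'

# str -> bytes, using sys.getfilesystemencoding()
raw = os.fsencode(name)

# bytes -> str, using the same encoding
text = os.fsdecode(raw)

print(sys.getfilesystemencoding(), raw, text)
```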

On Windows, the native format for the filesystem is utf-16-le. The 
recommended platform APIs for accessing the filesystem all accept and 
return text encoded in this format. However, prior to Windows NT (and 
possibly further back), the native format was a configurable machine 
option and a separate set of APIs existed to accept this format. The 
option (the "active code page") and these APIs (the "*A functions") 
still exist in recent versions of Windows for backwards compatibility, 
though new functionality often only has a utf-16-le API (the "*W 
functions").

In Python, we recommend using str as the default format on Windows 
because it can correctly round-trip all the characters representable in 
utf-16-le. Our support for bytes explicitly uses the *A functions and 
hence the encoding for the bytes is "whatever the active code page is". 
Since the active code page cannot represent all Unicode characters, the 
conversion of a path into bytes can lose information without warning.

As a demonstration of this:

 >>> open('test\uAB00.txt', 'wb').close()
 >>> import glob
 >>> glob.glob('test*')
['test\uab00.txt']
 >>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the result of the second call to glob has been 
replaced with '?', so information has been lost. You can observe the 
same results in os.listdir() or any function that matches its result 
type to the parameter type.
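
The same loss can be sketched on any platform by encoding by hand. This 
assumes cp1252 as a stand-in for the active code page (the real value 
depends on the machine) and uses the 'replace' error handler to mimic 
the quiet substitution.

```python
# cp1252 is an assumption: it stands in for the machine's active code page.
name = 'test\uab00.txt'

# Encoding with 'replace' substitutes '?' for anything the code page
# cannot represent, without raising an error.
lossy = name.encode('cp1252', errors='replace')

# Decoding the bytes again does not bring the original character back.
restored = lossy.decode('cp1252')

print(lossy)
print(restored == name)
```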

Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, it 
remains well known that on Linux and MacOS it is perfectly okay to use 
bytes when taking values from the filesystem and passing them back. 
Doing so also avoids the cost of decoding and reencoding, such that 
(theoretically), code like below should be faster because of the `b'.'`:

 >>> import os
 >>> for f in os.listdir(b'.'):
...     os.stat(f)
...

On Windows, if a filename exists that cannot be encoded with the active 
code page, you will receive an error from the above code. These errors 
are why in Python 3.3 the use of bytes paths on Windows was deprecated 
(listed in the What's New, but not clearly obvious in the documentation 
- more on this later). The above code produces multiple deprecation 
warnings in 3.3, 3.4 and 3.5 on Windows.

However, we still keep seeing libraries use bytes paths, which can cause 
unexpected issues on Windows. Given that the current approach of quietly 
recommending that library developers either write their code twice (once 
for bytes and once for str) or use str exclusively is not working, we 
should consider alternative mitigations.
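
For reference, one possible shape of the "use str exclusively" option is 
to convert at the boundary and work in str from then on. This is only a 
sketch; stat_all is a hypothetical helper name, not something from any 
existing library.

```python
import os


def stat_all(directory):
    """Stat every entry in a directory, using str paths internally.

    Hypothetical helper: accepts str or bytes, but converts to str once
    at the boundary so the rest of the code has a single code path.
    """
    directory = os.fsdecode(directory)
    results = {}
    for name in os.listdir(directory):
        # lstat avoids failing on dangling symlinks.
        results[name] = os.lstat(os.path.join(directory, name))
    return results
```

Callers that hold bytes can still pass them in, but every path that 
reaches the os module is a str.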

Proposals
=========

There are two dimensions here - the fix and the timing. We can basically 
choose any fix and any timing.

The main difference between the fixes is the balance between incorrect 
behaviour and backwards-incompatible behaviour. The main issue with 
respect to timing is whether or not we believe using bytes as paths on 
Windows was correctly deprecated in 3.3 and sufficiently advertised 
since to allow us to change the behaviour in 3.6.

Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a 
meta-encoder that uses the active code page. In reality, our 
implementation uses the *A APIs and we don't explicitly decode bytes in 
order to pass them to the filesystem. This allows the OS to quietly 
replace invalid characters (the equivalent of 'mbcs:replace').

This proposal would remove all use of the *A APIs and only ever call the 
*W APIs. When paths are returned to Python as str, they will be decoded 
from utf-16-le. When paths are to be returned as bytes, we would 
transcode from utf-16-le to utf-8 using surrogatepass. Equally, when 
paths are provided as bytes, they are transcoded from utf-8 into 
utf-16-le and passed to the *W APIs.
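
The conversion can be sketched in pure Python as follows. This only 
illustrates the round trip and is not the code from the patch; the 
sample name includes a lone surrogate, which is the case surrogatepass 
is there to handle.

```python
# utf-16-le data as a *W API might hand it back, including a lone
# surrogate (0xD800) that is not valid Unicode text.
wide = ('test\uab00'.encode('utf-16-le')
        + b'\x00\xd8'
        + '.txt'.encode('utf-16-le'))

# utf-16-le -> str -> utf-8 bytes, as when returning a bytes path.
text = wide.decode('utf-16-le', 'surrogatepass')
narrow = text.encode('utf-8', 'surrogatepass')

# utf-8 bytes -> str -> utf-16-le, as when a bytes path is passed in.
back = narrow.decode('utf-8', 'surrogatepass').encode(
    'utf-16-le', 'surrogatepass')

print(back == wide)
```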

The choice of utf-8 is to ensure the ability to round-trip, while also 
allowing basic manipulation of paths as bytes (basically, locating and 
slicing at '\' characters).
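
A small sketch of why slicing at '\' is safe in utf-8: multi-byte utf-8 
sequences never contain ASCII bytes, so a 0x5C byte is always a real 
backslash. The path is hypothetical, and cp932 is used purely as a 
contrasting example of a code page where this does not hold.

```python
# Hypothetical path containing a non-ASCII directory name.
path = 'C:\\data\\表\\file.txt'

# In utf-8, splitting the bytes at backslashes gives the real components.
utf8_parts = path.encode('utf-8').split(b'\\')
print([part.decode('utf-8') for part in utf8_parts])

# In cp932 the second byte of this character is 0x5C, so the same split
# cuts the character in half and yields an extra, bogus component.
cp932_parts = path.encode('cp932').split(b'\\')
print(len(utf8_parts), len(cp932_parts))
```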

It is debated, but I believe this is not a backwards compatibility issue 
because:
* byte paths in Python are specified as being encoded by 
sys.getfilesystemencoding()
* byte paths on Windows have been deprecated for three versions

Unfortunately, the deprecation is not explicitly called out anywhere in 
the docs apart from the What's New page, so there is an argument that it 
shouldn't be counted despite the warnings in the interpreter. However, 
this is more directly addressed in the discussion of timing below.

Equally, sys.getfilesystemencoding() documents the specific return 
values for various platforms, as well as that it is part of the protocol 
for using bytes to represent filesystem strings.

I believe both of these arguments are invalid, that the only code that 
will break as a result of this change is relying on deprecated 
functionality and not correctly following the encoding contract, and 
that the (probably noisy) breakage that will occur is less bad than the 
silent breakage that currently exists.

As far as implementation goes, there is already a patch for this at 
http://bugs.python.org/issue27781. In short, we update the path 
converter to decode bytes (path->narrow) to Unicode (path->wide) and 
remove all the code that would call *A APIs. In my patch I've changed 
path->narrow to a flag that indicates whether to convert back to bytes 
on return, and also to prevent compilation of code that tries to use 
->narrow as a string on Windows (maybe that will get too annoying for 
contributors? good discussion for the tracker IMHO).

Fix #2: Do the mbcs decoding ourselves

This is essentially the same as fix #1, but instead of changing to utf-8 
we keep mbcs as the encoding.

This approach will allow us to utilise new functionality that is only 
available as *W APIs, and also lets us be more strict about 
encoding/decoding to bytes. For example, rather than silently replacing 
Unicode characters with '?', we could warn or fail the operation, 
potentially modifying that behaviour with an environment variable or flag.
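
A minimal sketch of what stricter encoding could look like, written in 
Python rather than C. Everything named here is hypothetical: the 
function, the environment variable and the use of cp1252 in place of 
the active code page are assumptions for illustration only.

```python
import os
import warnings

# Hypothetical variable name; no such setting exists in CPython.
ENV_VAR = 'PYTHON_EXAMPLE_PATH_ERRORS'


def encode_path(path, codepage='cp1252'):
    """Encode a str path to bytes, failing or warning on lossy results.

    cp1252 stands in for the active code page. By default an
    unencodable character raises UnicodeEncodeError; if ENV_VAR is set
    to 'warn', a warning is issued and '?' is substituted instead.
    """
    mode = os.environ.get(ENV_VAR, 'strict')
    try:
        return path.encode(codepage, 'strict')
    except UnicodeEncodeError:
        if mode == 'warn':
            warnings.warn('path %r cannot be encoded as %s'
                          % (path, codepage))
            return path.encode(codepage, 'replace')
        raise
```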

Compared to fix #1, this will enable some new functionality but will not 
fix any of the problems immediately. New runtime errors may cause some 
problems to be more obvious and lead to fixes, provided library 
maintainers are interested in supporting Windows and adding a separate 
code path to treat filesystem paths as strings.

Fix #3: Make bytes paths on Windows an error

By preventing the use of bytes paths on Windows completely we prevent 
users from hitting encoding issues. However, we do this at the expense 
of usability.

I don't have numbers of libraries that will simply fail on Windows if 
this "fix" is made, but given I've already had people directly email me 
and tell me about their problems we can safely assume it's non-zero.

I'm really not a fan of this fix, because it doesn't actually make 
things better in a practical way, despite being more "pure".

Timing #1: Change it in 3.6

This timing assumes that we believe the deprecation of using bytes for 
paths in Python 3.3 was sufficiently well advertised that we can freely 
make changes in 3.6. A typical deprecation cycle would be two versions 
before removal (though we also often leave things in forever when they 
aren't fundamentally broken), so we have passed that point and 
theoretically can remove or change the functionality without breaking it.

In this case, we would announce in 3.6 that using bytes as paths on 
Windows is no longer deprecated, and that the encoding used is whatever 
is returned by sys.getfilesystemencoding().

Timing #2: Change it in 3.7

This timing assumes that the deprecation in 3.3 was valid, but 
acknowledges that it was not well publicised. For 3.6, we aggressively 
make it known that only strings should be used to represent paths on 
Windows, and that bytes paths are invalid and their behaviour is going 
to change in 3.7. (It has been 
suggested that I could use a keynote at PyCon to publicise this, and 
while I'd totally accept a keynote, I'd hate to subject a crowd to just 
this issue for an hour :) ).

My concern with this approach is that there is no benefit to the change 
at all. If we aggressively publicise the fact that libraries that don't 
handle Unicode paths on Windows properly are using deprecated 
functionality and need to be fixed by 3.7 in order to avoid breaking 
(more precisely - continuing to be broken, but with a different error 
message), then we will alienate non-Windows developers further from the 
platform (net loss for the ecosystem) and convince some to switch to str 
everywhere (net gain for the ecosystem). The latter case removes the 
need to make any change in 3.7 at all, so we would really just be making 
noise about something that people haven't noticed and not necessarily 
going in and fixing anything.

Timing #3: Change it in 3.8

This timing assumes that the deprecation in 3.3 was not sufficient and 
we need to start a new deprecation cycle. This is strengthened by the 
fact that the deprecation announcement does not explicitly include the 
io module or the builtin open() function, and so some developers may 
believe that using bytes for paths with these is okay despite their use 
with the os module being deprecated.

The one upside to this approach is that it would also allow us to change 
locale.getpreferredencoding() to utf-8 on Windows (to affect the default 
behaviour of open(..., 'r') ), which I don't believe is going to be 
possible without a new deprecation cycle. There is a strong argument 
that the following code should also round-trip regardless of platform:

 >>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
 >>> with open('list.txt', 'r') as f:
...     files = list(f)
...

Currently, the default encoding for open() cannot represent all 
filenames that may be returned from listdir(). This may affect makefiles 
and configuration files that contain paths. Currently they will work 
correctly for paths that can be represented in the machine's active code 
page (though it should be noted that the *A APIs may be changed to use 
the OEM code page rather than the active code page, which would also 
break this case).
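
To make the failure concrete, here is a sketch that can be run on any 
platform. It assumes cp1252 as a stand-in for the default encoding that 
open() would pick on a Windows machine, and the file names are made up.

```python
import os
import tempfile

# Hypothetical names, as listdir() might return them.
names = ['plain.txt', 'test\uab00.txt']

with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, 'list.txt')

    # cp1252 (an assumption) cannot represent the second name.
    try:
        with open(listing, 'w', encoding='cp1252') as f:
            for n in names:
                print(n, file=f)
        cp1252_ok = True
    except UnicodeEncodeError:
        cp1252_ok = False

    # With utf-8 the same names can be written and read back intact.
    with open(listing, 'w', encoding='utf-8') as f:
        for n in names:
            print(n, file=f)
    with open(listing, 'r', encoding='utf-8') as f:
        files = [line.rstrip('\n') for line in f]

print(cp1252_ok, files == names)
```

Passing encoding='utf-8' explicitly is what code has to do today to get 
this behaviour regardless of the default.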

Possibly resolving both issues simultaneously is worth waiting for two 
more releases? I'm not convinced the change to getfilesystemencoding() 
needs to wait for getpreferredencoding() to also change, or that they 
necessarily need to match, but it would not be hugely surprising to see 
the changes bundled together.

I'll also note that there has been no discussion about changing 
getpreferredencoding() so far, though there have been a number of "+1" 
votes alongside some "+1 with significant concerns" votes. Changing the 
default encoding of the contents of data files is pretty scary, so I'm 
not in any rush to force it in.

Acknowledgements
================

Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for 
their significant contributions and willingness to engage, and to 
everyone else on python-ideas for contributing to the discussion.