[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Thu Aug 18 11:25:33 EDT 2016
Summary for python-dev.
This is the email I'm proposing to take over to the main mailing list to
get some actual decisions made. As I don't agree with some of the
possible recommendations, I want to make sure that they're represented
fairly.
I also want to summarise the background leading to why we should
consider making a change here at all, rather than simply leaving it
alone. There's a chance this will all make its way into a PEP, depending
on how controversial the core team thinks this is.
Please let me know if you think I've misrepresented (or unfairly
represented) any of the positions, or if you think I can
simplify/clarify anything in here. Please don't treat this like a PEP
review - it's just going to be an email to python-dev - but the more we
can avoid repeating there the discussions we've already had here, the better.
Cheers,
Steve
---
Background
==========
File system paths are almost universally represented as text in some
encoding determined by the file system. In Python, we expose these paths
via a number of interfaces, such as the os and io modules. Paths may be
passed either direction across these interfaces, that is, from the
filesystem to the application (for example, os.listdir()), or from the
application to the filesystem (for example, os.unlink()).
When paths are passed between the filesystem and the application, they
are either passed through as a bytes blob or converted to/from str using
sys.getfilesystemencoding(). The result of encoding a string with
sys.getfilesystemencoding() is a blob of bytes in the native format for
the default file system.
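As a rough illustration of that conversion (a sketch only, not part of
the proposal), os.fsencode() and os.fsdecode() apply
sys.getfilesystemencoding() in each direction. The file name is made up
and kept ASCII so the result is the same under any of the usual encodings.

```python
import os
import sys

# Hypothetical file name, chosen only for illustration.
name = 'example.txt'

# str -> bytes: the conversion applied when a str path must become bytes.
raw = os.fsencode(name)

# bytes -> str: the conversion applied when a bytes path must become str.
text = os.fsdecode(raw)

print(sys.getfilesystemencoding(), raw, text)
```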
On Windows, the native format for the filesystem is utf-16-le. The
recommended platform APIs for accessing the filesystem all accept and
return text encoded in this format. However, prior to Windows NT (and
possibly further back), the native format was a configurable machine
option and a separate set of APIs existed to accept this format. The
option (the "active code page") and these APIs (the "*A functions")
still exist in recent versions of Windows for backwards compatibility,
though new functionality often only has a utf-16-le API (the "*W
functions").
In Python, we recommend using str as the default format on Windows
because it can correctly round-trip all the characters representable in
utf-16-le. Our support for bytes explicitly uses the *A functions and
hence the encoding for the bytes is "whatever the active code page is".
Since the active code page cannot represent all Unicode characters, the
conversion of a path into bytes can lose information without warning.
As a demonstration of this:
>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']
In the result of the second call to glob, the Unicode character has been
replaced with '?', so information is missing.
You can observe the same results in os.listdir() or any function that
matches its result type to the parameter type.
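The same loss can be reproduced on any platform with a plain codec. This
is only a sketch: 'cp1252' is an assumed stand-in for the active code
page (the real value depends on the machine), and errors='replace'
imitates the OS quietly substituting '?'.

```python
# Assumption: 'cp1252' stands in for the machine's active code page.
ACTIVE_CODE_PAGE = 'cp1252'

name = 'test\uab00.txt'

# errors='replace' imitates the silent substitution done by the OS.
lossy = name.encode(ACTIVE_CODE_PAGE, errors='replace')
restored = lossy.decode(ACTIVE_CODE_PAGE)

print(lossy)             # b'test?.txt'
print(restored == name)  # False: the original character is gone
```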
Why is this a problem?
======================
While the obvious and correct answer is to just use str everywhere, it
remains well known that on Linux and MacOS it is perfectly okay to use
bytes when taking values from the filesystem and passing them back.
Doing so also avoids the cost of decoding and reencoding, such that
(theoretically), code like below should be faster because of the `b'.'`:
>>> import os
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...
On Windows, if a filename exists that cannot be encoded with the active
code page, you will receive an error from the above code. These errors
are why in Python 3.3 the use of bytes paths on Windows was deprecated
(listed in the What's New, but not clearly obvious in the documentation
- more on this later). The above code produces multiple deprecation
warnings in 3.3, 3.4 and 3.5 on Windows.
However, we still keep seeing libraries use bytes paths, which can cause
unexpected issues on Windows. Given that the current approach of quietly
recommending that library developers either write their code twice (once
for bytes and once for str) or use str exclusively is not working, we
should consider alternative mitigations.
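For reference, the "use str exclusively" option might look something
like the sketch below: decode once at the boundary and work in str from
then on. list_names() is a hypothetical helper, not an existing API, and
it only helps when the bytes passed in can themselves be decoded.

```python
import os
import tempfile

def list_names(directory):
    """Return directory entries as str, whether given a str or bytes path."""
    # os.fsdecode() returns str unchanged and decodes bytes using the
    # filesystem encoding.
    return sorted(os.listdir(os.fsdecode(directory)))

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'a.txt'), 'wb').close()
    from_str = list_names(tmp)
    from_bytes = list_names(os.fsencode(tmp))

print(from_str, from_bytes)
```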
Proposals
=========
There are two dimensions here - the fix and the timing. We can basically
choose any fix and any timing.
The main differences between the fixes are the balance between incorrect
behaviour and backwards-incompatible behaviour. The main issue with
respect to timing is whether or not we believe using bytes as paths on
Windows was correctly deprecated in 3.3 and sufficiently advertised
since to allow us to change the behaviour in 3.6.
Fixes
-----
Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows
Currently the default filesystem encoding is 'mbcs', which is a
meta-encoder that uses the active code page. In reality, our
implementation uses the *A APIs and we don't explicitly decode bytes in
order to pass them to the filesystem. This allows the OS to quietly
replace invalid characters (the equivalent of 'mbcs:replace').
This proposal would remove all use of the *A APIs and only ever call the
*W APIs. When paths are returned to Python as str, they will be decoded
from utf-16-le. When paths are to be returned as bytes, we would
transcode from utf-16-le to utf-8 using surrogatepass. Equally, when
paths are provided as bytes, they are transcoded from utf-8 into
utf-16-le and passed to the *W APIs.
The choice of utf-8 is to ensure the ability to round-trip, while also
allowing basic manipulation of paths as bytes (basically, locating and
slicing at '\' characters).
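A small sketch of both properties, using plain codecs rather than the
actual patch. The path is hypothetical; it includes a lone surrogate to
show why surrogatepass is needed for the round-trip.

```python
# Hypothetical path: one character most code pages lack, plus a lone
# surrogate that strict utf-8 would refuse to encode.
path = 'dir\\test\uab00\udc80.txt'

# str -> bytes -> str with surrogatepass loses nothing.
raw = path.encode('utf-8', errors='surrogatepass')
round_tripped = raw.decode('utf-8', errors='surrogatepass')

# The byte 0x5C never occurs inside a multi-byte utf-8 sequence, so
# splitting the bytes at the separator is safe.
head, _, tail = raw.rpartition(b'\\')

print(round_tripped == path)  # True
print(head)                   # b'dir'
```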
It is debated, but I believe this is not a backwards compatibility issue
because:
* byte paths in Python are specified as being encoded by
sys.getfilesystemencoding()
* byte paths on Windows have been deprecated for three versions
Unfortunately, the deprecation is not explicitly called out anywhere in
the docs apart from the What's New page, so there is an argument that it
shouldn't be counted despite the warnings in the interpreter. However,
this is more directly addressed in the discussion of timing below.
Equally, sys.getfilesystemencoding() documents the specific return
values for various platforms, as well as that it is part of the protocol
for using bytes to represent filesystem strings.
I believe both of these arguments are invalid, that the only code that
will break as a result of this change is relying on deprecated
functionality and not correctly following the encoding contract, and
that the (probably noisy) breakage that will occur is less bad than the
silent breakage that currently exists.
As far as implementation goes, there is already a patch for this at
http://bugs.python.org/issue27781. In short, we update the path
converter to decode bytes (path->narrow) to Unicode (path->wide) and
remove all the code that would call *A APIs. In my patch I've changed
path->narrow to a flag that indicates whether to convert back to bytes
on return, and also to prevent compilation of code that tries to use
->narrow as a string on Windows (maybe that will get too annoying for
contributors? good discussion for the tracker IMHO).
Fix #2: Do the mbcs decoding ourselves
This is essentially the same as fix #1, but instead of changing to utf-8
we keep mbcs as the encoding.
This approach will allow us to utilise new functionality that is only
available as *W APIs, and also lets us be more strict about
encoding/decoding to bytes. For example, rather than silently replacing
Unicode characters with '?', we could warn or fail the operation,
potentially modifying that behaviour with an environment variable or flag.
Compared to fix #1, this will enable some new functionality but will not
fix any of the problems immediately. New runtime errors may cause some
problems to be more obvious and lead to fixes, provided library
maintainers are interested in supporting Windows and adding a separate
code path to treat filesystem paths as strings.
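To make the "warn or fail" idea concrete, here is a minimal sketch. It
is not the proposed implementation: 'cp1252' is an assumed stand-in for
the active code page, and both encode_path() and its strict parameter
are hypothetical, with strict standing in for the environment variable
or flag.

```python
import warnings

# Assumption: 'cp1252' stands in for the machine's active code page.
ACTIVE_CODE_PAGE = 'cp1252'

def encode_path(path, strict=True):
    """Encode a str path, failing or warning instead of silently losing data."""
    try:
        return path.encode(ACTIVE_CODE_PAGE)
    except UnicodeEncodeError:
        if strict:
            # Fail the operation outright.
            raise
        # Otherwise warn, then fall back to the lossy '?' replacement.
        warnings.warn('path %r is not representable in %s'
                      % (path, ACTIVE_CODE_PAGE))
        return path.encode(ACTIVE_CODE_PAGE, errors='replace')

print(encode_path('test.txt'))  # b'test.txt'
```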
Fix #3: Make bytes paths on Windows an error
By preventing the use of bytes paths on Windows completely we prevent
users from hitting encoding issues. However, we do this at the expense
of usability.
I don't have numbers for how many libraries will simply fail on Windows
if this "fix" is made, but given I've already had people directly email
me and tell me about their problems we can safely assume it's non-zero.
I'm really not a fan of this fix, because it doesn't actually make
things better in a practical way, despite being more "pure".
Timing #1: Change it in 3.6
This timing assumes that we believe the deprecation of using bytes for
paths in Python 3.3 was sufficiently well advertised that we can freely
make changes in 3.6. A typical deprecation cycle would be two versions
before removal (though we also often leave things in forever when they
aren't fundamentally broken), so we have passed that point and
theoretically can remove or change the functionality without breaking it.
In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated, and that the encoding used is whatever
is returned by sys.getfilesystemencoding().
Timing #2: Change it in 3.7
This timing assumes that the deprecation in 3.3 was valid, but
acknowledges that it was not well publicised. For 3.6, we aggressively
make it known that only strings should be used to represent paths on
Windows and bytes are invalid and going to change in 3.7. (It has been
suggested that I could use a keynote at PyCon to publicise this, and
while I'd totally accept a keynote, I'd hate to subject a crowd to just
this issue for an hour :) ).
My concern with this approach is that there is no benefit to the change
at all. If we aggressively publicise the fact that libraries that don't
handle Unicode paths on Windows properly are using deprecated
functionality and need to be fixed by 3.7 in order to avoid breaking
(more precisely - continuing to be broken, but with a different error
message), then we will alienate non-Windows developers further from the
platform (net loss for the ecosystem) and convince some to switch to str
everywhere (net gain for the ecosystem). The latter case removes the
need to make any change in 3.7 at all, so we would really just be making
noise about something that people haven't noticed and not necessarily
going in and fixing anything.
Timing #3: Change it in 3.8
This timing assumes that the deprecation in 3.3 was not sufficient and
we need to start a new deprecation cycle. This is strengthened by the
fact that the deprecation announcement does not explicitly include the
io module or the builtin open() function, and so some developers may
believe that using bytes for paths with these is okay despite the os
module being deprecated.
The one upside to this approach is that it would also allow us to change
locale.getpreferredencoding() to utf-8 on Windows (to affect the default
behaviour of open(..., 'r') ), which I don't believe is going to be
possible without a new deprecation cycle. There is a strong argument
that the following code should also round-trip regardless of platform:
>>> import os
>>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
>>> with open('list.txt', 'r') as f:
...     files = list(f)
...
Currently, the default encoding for open() cannot represent all
filenames that may be returned from listdir(). This may affect makefiles
and configuration files that contain paths. Currently they will work
correctly for paths that can be represented in the machine's active code
page (though it should be noted that the *A APIs may be changed to use
the OEM code page rather than the active code page, which would also
break this case).
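For comparison, a sketch of what code can already do today to get a
reliable round-trip: pass encoding='utf-8' explicitly rather than
relying on the default. The names are hypothetical and are written to a
temporary file, so no files with those names are created.

```python
import os
import tempfile

# Hypothetical names; the second is not representable in most code pages.
names = ['plain.txt', 'test\uab00.txt']

with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, 'list.txt')

    # An explicit encoding sidesteps the platform default.
    with open(listing, 'w', encoding='utf-8') as f:
        for name in names:
            print(name, file=f)

    with open(listing, 'r', encoding='utf-8') as f:
        read_back = [line.rstrip('\n') for line in f]

print(read_back == names)  # True
```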
Possibly resolving both issues simultaneously is worth waiting for two
more releases? I'm not convinced the change to getfilesystemencoding()
needs to wait for getpreferredencoding() to also change, or that they
necessarily need to match, but it would not be hugely surprising to see
the changes bundled together.
I'll also note that there has been no discussion about changing
getpreferredencoding() so far, though there have been a number of "+1"
votes alongside some "+1 with significant concerns" votes. Changing the
default encoding of the contents of data files is pretty scary, so I'm
not in any rush to force it in.
Acknowledgements
================
Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for
their significant contributions and willingness to engage, and to
everyone else on python-ideas for contributing to the discussion.