Summary for python-dev.

This is the email I'm proposing to take over to the main mailing list to get some actual decisions made. As I don't agree with some of the possible recommendations, I want to make sure that they're represented fairly.

I also want to summarise the background leading to why we should consider making a change here at all, rather than simply leaving it alone. There's a chance this will all make its way into a PEP, depending on how controversial the core team thinks this is.

Please let me know if you think I've misrepresented (or unfairly represented) any of the positions, or if you think I can simplify/clarify anything in here. Please don't treat this like a PEP review - it's just going to be an email to python-dev - but the more we can avoid repeating there the discussions we've already had here, the better.

Cheers,
Steve

---

Background
==========

File system paths are almost universally represented as text in some encoding determined by the file system. In Python, we expose these paths via a number of interfaces, such as the os and io modules. Paths may be passed in either direction across these interfaces, that is, from the filesystem to the application (for example, os.listdir()), or from the application to the filesystem (for example, os.unlink()).

When paths are passed between the filesystem and the application, they are either passed through as a bytes blob or converted to/from str using sys.getfilesystemencoding(). The result of encoding a string with sys.getfilesystemencoding() is a blob of bytes in the native format for the default file system.

On Windows, the native format for the filesystem is utf-16-le. The recommended platform APIs for accessing the filesystem all accept and return text encoded in this format. However, prior to Windows NT (and possibly further back), the native format was a configurable machine option and a separate set of APIs existed to accept this format.
The option (the "active code page") and these APIs (the "*A functions") still exist in recent versions of Windows for backwards compatibility, though new functionality often only has a utf-16-le API (the "*W functions").

In Python, we recommend using str as the default format on Windows because it can correctly round-trip all the characters representable in utf-16-le. Our support for bytes explicitly uses the *A functions and hence the encoding for the bytes is "whatever the active code page is". Since the active code page cannot represent all Unicode characters, the conversion of a path into bytes can lose information without warning.

As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

In the result of the second call to glob, the Unicode character has been replaced with '?', so information has been lost. You can observe the same results in os.listdir() or any function that matches its result type to the parameter type.

Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, it remains well known that on Linux and macOS it is perfectly okay to use bytes when taking values from the filesystem and passing them back. Doing so also avoids the cost of decoding and reencoding, such that (theoretically) code like the below should be faster because of the `b'.'`:

>>> import os
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...

On Windows, if a filename exists that cannot be encoded with the active code page, you will receive an error from the above code. These errors are why in Python 3.3 the use of bytes paths on Windows was deprecated (listed in the What's New, but not clearly obvious in the documentation - more on this later). The above code produces multiple deprecation warnings in 3.3, 3.4 and 3.5 on Windows.

However, we still keep seeing libraries use bytes paths, which can cause unexpected issues on Windows. Given that the current approach of quietly recommending that library developers either write their code twice (once for bytes and once for str) or use str exclusively is not working, we should consider alternative mitigations.

Proposals
=========

There are two dimensions here - the fix and the timing. We can basically choose any fix and any timing.

The main differences between the fixes are the balance between incorrect behaviour and backwards-incompatible behaviour. The main issue with respect to timing is whether or not we believe using bytes as paths on Windows was correctly deprecated in 3.3 and sufficiently advertised since to allow us to change the behaviour in 3.6.

Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a meta-encoder that uses the active code page. In reality, our implementation uses the *A APIs and we don't explicitly decode bytes in order to pass them to the filesystem. This allows the OS to quietly replace invalid characters (the equivalent of 'mbcs:replace').

This proposal would remove all use of the *A APIs and only ever call the *W APIs. When paths are returned to Python as str, they will be decoded from utf-16-le. When paths are to be returned as bytes, they will be converted from utf-16-le to utf-8 using surrogatepass. Equally, when paths are provided as bytes, they are converted from utf-8 into utf-16-le and passed to the *W APIs.
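To make the conversion described for fix #1 concrete, here is a rough pure-Python sketch of the bytes/str mapping. It is only a model: the helper names are hypothetical, the real conversion would happen in CPython's C path converter, and cp1252 is used as a stand-in for whatever the active code page happens to be.

```python
# Sketch only: a pure-Python model of the bytes <-> str path conversion that
# fix #1 describes. The helper names are hypothetical; the real conversion
# would happen in CPython's C path converter, not in Python code.

def path_to_bytes(path: str) -> bytes:
    """Encode a str path as utf-8, letting lone surrogates through."""
    return path.encode("utf-8", errors="surrogatepass")


def path_from_bytes(raw: bytes) -> str:
    """Decode a bytes path produced by path_to_bytes()."""
    return raw.decode("utf-8", errors="surrogatepass")


name = "test\uab00.txt"

# Assumption: cp1252 stands in for the machine's active code page.
# The unrepresentable character is silently replaced, so data is lost.
lossy = name.encode("cp1252", errors="replace")
print(lossy)                                          # b'test?.txt'

# utf-8 can represent every character, so the name survives the round trip.
print(path_from_bytes(path_to_bytes(name)) == name)   # True

# Windows filenames may contain unpaired surrogates, which strict utf-8
# rejects; surrogatepass lets them round-trip as well.
odd = "bad\udc80.txt"
print(path_from_bytes(path_to_bytes(odd)) == odd)     # True
```

The lossy result cannot be mapped back to the original name, whereas both utf-8 results can, which is the property the proposal relies on.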
The choice of utf-8 is to ensure the ability to round-trip, while also allowing basic manipulation of paths as bytes (basically, locating and slicing at '\' characters).

It is debated, but I believe this is not a backwards compatibility issue because:

* byte paths in Python are specified as being encoded by sys.getfilesystemencoding()
* byte paths on Windows have been deprecated for three versions

Unfortunately, the deprecation is not explicitly called out anywhere in the docs apart from the What's New page, so there is an argument that it shouldn't be counted despite the warnings in the interpreter. However, this is more directly addressed in the discussion of timing below.

Equally, sys.getfilesystemencoding() documents the specific return values for various platforms, as well as that it is part of the protocol for using bytes to represent filesystem strings.

I believe both of these arguments are invalid, that the only code that will break as a result of this change is relying on deprecated functionality and not correctly following the encoding contract, and that the (probably noisy) breakage that will occur is less bad than the silent breakage that currently exists.

As far as implementation goes, there is already a patch for this at http://bugs.python.org/issue27781. In short, we update the path converter to decode bytes (path->narrow) to Unicode (path->wide) and remove all the code that would call *A APIs. In my patch I've changed path->narrow to a flag that indicates whether to convert back to bytes on return, and also to prevent compilation of code that tries to use ->narrow as a string on Windows (maybe that will get too annoying for contributors? good discussion for the tracker IMHO).

Fix #2: Do the mbcs decoding ourselves

This is essentially the same as fix #1, but instead of changing to utf-8 we keep mbcs as the encoding.
This approach will allow us to utilise new functionality that is only available as *W APIs, and also lets us be more strict about encoding/decoding to bytes. For example, rather than silently replacing Unicode characters with '?', we could warn or fail the operation, potentially modifying that behaviour with an environment variable or flag.

Compared to fix #1, this will enable some new functionality but will not fix any of the problems immediately. New runtime errors may cause some problems to be more obvious and lead to fixes, provided library maintainers are interested in supporting Windows and adding a separate code path to treat filesystem paths as strings.

Fix #3: Make bytes paths on Windows an error

By preventing the use of bytes paths on Windows completely we prevent users from hitting encoding issues. However, we do this at the expense of usability.

I don't have numbers for how many libraries will simply fail on Windows if this "fix" is made, but given I've already had people directly email me and tell me about their problems, we can safely assume it's non-zero.

I'm really not a fan of this fix, because it doesn't actually make things better in a practical way, despite being more "pure".

Timing #1: Change it in 3.6

This timing assumes that we believe the deprecation of using bytes for paths in Python 3.3 was sufficiently well advertised that we can freely make changes in 3.6. A typical deprecation cycle would be two versions before removal (though we also often leave things in forever when they aren't fundamentally broken), so we have passed that point and theoretically can remove or change the functionality without breaking it.

In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated, and that the encoding used is whatever is returned by sys.getfilesystemencoding().

Timing #2: Change it in 3.7

This timing assumes that the deprecation in 3.3 was valid, but acknowledges that it was not well publicised.
For 3.6, we aggressively make it known that only strings should be used to represent paths on Windows, and that bytes are invalid and their behaviour is going to change in 3.7. (It has been suggested that I could use a keynote at PyCon to publicise this, and while I'd totally accept a keynote, I'd hate to subject a crowd to just this issue for an hour :) ).

My concern with this approach is that there is no benefit to the change at all. If we aggressively publicise the fact that libraries that don't handle Unicode paths on Windows properly are using deprecated functionality and need to be fixed by 3.7 in order to avoid breaking (more precisely - continuing to be broken, but with a different error message), then we will alienate non-Windows developers further from the platform (net loss for the ecosystem) and convince some to switch to str everywhere (net gain for the ecosystem). The latter case removes the need to make any change in 3.7 at all, so we would really just be making noise about something that people haven't noticed and not necessarily going in and fixing anything.

Timing #3: Change it in 3.8

This timing assumes that the deprecation in 3.3 was not sufficient and we need to start a new deprecation cycle. This is strengthened by the fact that the deprecation announcement does not explicitly include the io module or the builtin open() function, and so some developers may believe that using bytes for paths with these is okay despite the os module being deprecated.

The one upside to this approach is that it would also allow us to change locale.getpreferredencoding() to utf-8 on Windows (to affect the default behaviour of open(..., 'r') ), which I don't believe is going to be possible without a new deprecation cycle. There is a strong argument that the following code should also round-trip regardless of platform:

>>> import os
>>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
>>> with open('list.txt', 'r') as f:
...     files = list(f)
...

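For comparison, a round-trip like this can be made to hold by naming the encoding explicitly rather than relying on the default. The sketch below is only an illustration and not something proposed in this thread: the helper names and sample filenames are made up, cp1252 stands in for a typical default encoding on Windows, and it assumes filenames contain neither newlines nor unpaired surrogates.

```python
import os
import tempfile


def save_names(names, list_path):
    """Write one name per line, always as utf-8 (hypothetical helper)."""
    with open(list_path, "w", encoding="utf-8") as f:
        for name in names:
            print(name, file=f)


def load_names(list_path):
    """Read the names back, stripping the trailing newline from each line."""
    with open(list_path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Sample names stand in for the result of os.listdir('.').
names = ["plain.txt", "test\uab00.txt"]

# Assumption: cp1252 stands in for the default encoding used by open().
try:
    names[1].encode("cp1252")
    default_can_encode = True
except UnicodeEncodeError:
    default_can_encode = False

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "list.txt")
    save_names(names, list_path)
    round_tripped = load_names(list_path)

print(default_can_encode)        # False
print(round_tripped == names)    # True
```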
Currently, the default encoding for open() cannot represent all filenames that may be returned from listdir(). This may affect makefiles and configuration files that contain paths. Currently they will work correctly for paths that can be represented in the machine's active code page (though it should be noted that the *A APIs may be changed to use the OEM code page rather than the active code page, which would also break this case).

Possibly resolving both issues simultaneously is worth waiting for two more releases? I'm not convinced the change to getfilesystemencoding() needs to wait for getpreferredencoding() to also change, or that they necessarily need to match, but it would not be hugely surprising to see the changes bundled together.

I'll also note that there has been no discussion about changing getpreferredencoding() so far, though there have been a number of "+1" votes alongside some "+1 with significant concerns" votes. Changing the default encoding of the contents of data files is pretty scary, so I'm not in any rush to force it in.

Acknowledgements
================

Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for their significant contributions and willingness to engage, and to everyone else on python-ideas for contributing to the discussion.