[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Tue Aug 16 11:56:57 EDT 2016
I just want to clearly address two points, since I feel like multiple
posts have been unclear on them.
1. The bytes API was deprecated in 3.3, and the deprecation is listed in
https://docs.python.org/3/whatsnew/3.3.html. The lack of mention
elsewhere in the docs is an unfortunate oversight, but it was certainly
announced and the warning has been there for three released versions. We
can freely change or remove the support now, IMHO.
2. Windows file system encoding is *always* UTF-16. There's no "assuming
mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what
encoding it is". We know exactly what the encoding is on every supported
version of Windows. UTF-16.
This discussion is for the developers who insist on using bytes for
paths within Python, and the question is, "how do we best represent
UTF-16 encoded paths in bytes?"
The choices are:
* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
Currently we have the second option.
My preference is the fourth option, as it will cause the least breakage
of existing code and enable the most code to just work in the presence
of non-ACP characters.
The fifth option is the best for round-tripping within Windows APIs.
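To make the conversion choices concrete, here is a rough sketch in
portable Python. It rests on two assumptions that are not in the list
above: cp1252 stands in for the (legacy) active code page, and UTF-8 is
the target encoding for the fourth option. The paths are made up for
illustration, and this only mimics the encoding step, not what the
Windows APIs do.

```python
# Hypothetical path with a character that cp1252 cannot represent.
path = "C:\\data\\\u03c0.txt"  # U+03C0, Greek small letter pi

# Assumption: cp1252 stands in for the (legacy) active code page.
ACP = "cp1252"

# Option 2: convert and drop. The unrepresentable character is replaced
# by "?", so the original name cannot be recovered from the bytes.
lossy = path.encode(ACP, errors="replace")

# Option 3: convert and fail on characters not in the code page.
try:
    path.encode(ACP, errors="strict")
    acp_strict_failed = False
except UnicodeEncodeError:
    acp_strict_failed = True

# Option 4 (assuming UTF-8): any well-formed text converts and
# round-trips...
utf8 = path.encode("utf-8")

# ...and only an unpaired surrogate makes the conversion fail.
lone = "C:\\data\\\ud800.txt"  # hypothetical name with a lone surrogate
try:
    lone.encode("utf-8")
    lone_failed = False
except UnicodeEncodeError:
    lone_failed = True

# Option 5: keep the UTF-16-LE code units as bytes. This round-trips,
# but every ASCII character is followed by a zero byte.
utf16 = path.encode("utf-16-le")
```

With these inputs, only the code page conversions lose or reject the
name; the UTF-8 and UTF-16-LE forms both decode back to the original
string.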
The only code that will break with any change is code that was using an
already deprecated API. Code that correctly uses str to represent
"encoding agnostic text" is unaffected.
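For anyone unsure which calls count as "the bytes API", here is a small
sketch of the two forms, using os.listdir as one example. The temporary
directory and the file name are made up for illustration, and the name
is kept ASCII so the result does not depend on the platform's encoding.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a hypothetical, ASCII-only name.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # str in, str out: no encoding is involved on the Python side.
    str_names = os.listdir(tmp)

    # bytes in, bytes out: this is the form under discussion, where
    # some encoding has to be chosen to represent the names.
    bytes_names = os.listdir(os.fsencode(tmp))
```

Code written in the first form never sees the encoding question at all,
which is why it is unaffected by any of the choices.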
If you see an alternative choice to those listed above, feel free to
contribute it. Otherwise, can we focus the discussion on these (or any
new) choices?
Cheers,
Steve