[Python-ideas] Fix default encodings on Windows

Tue Aug 16 11:56:57 EDT 2016

I just want to clearly address two points, since I feel like multiple 
posts have been unclear on them.

1. The bytes API for paths on Windows was deprecated in 3.3, and the 
deprecation is listed in https://docs.python.org/3/whatsnew/3.3.html. 
The lack of a mention elsewhere in the docs is an unfortunate oversight, 
but the deprecation was certainly announced and the warning has been 
there for three released versions. We can freely change or remove the 
support now, IMHO.

2. Windows file system encoding is *always* UTF-16. There's no "assuming 
mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what 
encoding it is". We know exactly what the encoding is on every supported 
version of Windows. UTF-16.

This discussion is for the developers who insist on using bytes for 
paths within Python, and the question is, "how do we best represent 
UTF-16 encoded paths in bytes?"

The choices are:

* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

Currently we have the second option.
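
To make the choices concrete, here is a rough sketch that uses Python's 
own codecs to imitate options two through five. It is an illustration 
only, not what the os module does internally: cp1252 stands in for the 
active code page, the file name is made up, and reading option four as 
"convert to UTF-8" is an assumption.

```python
# Hypothetical file name: 'é' exists in cp1252, the CJK character does not.
name = "caf\u00e9_\u4e2d.txt"

# Option 2: convert and drop (here: replace with '?') non-ACP characters.
dropped = name.encode("cp1252", errors="replace")

# Option 3: convert and fail on non-ACP characters.
try:
    name.encode("cp1252")
    acp_strict_failed = False
except UnicodeEncodeError:
    acp_strict_failed = True

# Option 4 (assumed to mean UTF-8): every valid name converts, and only
# an unpaired surrogate makes the conversion fail.
as_utf8 = name.encode("utf-8")
try:
    "bad\ud800name".encode("utf-8")
    lone_surrogate_failed = False
except UnicodeEncodeError:
    lone_surrogate_failed = True

# Option 5: raw UTF-16-LE, two bytes per code unit, so ASCII characters
# carry an embedded NUL byte.
as_utf16 = name.encode("utf-16-le")

print(dropped)                # the CJK character has become '?'
print(acp_strict_failed)      # True
print(as_utf8)
print(lone_surrogate_failed)  # True
print(as_utf16)
```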

My preference is the fourth option, as it will cause the least breakage 
of existing code and enable the most code to just work in the presence 
of non-ACP characters.

The fifth option is the best for round-tripping within Windows APIs.
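
As a sketch of that trade-off: the two helpers below are hypothetical 
(they are not part of any Python API), and they pass "surrogatepass" 
because Python's utf-16-le codec otherwise rejects unpaired surrogates, 
which Windows itself accepts in file names. The round trip is lossless, 
but ordinary byte-oriented path handling no longer works on the result.

```python
def to_wide_bytes(path):
    # Hypothetical helper: str -> UTF-16-LE bytes, keeping lone surrogates.
    return path.encode("utf-16-le", errors="surrogatepass")


def from_wide_bytes(raw):
    # Hypothetical helper: the inverse of to_wide_bytes().
    return raw.decode("utf-16-le", errors="surrogatepass")


# A made-up path containing an unpaired surrogate.
odd = "dir\\bad\ud800name"
raw = to_wide_bytes(odd)

# Lossless round trip, even for the unpaired surrogate.
round_trips = from_wide_bytes(raw) == odd

# Splitting on a single-byte separator leaves the second piece starting
# with the NUL half of the separator's code unit, so it is misaligned.
parts = raw.split(b"\\")
misaligned = parts[1].startswith(b"\x00")

print(round_trips)  # True
print(misaligned)   # True
```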

The only code that will break with any change is code that was using an 
already deprecated API. Code that correctly uses str to represent 
"encoding agnostic text" is unaffected.

If you see an alternative choice to those listed above, feel free to 
contribute it. Otherwise, can we focus the discussion on these (or any 
new) choices?

Cheers,
Steve