
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
> I just want to clearly address two points, since I feel like multiple posts have been unclear on them.
>
> 1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.
For clarity, the statement was: """issue 13374: The Windows bytes API has been deprecated in the os module. Use Unicode filenames, instead of bytes filenames, to not depend on the ANSI code page anymore and to support any filename."""

First of all, note that I'm perfectly OK with deprecating bytes paths. However, this statement specifically does *not* say anything about use of bytes paths outside of the os module (builtin open and the io module being the obvious places).

Secondly, it appears that unfortunately the main Python documentation wasn't updated to state this. So while "we can freely change or remove the support now" may be true, it's not that simple - the debate here is at least in part about builtin open, and there's nothing anywhere that I can see that states that bytes support in open has been deprecated. Maybe there should have been, and maybe everyone involved at the time assumed that it was, but that's water under the bridge.
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
>
> This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
People passing bytes to open() have, in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code. It is, of course, also true that "works for me in my environment" is a viable strategy - but the maintenance cost of that strategy if things change (whether in Python or in the environment) falls on the application developers - they are hoping that cost is minimal, but that's a risk they choose to take.
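A minimal sketch of what "decode at the boundary" looks like in practice - normalize_path is an illustrative helper of my own invention, not an existing API; os.fsdecode accepts both bytes and str and always hands back str, using the filesystem encoding, so the rest of the program never sees bytes paths:

```python
import os

# Sketch of "decode at the boundary": convert any incoming path-like
# value to str once, on entry, and work in str everywhere afterwards.
# os.fsdecode uses the filesystem encoding, so this behaves sensibly
# on both POSIX and Windows.
def normalize_path(p):
    """Accept bytes or str; always return str for internal use."""
    return os.fsdecode(p)

print(normalize_path(b"data.txt"))  # 'data.txt'
print(normalize_path("data.txt"))   # 'data.txt'
```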
> The choices are:
>
> * don't represent them at all (remove bytes API)
> * convert and drop characters not in the (legacy) active code page
> * convert and fail on characters not in the (legacy) active code page
> * convert and fail on invalid surrogate pairs
> * represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
Actually, with the exception of the last one (which seems "obviously not sensible"), these all feel to me more like answers to the question "how do we best interpret bytes provided to us as UTF-16?". It's a subtle point, but IMO an important one. It's much easier to answer the question you posed, but what people are actually concerned about is interpreting bytes, not representing Unicode. The correct answer to "how do we interpret bytes" is "in the face of ambiguity, refuse to guess" - but people using the bytes API have *already* bought into the current heuristic for guessing, so changing it affects them.
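For what it's worth, options 2 and 3 differ only in the error handling applied during the conversion. A minimal sketch, using cp1252 as a stand-in for the legacy active code page (the real codec on Windows is "mbcs", which only exists there):

```python
# Options 2 and 3 from the list above, illustrated with cp1252 as a
# stand-in for the legacy active code page.
path = "caf\u00e9\u2603.txt"  # é is in cp1252; the snowman is not

# Option 2: convert, dropping/replacing characters not in the code page
lossy = path.encode("cp1252", errors="replace")
print(lossy)  # b'caf\xe9?.txt' -- the snowman silently becomes '?'

# Option 3: convert and fail on characters not in the code page
try:
    path.encode("cp1252", errors="strict")
except UnicodeEncodeError as exc:
    print("option 3 refuses:", exc.reason)
```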
> Currently we have the second option.
>
> My preference is the fourth option, as it will cause the least breakage of existing code and enable the most amount of code to just work in the presence of non-ACP characters.
It changes the encoding used to interpret bytes. While it preserves more information in the "UTF-16 to bytes" direction, nobody really cares about that direction. And in the "bytes to UTF-16" direction, it changes the interpretation of basically all non-ASCII bytes. That's a lot of breakage. Although, as already noted, it's only breaking things that currently work while relying on a (maybe) undocumented API (byte paths to builtin open aren't actually documented) and on an arguably bad default that nevertheless works for them.
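To illustrate that breakage in the "bytes to UTF-16" direction: the same bytes value names a different file (or no file at all) depending on the implied encoding. Again using cp1252 as a stand-in for the legacy code page:

```python
# The same bytes mean different things under the current (legacy code
# page) interpretation and the proposed UTF-8 interpretation.
raw = "café.txt".encode("cp1252")  # b'caf\xe9.txt'

print(raw.decode("cp1252"))  # 'café.txt' -- today's interpretation
try:
    raw.decode("utf-8")      # proposed interpretation
except UnicodeDecodeError:
    print("UTF-8 rejects the legacy-encoded bytes outright")
```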
> The fifth option is the best for round-tripping within Windows APIs.
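To make concrete why that option is "obviously not sensible" despite round-tripping perfectly: UTF-16-LE bytes contain an embedded NUL for every ASCII character, which breaks any C-style API or byte-oriented code that treats '\0' as a terminator. A quick illustration:

```python
# UTF-16-LE bytes paths are full of embedded NUL bytes, so any code
# (or C API) that treats '\0' as a string terminator sees a one-byte
# "path".
wide = "abc.txt".encode("utf-16-le")
print(wide)  # b'a\x00b\x00c\x00.\x00t\x00x\x00t\x00'
```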
> The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.
Code using Unicode is unaffected, certainly. Ideally that means that only a tiny minority of users should be affected. Are we over-reacting to reports of standard practices in Japan? I've no idea.
> If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
My suggestion:

1. Accept that we should have deprecated builtin open and the io module, but didn't do so.
2. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module, but modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7".
3. Document and publicise it much more prominently, as it is a breaking change.
4. Then leave it one release for people to prepare for the change.

Oh, and (obviously) check back with Guido on his view - he's expressed concern, but I for one don't have the slightest idea in this case what his preference would be...

Paul
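PS: a rough sketch of what the transition release could look like - open_path here is purely hypothetical, not a proposed CPython API; the point is only that bytes paths would keep today's behaviour for one release while warning about the coming switch to UTF-8:

```python
import warnings

# Hypothetical sketch of the transition period: bytes paths still work
# as they do today, but emit a DeprecationWarning announcing the
# planned change of implied encoding. open_path is illustrative only.
def open_path(path, *args, **kwargs):
    if isinstance(path, bytes):
        warnings.warn(
            "bytes paths on Windows are currently interpreted using the "
            "ANSI code page; starting in 3.7 they will be interpreted "
            "as UTF-8 -- pass str, or encode explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    return open(path, *args, **kwargs)
```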