
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
> I just want to clearly address two points, since I feel like multiple posts have been unclear on them.
>
> 1. The bytes API was deprecated in 3.3 and it is listed in https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is an unfortunate oversight, but it was certainly announced and the warning has been there for three released versions. We can freely change or remove the support now, IMHO.
For clarity, the statement was: """issue 13374: The Windows bytes API has been deprecated in the os module. Use Unicode filenames, instead of bytes filenames, to not depend on the ANSI code page anymore and to support any filename."""

First of all, note that I'm perfectly OK with deprecating bytes paths. However, this statement specifically does *not* say anything about use of bytes paths outside of the os module (builtin open and the io module being the obvious places).

Secondly, it appears that unfortunately the main Python documentation wasn't updated to state this. So while "we can freely change or remove the support now" may be true, it's not that simple - the debate here is at least in part about builtin open, and there's nothing anywhere that I can see that states that bytes support in open has been deprecated. Maybe there should have been, and maybe everyone involved at the time assumed that it was, but that's water under the bridge.
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
>
> This discussion is for the developers who insist on using bytes for paths within Python, and the question is, "how do we best represent UTF-16 encoded paths in bytes?"
People passing bytes to open() have, in my view, already chosen not to follow the standard advice of "decode incoming data at the boundaries of your application". They may have good reasons for that, but it's perfectly reasonable to expect them to take responsibility for manually tracking the encoding of the resulting bytes values flowing through their code. It is, of course, also true that "works for me in my environment" is a viable strategy - but the maintenance cost of that strategy if things change (whether in Python or in the environment) falls on the application developers - they are hoping that cost is minimal, but that's a risk they choose to take.
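A minimal sketch of what "decode at the boundary" looks like in practice - normalize_path is an illustrative helper of my own invention, not an existing API; os.fsdecode accepts both bytes and str and always hands back str, using the filesystem encoding, so the rest of the program never sees bytes paths:

```python
import os

# Sketch of "decode at the boundary": convert any incoming path-like
# value to str once, on entry, and work in str everywhere afterwards.
# os.fsdecode uses the filesystem encoding, so this behaves sensibly
# on both POSIX and Windows.
def normalize_path(p):
    """Accept bytes or str; always return str for internal use."""
    return os.fsdecode(p)

print(normalize_path(b"data.txt"))  # 'data.txt'
print(normalize_path("data.txt"))   # 'data.txt'
```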
> The choices are:
>
> * don't represent them at all (remove bytes API)
> * convert and drop characters not in the (legacy) active code page
> * convert and fail on characters not in the (legacy) active code page
> * convert and fail on invalid surrogate pairs
> * represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
Actually, with the exception of the last one (which seems "obviously not sensible"), these all feel to me more like answers to the question "how do we best interpret bytes provided to us as UTF-16?". It's a subtle point, but IMO an important one. It's much easier to answer the question you posed, but what people are actually concerned about is interpreting bytes, not representing Unicode. The correct answer to "how do we interpret bytes" is "in the face of ambiguity, refuse to guess" - but people using the bytes API have *already* bought into the current heuristic for guessing, so changing it affects them.
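For what it's worth, options 2 and 3 differ only in the error handling applied during the conversion. A minimal sketch, using cp1252 as a stand-in for the legacy active code page (the real codec on Windows is "mbcs", which only exists there):

```python
# Options 2 and 3 from the list above, illustrated with cp1252 as a
# stand-in for the legacy active code page.
path = "caf\u00e9\u2603.txt"  # é is in cp1252; the snowman is not

# Option 2: convert, dropping/replacing characters not in the code page
lossy = path.encode("cp1252", errors="replace")
print(lossy)  # b'caf\xe9?.txt' -- the snowman silently becomes '?'

# Option 3: convert and fail on characters not in the code page
try:
    path.encode("cp1252", errors="strict")
except UnicodeEncodeError as exc:
    print("option 3 refuses:", exc.reason)
```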
> Currently we have the second option.
>
> My preference is the fourth option, as it will cause the least breakage of existing code and enable the most amount of code to just work in the presence of non-ACP characters.
It changes the encoding used to interpret bytes. While it preserves more information in the "UTF-16 to bytes" direction, nobody really cares about that direction. And in the "bytes to UTF-16" direction, it changes the interpretation of basically all non-ASCII bytes. That's a lot of breakage. Although, as already noted, it's only breaking things that currently work while relying on a (maybe) undocumented API (byte paths to builtin open aren't actually documented) and on an arguably bad default that nevertheless works for them.
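To illustrate that breakage in the "bytes to UTF-16" direction: the same bytes value names a different file (or no file at all) depending on the implied encoding. Again using cp1252 as a stand-in for the legacy code page:

```python
# The same bytes mean different things under the current (legacy code
# page) interpretation and the proposed UTF-8 interpretation.
raw = "café.txt".encode("cp1252")  # b'caf\xe9.txt'

print(raw.decode("cp1252"))  # 'café.txt' -- today's interpretation
try:
    raw.decode("utf-8")      # proposed interpretation
except UnicodeDecodeError:
    print("UTF-8 rejects the legacy-encoded bytes outright")
```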
> The fifth option is the best for round-tripping within Windows APIs.
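To make concrete why that option is "obviously not sensible" despite round-tripping perfectly: UTF-16-LE bytes contain an embedded NUL for every ASCII character, which breaks any C-style API or byte-oriented code that treats '\0' as a terminator. A quick illustration:

```python
# UTF-16-LE bytes paths are full of embedded NUL bytes, so any code
# (or C API) that treats '\0' as a string terminator sees a one-byte
# "path".
wide = "abc.txt".encode("utf-16-le")
print(wide)  # b'a\x00b\x00c\x00.\x00t\x00x\x00t\x00'
```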
> The only code that will break with any change is code that was using an already deprecated API. Code that correctly uses str to represent "encoding agnostic text" is unaffected.
Code using Unicode is unaffected, certainly. Ideally that means that only a tiny minority of users should be affected. Are we over-reacting to reports of standard practices in Japan? I've no idea.
> If you see an alternative choice to those listed above, feel free to contribute it. Otherwise, can we focus the discussion on these (or any new) choices?
My suggestion:

1. Accept that we should have deprecated builtin open and the io module, but didn't do so.
2. Extend the existing deprecation of bytes paths on Windows to cover *all* APIs, not just the os module, but modify the deprecation to be "use of the Windows CP_ACP code page (via the ...A Win32 APIs) is deprecated and will be replaced with use of UTF-8 as the implied encoding for all bytes paths on Windows starting in Python 3.7".
3. Document and publicise it much more prominently, as it is a breaking change.
4. Then leave it one release for people to prepare for the change.

Oh, and (obviously) check back with Guido on his view - he's expressed concern, but I for one don't have the slightest idea in this case what his preference would be...

Paul
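PS: a rough sketch of what the transition release could look like - open_path here is purely hypothetical, not a proposed CPython API; the point is only that bytes paths would keep today's behaviour for one release while warning about the coming switch to UTF-8:

```python
import warnings

# Hypothetical sketch of the transition period: bytes paths still work
# as they do today, but emit a DeprecationWarning announcing the
# planned change of implied encoding. open_path is illustrative only.
def open_path(path, *args, **kwargs):
    if isinstance(path, bytes):
        warnings.warn(
            "bytes paths on Windows are currently interpreted using the "
            "ANSI code page; starting in 3.7 they will be interpreted "
            "as UTF-8 -- pass str, or encode explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    return open(path, *args, **kwargs)
```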