Re: [Python-ideas] Fix default encodings on Windows

17 Aug 2016

On 17Aug2016 0235, Stephen J. Turnbull wrote:
...
Paul Moore writes:
...
On 16 August 2016 at 16:56, Steve Dower <steve.dower@python.org> wrote:
...
...
This discussion is for the developers who insist on using bytes
for paths within Python, and the question is, "how do we best
represent UTF-16 encoded paths in bytes?"
That's incomplete, AFAICS.  (Paul makes this point somewhat
differently.)  We don't want to represent paths in bytes on Windows if
we can avoid it.  Nor does UTF-16 really enter into it (except for the
technical issue of invalid surrogate pairs).  So a full statement is,
"How do we best represent Windows file system paths in bytes for
interoperability with systems that natively represent paths in bytes?"
("Other systems" refers to both other platforms and existing programs
on Windows.)
That's incorrect, or at least possible to interpret correctly as the 
wrong thing. The goal is "code compatibility with systems ...", not 
interoperability.

Nothing about this will make it easier to take a path from Windows and 
use it on Linux or vice versa, but it will make it easier/more reliable 
to take code that uses paths on Linux and use it on Windows.
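
A minimal sketch of the kind of code in question, assuming a hypothetical helper `list_names` that was written on Linux and passes a bytes path straight to the os module. The temporary directory and the file name `data.txt` exist only for illustration.

```python
import os
import tempfile


def list_names(directory: bytes) -> list:
    """Return the sorted entries of `directory`, keeping them as bytes."""
    # os.listdir() returns bytes names when it is given a bytes path.
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    with open(os.path.join(tmp, "data.txt"), "w"):
        pass
    # os.fsencode() converts the str path using the file system encoding.
    names = list_names(os.fsencode(tmp))

print(names)
```

The point is that the same source should behave the same way on both platforms; the bytes values themselves are not expected to be portable between machines.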
...
BTW, why "surrogate pairs"?  Does Windows validate surrogates to
ensure they come in pairs, but not necessarily in the right order (or
perhaps sometimes they resolve to non-characters such as U+1FFFF)?
Eryk answered this better than I would have.
...
Paul says:
...
People passing bytes to open() have, in my view, already chosen not
to follow the standard advice of "decode incoming data at the
boundaries of your application". They may have good reasons for
that, but it's perfectly reasonable to expect them to take
responsibility for manually tracking the encoding of the resulting
bytes values flowing through their code.
Abstractly true, but in practice there's no such need for those who
made the choice!  In a properly set up POSIX locale[1], it Just Works by
design, especially if you use UTF-8 as the preferred encoding.  It's
Windows developers and users who suffer, not those who wrote the code,
nor their primary audience which uses POSIX platforms.
You mentioned "locale", "preferred" and "encoding" in the same sentence, 
so I hope you're not thinking of locale.getpreferredencoding()? Changing 
that function is orthogonal to this discussion, even though in most 
cases it returns the same code page that the file system functions use 
(which in most cases is also the encoding returned by 
sys.getfilesystemencoding()).
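
For reference, the two functions can be compared directly. This is only a sketch for inspecting one machine; the values printed depend on the platform, the locale settings and the Python version, so no particular output is assumed.

```python
import locale
import sys

# Encoding used by default for text I/O when no encoding is specified.
preferred = locale.getpreferredencoding(False)

# Encoding used to convert between str and bytes file system paths.
fs_encoding = sys.getfilesystemencoding()

print("locale.getpreferredencoding():", preferred)
print("sys.getfilesystemencoding():  ", fs_encoding)
```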

When Windows developers and users suffer, I see it as my responsibility 
to reduce that suffering. Changing Python on Windows should do that 
without affecting developers on Linux, even though the Right Way is to 
change all the developers on Linux to use str for paths.
...
...
...
If you see an alternative choice to those listed above, feel free
to contribute it. Otherwise, can we focus the discussion on these
(or any new) choices?
Accept that we should have deprecated builtin open and the io module,
but didn't do so. Extend the existing deprecation of bytes paths on
Windows, to cover *all* APIs, not just the os module. But modify the
deprecation to be "use of the Windows CP_ACP code page (via the ...A
Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
the implied encoding for all bytes paths on Windows starting in Python
3.7". Document and publicise it much more prominently, as it is a
breaking change. Then leave it one release for people to prepare for
the change.
I like this one!  If my paranoid fears are realized, in practice it
might have to wait two releases, but at least this announcement should
get people who are at risk to speak up.  If they don't, then you can
just call me "Chicken Little" and go ahead!
I don't think there's any reasonable way to noisily deprecate these 
functions within Python, but certainly the docs can be made clearer. 
People who explicitly encode with sys.getfilesystemencoding() should not 
get the deprecation message, but we can't tell whether they got their 
bytes from the right encoding or a RNG, so there's no way to discriminate.
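
A small sketch of why there is nothing to check. The file name and the choice of cp1252 as the "other" encoding are assumptions made for illustration; the same str encoded two ways yields two bytes objects, and neither carries any record of the encoding that produced it.

```python
# Hypothetical non-ASCII file name ("cafe" with an acute accent on the e).
name = "caf\u00e9.txt"

utf8_bytes = name.encode("utf-8")
cp1252_bytes = name.encode("cp1252")

# Both values are plain bytes; the type says nothing about their origin.
same_type = type(utf8_bytes) is type(cp1252_bytes)

# Decoding with the wrong codec can succeed silently and give mojibake...
misread = utf8_bytes.decode("cp1252")

# ...or fail outright, depending on the byte values involved.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print(utf8_bytes, cp1252_bytes, same_type, repr(misread), strict_utf8_ok)
```

A function that receives either value has only the bytes to go on, which is why a warning cannot be limited to callers who encoded "wrongly".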

I'm going to put together a summary post here (hopefully today) and get 
those who have been contributing to basically sign off on it, then I'll 
take it to python-dev. The possible outcomes I'll propose will basically 
be "do we keep the status quo, undeprecate and change the functionality, 
deprecate the deprecation and undeprecate/change in a couple releases, 
or say that it wasn't a real deprecation so we can deprecate and then 
change functionality in a couple releases".

Cheers,
Steve