[Python-ideas] Fix default encodings on Windows

Wed Aug 10 19:48:35 EDT 2016

On 10Aug2016 1630, Random832 wrote:
> On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
>> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
>> locales that use a DBCS codepage such as 932.
>
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

On Windows, all paths are effectively UCS-2 (they are defined as UTF-16, 
but surrogate pairs don't seem to be validated, which IIUC means it's 
really UCS-2), so while the majority can be encoded as valid UTF-8, 
there are some paths which cannot. (These paths are going to break many 
other tools though, such as PowerShell, so we won't be in bad company if 
we can't handle them properly in edge cases.)
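
To make the "some paths cannot" case concrete, here is a minimal sketch. The
file name is made up, and encodes_as_utf8 is a helper invented for this
illustration, not an existing API. A str holding an unpaired surrogate is
rejected by the strict utf-8 codec, and only the non-standard "surrogatepass"
handler lets it through:

```python
# Hypothetical file name containing a lone (unpaired) high surrogate,
# which Windows will accept because it does not validate surrogate pairs.
name = "bad\ud800name"


def encodes_as_utf8(s):
    """Return True if s can be encoded as strictly valid UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# A well-formed non-BMP character (a proper surrogate pair in UTF-16) is fine.
print(encodes_as_utf8("ok\U0001F600"))

# The lone surrogate is not.
print(encodes_as_utf8(name))

# "surrogatepass" encodes the surrogate anyway; the result round-trips,
# but it is not valid UTF-8 as far as other tools are concerned.
raw = name.encode("utf-8", "surrogatepass")
assert raw.decode("utf-8", "surrogatepass") == name
```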

surrogateescape is irrelevant because it only helps round-trip paths that 
start out as bytes, and here the paths start out as UTF-16 from Windows. 
An alternative approach would be to replace mbcs with a ucs-2 encoding 
that is basically just a blob of the path that was returned from Windows 
(using the Unicode APIs). None of the manipulation functions would work 
on this though, since nearly every second character would be \x00, but 
it's the only way (besides using str) to maintain full fidelity for 
every possible path name.
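
A rough sketch of what such a blob would look like, using the existing
utf-16-le codec with "surrogatepass" as a stand-in for the hypothetical ucs-2
encoding (the path is made up). It keeps full fidelity, but byte-oriented
manipulation such as splitting on b"\\" falls apart because of the NULs:

```python
# Hypothetical path containing a lone surrogate.
name = "C:\\dir\\bad\ud800.txt"

# Stand-in for the "blob" idea: the raw UTF-16 code units, little-endian.
blob = name.encode("utf-16-le", "surrogatepass")

# Full fidelity: the blob round-trips to the original str.
assert blob.decode("utf-16-le", "surrogatepass") == name

# Every ASCII character contributes a \x00 byte.
print(blob[:8])
print(blob.count(0), "NUL bytes out of", len(blob))

# Naive bytes manipulation no longer lines up with code-unit boundaries:
# the pieces start or end with stray NUL bytes.
parts = blob.split(b"\\")
print(parts[:2])
```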

Compromising on UTF-8 is going to increase consistency across platforms 
and across different Windows installations without increasing the rate 
of errors above what we currently see (given that invalid characters are 
currently replaced with '?'). It's not a 100% solution, but it's a 99% 
solution where the 1% is not handled well by anyone.
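
As an illustration of the '?' replacement, here is a portable sketch. Since
mbcs only exists on Windows, it assumes cp1252 with errors="replace" as a
stand-in for an ANSI code page conversion; the file name is made up:

```python
# Hypothetical file name with a character outside the (assumed) ANSI code page.
name = "\u03c0.txt"  # GREEK SMALL LETTER PI

# Stand-in for the current behaviour: characters the code page cannot
# represent are silently replaced, so the original name is unrecoverable.
lossy = name.encode("cp1252", "replace")
print(lossy)

# UTF-8 can represent the same name and round-trips it exactly.
utf8 = name.encode("utf-8")
assert utf8.decode("utf-8") == name
```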

Cheers,
Steve