[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Wed Aug 10 19:48:35 EDT 2016
On 10Aug2016 1630, Random832 wrote:
> On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
>> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
>> locales that use a DBCS codepage such as 932.
>
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

On Windows, all paths are effectively UCS-2 (they are defined as UTF-16,
but surrogate pairs don't seem to be validated, which IIUC means it's
really UCS-2), so while the majority can be encoded as valid UTF-8,
there are some paths which cannot. (These paths are going to break many
other tools though, such as PowerShell, so we won't be in bad company if
we can't handle them properly in edge cases.)

surrogateescape doesn't help here, because it only exists to round-trip
bytes that failed to decode; these paths start out as str and fail when
being encoded.
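
To make the failing case concrete, here is a minimal sketch. The file names and the try_encode helper are hypothetical, made up for illustration. A lone surrogate such as U+D800 can appear in a Windows path, but the strict UTF-8 codec rejects it, and surrogateescape does not cover it either. The surrogatepass handler is shown only as an existing way to force such a name through; it is not something this message proposes.

```python
# Hypothetical file names: one ordinary, one with a lone high surrogate.
plain_name = "r\u00e9sum\u00e9.txt"
odd_name = "bad\ud800name.txt"


def try_encode(text, errors="strict"):
    """Return text encoded as UTF-8, or None if the codec rejects it."""
    try:
        return text.encode("utf-8", errors=errors)
    except UnicodeEncodeError:
        return None


print(try_encode(plain_name))                   # encodes normally
print(try_encode(odd_name))                     # None: strict UTF-8 refuses
print(try_encode(odd_name, "surrogateescape"))  # None: only U+DC80..U+DCFF are handled
print(try_encode(odd_name, "surrogatepass"))    # encodes the surrogate as three bytes
```

Note that the surrogatepass output is not valid UTF-8 as far as other tools are concerned, so it only round-trips within Python.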
An alternative approach would be to replace mbcs with a UCS-2 encoding
that is basically just a blob of the path as returned from Windows
(using the Unicode APIs). None of the manipulation functions would work
on this though, since nearly every second byte would be \x00, but
it's the only way (besides using str) to maintain full fidelity for
every possible path name.
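
A rough sketch of what such a blob looks like, using Python's utf-16-le codec as a stand-in for the hypothetical UCS-2 encoding (the path is made up). It shows both halves of the trade-off: byte-oriented manipulation falls apart, but nothing is lost, even for a lone surrogate.

```python
# Hypothetical path, encoded as 16-bit little-endian code units.
path = "C:\\Users\\steve\\file.txt"
blob = path.encode("utf-16-le")

# For an ASCII-only path, every second byte is a NUL.
nul_bytes = blob.count(b"\x00")

# A byte-oriented split on the separator cuts code units in half,
# leaving pieces with an odd number of bytes.
pieces = blob.split(b"\\")
misaligned = [p for p in pieces if len(p) % 2]

# The blob still keeps full fidelity, even for a lone surrogate.
odd = "bad\ud800name"
raw = odd.encode("utf-16-le", "surrogatepass")
restored = raw.decode("utf-16-le", "surrogatepass")

print(nul_bytes == len(path), len(misaligned), restored == odd)
```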
Compromising on UTF-8 is going to increase consistency across platforms
and across different Windows installations without increasing the rate
of errors above what we currently see (given that invalid characters are
currently replaced with '?'). It's not a 100% solution, but it's a 99%
solution where the 1% is not handled well by anyone.
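
For comparison, a small sketch of the '?' replacement mentioned above. The real substitution is done by Windows for the active code page and is not reproduced here; as an assumption, cp1252 with Python's "replace" error handler stands in for it, and the file names are hypothetical.

```python
# cp1252 stands in for an ANSI code page; the names are hypothetical.
names = ["\u65e5\u672c.txt", "\u4e2d\u6587.txt"]

# Characters outside the code page are silently replaced with '?',
# so two different names collapse into the same bytes.
lossy = [n.encode("cp1252", errors="replace") for n in names]

# UTF-8 keeps the names distinct and round-trips them exactly.
utf8 = [n.encode("utf-8") for n in names]
round_trips = [b.decode("utf-8") == n for b, n in zip(utf8, names)]

print(lossy)
print(round_trips)
```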
Cheers,
Steve