[Python-ideas] Fix default encodings on Windows

Steve Dower steve.dower at python.org
Tue Aug 16 19:27:43 EDT 2016


On 16Aug2016 1603, Victor Stinner wrote:
> 2016-08-16 17:56 GMT+02:00 Steve Dower <steve.dower at python.org>:
>> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
>> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
>> it is". We know exactly what the encoding is on every supported version of
>> Windows. UTF-16.
>
> I think that you missed an important issue (or "use case") which is
> called the "Makefile problem" by Mercurial developers:
> https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22
>
> I already explained it before, but maybe you misunderstood or just
> missed it, so here is a more concrete example.

I guess I misunderstood. The concrete example really helps, thank you.

The problem here is that there is an application boundary without a 
defined encoding, right where you put the comment.

>     filenameb = os.listdir(b'.')[0]
>     # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
>     # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
>     print("filename bytes: %a" % filenameb)
>
>     proc = subprocess.Popen(['py', '-2', script],
> stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>     stdout = proc.communicate(filenameb)[0]
>     print("File content: %a" % stdout)

If you are defining the encoding as 'mbcs', then you need to check that 
sys.getfilesystemencoding() == 'mbcs', and if it isn't, re-encode the 
bytes to 'mbcs' before passing them on.
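
A minimal sketch of that check, offered as an illustration rather than a definitive implementation. The helper name `reencode_filename` and its `target_encoding` parameter are hypothetical; only `sys.getfilesystemencoding()`, `os.fsdecode()` and `codecs.lookup()` come from the standard library.

```python
import codecs
import os
import sys


def reencode_filename(name_bytes, target_encoding):
    """Return name_bytes re-encoded from the filesystem encoding.

    Hypothetical helper: target_encoding is whatever encoding the
    receiving application has been defined to expect.
    """
    fs_encoding = sys.getfilesystemencoding()
    # Compare normalized codec names so that aliases such as
    # 'utf8' and 'utf-8' are treated as the same encoding.
    if codecs.lookup(fs_encoding).name == codecs.lookup(target_encoding).name:
        return name_bytes
    # os.fsdecode() applies the filesystem encoding and error handler.
    text = os.fsdecode(name_bytes)
    return text.encode(target_encoding)
```

On Windows this would be called as `reencode_filename(filenameb, 'mbcs')`; note that the 'mbcs' codec only exists on Windows, and names that the target encoding cannot represent will raise UnicodeEncodeError.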

Alternatively, since this script is the "new" code, you would use 
`os.listdir('.')[0].encode('mbcs')`, given that you have explicitly 
determined that mbcs is the encoding for the later transfer.
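
The str-based alternative could look roughly like this. It is a sketch under assumptions: `list_names_encoded` and `wire_encoding` are hypothetical names, and the sorting is only there to make the result deterministic.

```python
import os


def list_names_encoded(directory, wire_encoding):
    """List a directory as str and encode each name explicitly.

    Hypothetical helper: wire_encoding is the encoding agreed on for
    the transfer to the other application, chosen by the caller.
    """
    # Passing a str path makes os.listdir() return str names, so the
    # filesystem encoding never leaks into the transferred bytes.
    names = sorted(os.listdir(directory))
    return [name.encode(wire_encoding) for name in names]
```

With this shape, the bytes sent to the child process depend only on the encoding named at the call site (for example 'mbcs' on Windows), not on what sys.getfilesystemencoding() happens to return.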

Essentially, the problem is that this code is relying on a certain 
non-guaranteed behaviour of a deprecated API, where using 
sys.getfilesystemencoding() as documented would have prevented any issue 
(see 
https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables). 
In one of the emails I think you missed, I called this out as the only 
case where code will break with a change to sys.getfilesystemencoding().

So yes, breaking existing code is something I would never do lightly. 
However, I'm very much of the opinion that the only code that will break 
is code that is already broken (or at least fragile). Nobody is forced 
to take a major upgrade to Python, and nobody should necessarily expect 
100% compatibility between major versions.

Cheers,
Steve

