[Python-ideas] Fix default encodings on Windows

Tue Aug 16 20:14:09 EDT 2016

On 16Aug2016 1650, Victor Stinner wrote:
> 2016-08-17 1:27 GMT+02:00 Steve Dower <steve.dower at python.org>:
>>>     filenameb = os.listdir(b'.')[0]
>>>     # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
>>>     # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
>>>     print("filename bytes: %a" % filenameb)
>>>
>>>     proc = subprocess.Popen(['py', '-2', script],
>>>                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>>>     stdout = proc.communicate(filenameb)[0]
>>>     print("File content: %a" % stdout)
>>
>>
>> If you are defining the encoding as 'mbcs', then you need to check that
>> sys.getfilesystemencoding() == 'mbcs', and if it isn't, then reencode.
>
> Sorry, I don't understand. What do you mean by "defining an encoding"?
> It's not possible to modify sys.getfilesystemencoding() in Python.
> What does "reencode" mean? I'm lost.

You are transferring text between two applications without specifying 
what the encoding is. sys.getfilesystemencoding() does not apply to 
proc.communicate() - you can use your choice of encoding for 
communicating between two processes.
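
As a minimal sketch of what "your choice of encoding" could look like: the 
child script, the send_filename helper and the choice of UTF-8 below are 
all hypothetical and not part of the original example. Any encoding works, 
provided both processes agree on it.

```python
import subprocess
import sys

# Hypothetical child process: it decodes stdin as UTF-8 and echoes the
# text back as UTF-8. The encoding is an explicit agreement between the
# two processes and is independent of sys.getfilesystemencoding().
CHILD = (
    "import sys\n"
    "name = sys.stdin.buffer.read().decode('utf-8')\n"
    "sys.stdout.buffer.write(name.encode('utf-8'))\n"
)

def send_filename(name, encoding="utf-8"):
    """Send a str filename to the child using an explicitly chosen encoding."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    stdout = proc.communicate(name.encode(encoding))[0]
    return stdout.decode(encoding)

result = send_filename("caf\u00e9.txt")
```

The filename stays a str inside each process and only becomes bytes at the 
pipe boundary, where the encoding is stated.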

>> Alternatively, since this script is the "new" code, you would use
>> `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
>> determined that mbcs is the encoding for the later transfer.
>
> My example is not new code. It is a very simplified script to explain
> the issue that can occur in a large code base which *currently* works
> well on Python 2 and Python 3 in the common case (only handle data
> encodable to the ANSI code page).

If you are planning to run it with Python 3.6, then I'd argue it's "new" 
code. When you don't want anything to change, you certainly don't change 
the major version of your runtime.

>> Essentially, the problem is that this code is relying on a certain
>> non-guaranteed behaviour of a deprecated API, where using
>> sys.getfilesystemencoding() as documented would have prevented any issue
>> (see
>> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).
>
> sys.getfilesystemencoding() is used in applications which store data
> as Unicode, but we are talking about applications storing data as
> bytes, no?

No, we're talking about how Python code communicates with the file 
system. Applications can store their data however they like, but when 
they pass it to a filesystem function they need to pass it as str or 
bytes encoded with sys.getfilesystemencoding() (this has always been 
the case).
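
A short sketch of that rule, assuming an ASCII-only sample name so the 
result is the same whatever the filesystem encoding is. os.fsencode() and 
os.fsdecode() are the standard helpers that apply 
sys.getfilesystemencoding() for you; the variable names are illustrative.

```python
import os
import sys

name = "example.txt"  # illustrative filename held as str

# os.fsencode() converts str to bytes with the filesystem encoding, and
# os.fsdecode() reverses it.
encoded = os.fsencode(name)
decoded = os.fsdecode(encoded)

# For an ASCII-only name this matches encoding by hand.
manual = name.encode(sys.getfilesystemencoding())
```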

>> So yes, breaking existing code is something I would never do lightly.
>> However, I'm very much of the opinion that the only code that will break is
>> code that is already broken (or at least fragile) and that nobody is forced
>> to take a major upgrade to Python or should necessarily expect 100%
>> compatibility between major versions.
>
> Well, it's somehow the same issue that we had in Python 2:
> applications work in most cases, but start to fail with non-ASCII
> characters, or maybe only in some cases.
>
> In this case, the ANSI code page is fine if all data can be encoded to
> the ANSI code page. You start to get troubles when you start to use
> characters not encodable to your ANSI code page. Last time I checked,
> Microsoft Visual Studio behaved badly (has bugs) with such filenames.
> It's the same for many applications. So it's not like Windows
> applications already handle this case very well. So let me call it a
> corner case.

The existence of bugs in other applications is not a good reason to help 
people create new bugs.

> I'm not sure that it's worth it to explicitly break the Python
> backward compatibility on Windows for such corner case, especially
> because it's already possible to fix applications by starting to use
> Unicode everywhere (which would likely fix more issues than expected
> as a side effect).
>
> It's still unclear to me if it's simpler to modify an application
> using bytes to start using Unicode (for filenames), or if your
> proposition requires fewer changes.

My proposition requires fewer changes *when you target multiple platforms 
and would prefer to use bytes*. It allows the below code to be written 
as either branch without losing the ability to round-trip whatever 
filename happens to be returned:

import os

if os.name == 'nt':
    f = open(os.listdir('.')[-1])
else:
    f = open(os.listdir(b'.')[-1])

If you choose just the first branch (use str for paths), then you do get 
a better result. However, we have been telling people to do that since 
3.0 (and made it easier in 3.2 IIRC) and it's now 3.5 and they are still 
complaining about not getting to use bytes for paths. So rather than 
have people say "Windows support is too hard", this change enables the 
second branch to be used on all platforms.
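
A minimal sketch of the bytes-only branch, run against a temporary 
directory so it does not depend on the current working directory. The file 
name and contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one sample file so the listing is not empty.
    with open(os.path.join(tmp, "data.txt"), "w") as f:
        f.write("hello")

    # Passing a bytes path to os.listdir() yields bytes names.
    tmp_bytes = os.fsencode(tmp)
    name = os.listdir(tmp_bytes)[-1]

    # The bytes name is handed back to open() unchanged (round trip).
    with open(os.path.join(tmp_bytes, name)) as f:
        content = f.read()
```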

> My main concern is the "makefile issue" which requires more complex
> code to transcode data between UTF-8 and ANSI code page. To me, it's
> like we are going back to Python 2 where no data had known encoding
> and mojibake was the default. If you manipulate strings in two
> encodings, it's likely to make mistakes and concatenate two strings
> encoded to two different encodings (=> mojibake).

Your makefile example is going back to Python 2, as it has no known 
encoding. If you want to associate an encoding with bytes, you decode it 
to text or you explicitly specify what the encoding should be. Your own 
example makes assumptions about what encoding the bytes have, which is 
why it has a bug.
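
To make the failure mode concrete, here is a small sketch with made-up 
sample data (not taken from the makefile example): the same bytes decoded 
with the encoding they were written in, and with a wrongly assumed one.

```python
# Sample data: text encoded to bytes with a known encoding (UTF-8).
raw = "caf\u00e9".encode("utf-8")

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode("utf-8")

# Assuming a different encoding (here the Windows-1252 code page) raises
# no error but silently produces mojibake.
wrong = raw.decode("cp1252")
```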

Cheers,
Steve