
2016-08-17 1:27 GMT+02:00 Steve Dower <steve.dower@python.org>:
> filenameb = os.listdir(b'.')[0]
> # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
> # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
> print("filename bytes: %a" % filenameb)
>
> proc = subprocess.Popen(['py', '-2', script],
>                         stdin=subprocess.PIPE,
>                         stdout=subprocess.PIPE)
> stdout = proc.communicate(filenameb)[0]
> print("File content: %a" % stdout)
> If you are defining the encoding as 'mbcs', then you need to check that sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.
Sorry, I don't understand. What do you mean by "defining an encoding"? It's not possible to modify sys.getfilesystemencoding() in Python. What does "reencode" mean? I'm lost.
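If I understand the suggestion above, it might look like the sketch below. Note that `reencode_filename` is a hypothetical helper (not an existing API), and 'cp1252' stands in for 'mbcs' so the sketch also runs outside Windows (the 'mbcs' codec only exists there):

```python
import sys

def reencode_filename(filename_bytes, target_encoding, fs_encoding=None):
    # Hypothetical helper (my reading of the suggestion above): if the
    # filesystem encoding differs from the encoding the later transfer
    # expects, decode the bytes filename and re-encode it.
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    if fs_encoding == target_encoding:
        return filename_bytes  # nothing to do
    return filename_bytes.decode(fs_encoding).encode(target_encoding)

# UTF-8 bytes re-encoded to cp1252 (stand-in for the ANSI code page)
print(reencode_filename('héllo.txt'.encode('utf-8'), 'cp1252', 'utf-8'))
```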
> Alternatively, since this script is the "new" code, you would use `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly determined that mbcs is the encoding for the later transfer.
My example is not new code. It is a very simplified script to explain the issue that can occur in a large code base which *currently* works well on Python 2 and Python 3 in the common case (it only handles data encodable to the ANSI code page).
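To make that "common case" concrete, here is a small cross-platform sketch; `encodable_to_ansi` is a hypothetical helper, and 'cp1252' again stands in for the Windows ANSI code page:

```python
ANSI = 'cp1252'  # stand-in for the Windows ANSI code page ('mbcs')

def encodable_to_ansi(filename):
    # True if the str filename fits in the ANSI code page: the
    # "common case" in which such a code base currently works fine.
    try:
        filename.encode(ANSI)
        return True
    except UnicodeEncodeError:
        return False

print(encodable_to_ansi('héllo.txt'))  # True: é exists in cp1252
print(encodable_to_ansi('\u30d5\u30a1\u30a4\u30eb.txt'))  # False: katakana is not in cp1252
```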
> Essentially, the problem is that this code is relying on a certain non-guaranteed behaviour of a deprecated API, where using sys.getfilesystemencoding() as documented would have prevented any issue (see https://docs.python.org/3/library/os.html#file-names-command-line-arguments-...).
sys.getfilesystemencoding() is used in applications which store data as Unicode, but we are talking about applications storing data as bytes, no?
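For applications that do store filenames as Unicode, the standard library already wraps sys.getfilesystemencoding(): os.fsencode() and os.fsdecode() convert between str and bytes at the boundary (for example, when feeding a pipe). A minimal sketch:

```python
import os

# os.fsencode()/os.fsdecode() apply sys.getfilesystemencoding() with the
# filesystem error handler, so str-based code only converts at the
# bytes boundary instead of juggling encodings itself.
name = 'example.txt'
name_bytes = os.fsencode(name)          # str -> bytes
assert os.fsdecode(name_bytes) == name  # bytes -> str round-trips
print(name_bytes)
```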
> So yes, breaking existing code is something I would never do lightly. However, I'm very much of the opinion that the only code that will break is code that is already broken (or at least fragile) and that nobody is forced to take a major upgrade to Python or should necessarily expect 100% compatibility between major versions.
Well, it's somewhat the same issue that we had in Python 2: applications work in most cases, but start to fail with non-ASCII characters, or maybe only in some cases. Here, the ANSI code page is fine as long as all data can be encoded to it. You start to get into trouble when you use characters not encodable to your ANSI code page. Last time I checked, Microsoft Visual Studio behaved badly (had bugs) with such filenames. The same is true of many applications, so it's not as if Windows applications already handle this case very well. So let me call it a corner case.

I'm not sure that it's worth explicitly breaking Python backward compatibility on Windows for such a corner case, especially because it's already possible to fix applications by using Unicode everywhere (which would likely fix more issues than expected, as a side effect). It's still unclear to me whether it's simpler to modify an application using bytes to start using Unicode (for filenames), or whether your proposal requires fewer changes.

My main concern is the "makefile issue", which requires more complex code to transcode data between UTF-8 and the ANSI code page. To me, it's like going back to Python 2, where no data had a known encoding and mojibake was the default. If you manipulate strings in two encodings, it's easy to make a mistake and concatenate two strings encoded with two different encodings (=> mojibake).

Victor