2016-08-16 17:56 GMT+02:00 Steve Dower <steve.dower@python.org>:
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding it is". We know exactly what the encoding is on every supported version of Windows. UTF-16.
I think that you missed an important issue (or "use case") which is called the "Makefile problem" by Mercurial developers:
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem....

I already explained it before, but maybe you misunderstood or just missed it, so here is a more concrete example.

A runner.py script produces a bytes filename and sends it to a second read_file.py script through stdin/stdout. The read_file.py script opens the file using open(filename). The read_file.py script is run by Python 2, which works naturally on bytes. The question is how runner.py produces (encodes) the filename.

runner.py (script run by Python 3.7):
---
import os, sys, subprocess, tempfile

filename = 'h\xe9.txt'
content = b'foo bar'
print("filename unicode: %a" % filename)

root = os.path.realpath(os.path.dirname(__file__))
script = os.path.join(root, 'read_file.py')

old_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmpdir:
    os.chdir(tmpdir)
    with open(filename, 'wb') as fp:
        fp.write(content)

    filenameb = os.listdir(b'.')[0]
    # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
    # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
    print("filename bytes: %a" % filenameb)

    proc = subprocess.Popen(['py', '-2', script],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    stdout = proc.communicate(filenameb)[0]
    print("File content: %a" % stdout)

    os.chdir(old_cwd)
---

read_file.py (run by Python 2):
---
import sys

filename = sys.stdin.read()
# Python 2 calls the Windows C open() function
# which expects a filename encoded to the ANSI code page
with open(filename) as fp:
    content = fp.read()
sys.stdout.write(content)
sys.stdout.flush()
---

read_file.py only works if the non-ASCII filename is encoded to the ANSI code page.

The question is how you expect developers to handle such a use case. For example, are developers responsible for transcoding communicate() data (input and output) manually?
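To make "transcode manually" concrete, here is a minimal sketch of what runner.py might have to do before calling communicate(), assuming os.listdir(b'.') started returning UTF-8 bytes. The helper name transcode_filename is hypothetical, and cp1252 is only a stand-in for the ANSI code page so that the snippet runs anywhere; on Windows the 'mbcs' codec would be the natural choice.

```python
def transcode_filename(name_bytes, src_encoding="utf-8", dst_encoding="cp1252"):
    """Re-encode a bytes filename from src_encoding to dst_encoding.

    Hypothetical helper: cp1252 stands in for the ANSI code page here.
    """
    # Decode strictly, so malformed input fails loudly instead of being mangled.
    text = name_bytes.decode(src_encoding)
    # Raises UnicodeEncodeError if the name cannot be represented
    # in the target code page.
    return text.encode(dst_encoding)


# Same filename as in runner.py, encoded as UTF-8 (the assumed new default).
utf8_name = 'h\xe9.txt'.encode('utf-8')
ansi_name = transcode_filename(utf8_name)
print("%a -> %a" % (utf8_name, ansi_name))
# prints: b'h\xc3\xa9.txt' -> b'h\xe9.txt'
```

In runner.py this would mean passing transcode_filename(filenameb) to communicate() instead of filenameb, and doing the same at every other place where bytes filenames are handed to another application.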
That's why I keep repeating that the ANSI code page is the best *default* encoding: it is the encoding expected by other applications. I know that the ANSI code page is usually limited and causes various painful issues when handling non-ASCII data, but it's the status quo if you really want to handle data as bytes...

Sorry, I didn't read all emails of this long thread, so maybe I missed your answer to this issue.

Victor