[Python-ideas] Fix default encodings on Windows

eryk sun eryksun at gmail.com
Wed Aug 17 01:49:50 EDT 2016


On Tue, Aug 16, 2016 at 3:56 PM, Steve Dower <steve.dower at python.org> wrote:
>
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
> it is". We know exactly what the encoding is on every supported version of
> Windows. UTF-16.

Internal filesystem details don't directly affect this issue, except
for how each filesystem handles invalid surrogates in names passed to
functions in the wide-character API. Some filesystems that are
available on Windows do reject a filename that has an invalid
surrogate, so I think any program that attempts to create such
malformed names is already broken.

For example, with NTFS I can create a file named
"\ud800b\ud800a\ud800d", but trying this in a VirtualBox shared folder
fails because the VBoxSF filesystem can't transcode the name to its
internal UTF-8 encoding. Thus I don't think supporting invalid
surrogates should be a deciding factor in favor of UTF-16, which I
think is an impractical choice. Bytes coming from files, databases,
and the network are likely to be either UTF-8 or some legacy encoding,
so the practical choice is between ANSI/OEM and UTF-8. The reliable
choice is UTF-8.
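
To illustrate the surrogate point, here is a minimal Python sketch. It
only shows how the codecs treat the name above. It does not touch any
filesystem, and the helper name to_strict_utf8 is made up for this
example:

```python
# The name from the NTFS example: three lone high surrogates mixed
# with ordinary ASCII letters.
name = "\ud800b\ud800a\ud800d"

# As raw UTF-16 code units the name can be represented as-is, provided
# the codec is told to pass lone surrogates through.
utf16_units = name.encode("utf-16-le", "surrogatepass")


def to_strict_utf8(s):
    """Hypothetical helper: return strict UTF-8 bytes, or None if the
    string contains a lone surrogate and so is not valid Unicode text."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# A consumer that needs well-formed UTF-8 has nothing to store here,
# while a well-formed name converts without trouble.
malformed = to_strict_utf8(name)
well_formed = to_strict_utf8("bad")
```

Here malformed ends up as None and well_formed as b"bad", which is
roughly the situation a filesystem that transcodes to UTF-8 is in.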

UTF-8 for bytes paths can first be adopted in 3.6 as an opt-in
that is enabled via an environment variable. If it's not enabled, or
is explicitly disabled, show a visible warning (i.e. not requiring
-Wall) that legacy bytes paths are deprecated. In 3.7 UTF-8
can become the default, but the same environment variable should allow
opting out to use the legacy encoding. The infrastructure put in place
to support this should be able to work either way.
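
A rough sketch of that switch, written in Python for brevity. The
variable name EXAMPLE_UTF8_BYTES_PATHS, the "1"/"0" values, the
function name, and the choice of FutureWarning (which is shown by
default) are all assumptions for illustration, not anything that has
been decided. Whether the warning should still fire after an explicit
opt-out in 3.7 is also left open above; this sketch keeps it:

```python
import warnings

# Hypothetical variable name, used only for this sketch.
ENV_NAME = "EXAMPLE_UTF8_BYTES_PATHS"


def bytes_path_encoding(environ, utf8_default=False):
    """Pick the codec name for bytes paths.

    utf8_default=False models the proposed 3.6 behavior (opt in);
    utf8_default=True models the proposed 3.7 behavior (opt out).
    """
    value = environ.get(ENV_NAME)
    if value == "1":
        use_utf8 = True
    elif value == "0":
        use_utf8 = False
    else:
        use_utf8 = utf8_default

    if use_utf8:
        return "utf-8"

    # FutureWarning is displayed without any -W option, unlike
    # DeprecationWarning, which is hidden by default in most contexts.
    warnings.warn("legacy bytes paths are deprecated",
                  FutureWarning, stacklevel=2)
    return "mbcs"  # the ANSI code page codec on Windows
```

The same function serves both releases; only the default flips, which
is the sense in which the infrastructure works either way.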

Victor, I haven't yet checked Steve's patch in issue 27781, but making
this change should simplify much of the Windows support code, since
the bytes path conversion can be centralized and relatively few
functions return strings that need to be encoded back as bytes.
posixmodule.c will no longer need separate code paths that call *A
functions, e.g.:

    CreateFileA, CreateDirectoryA, CreateHardLinkA, CreateSymbolicLinkA,
    DeleteFileA, RemoveDirectoryA, FindFirstFileA, MoveFileExA,
    GetFileAttributesA, GetFileAttributesExA, SetFileAttributesA,
    GetCurrentDirectoryA, SetCurrentDirectoryA, SetEnvironmentVariableA,
    ShellExecuteA
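
The centralized conversion could look roughly like the following,
sketched in Python rather than C. The decorator and the two stand-in
functions are invented for this example and are not taken from
posixmodule.c or from the patch. Strict UTF-8 decoding is an
assumption as well; the real error handler is a separate question:

```python
import functools

# Assumed encoding for bytes paths, per the proposal above.
BYTES_PATH_ENCODING = "utf-8"


def accepts_bytes_paths(returns_path=False):
    """Decode a bytes path once at the boundary, so the wrapped function
    only ever deals with str (i.e. only needs the wide-character API).
    If returns_path is true and the caller passed bytes, encode the
    result back to bytes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(path, *args, **kwargs):
            was_bytes = isinstance(path, bytes)
            if was_bytes:
                path = path.decode(BYTES_PATH_ENCODING)
            result = func(path, *args, **kwargs)
            if was_bytes and returns_path:
                result = result.encode(BYTES_PATH_ENCODING)
            return result
        return wrapper
    return decorator


@accepts_bytes_paths(returns_path=True)
def join_child(path, child="child.txt"):
    # Stand-in for one of the few functions that return a path.
    return path + "\\" + child


@accepts_bytes_paths()
def name_length(path):
    # Stand-in for the common case: the result is not a path, so
    # nothing has to be encoded back.
    return len(path)
```

With this in place, join_child(b"C:\\tmp") returns bytes and
join_child("C:\\tmp") returns str, and neither stand-in needs a second
code path for bytes.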

