[Python-ideas] Fix default encodings on Windows

Wed Aug 10 19:04:00 EDT 2016

On Wed, Aug 10, 2016 at 8:09 PM, Random832 <random832 at fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>>
>> Allowing library developers who support POSIX and Windows to just use
>> bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?

Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
locales that use a DBCS codepage such as 932. If a byte sequence is
invalid in the codepage, it gets decoded to the codepage's default
Unicode character before the name is passed to the filesystem, so it
won't successfully roundtrip. In the following example b"\x81\xad",
which isn't defined in CP932, gets mapped to the codepage's default
Unicode character, Katakana middle dot, which encodes back as b"\x81E":

    >>> import locale, os, unicodedata
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\x81\xad', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> unicodedata.name(os.listdir('.')[0])
    'KATAKANA MIDDLE DOT'
    >>> '・'.encode('932')
    b'\x81E'
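
The failed roundtrip can be reproduced without a Japanese locale. This
is only a rough sketch: lossy_decode is a hypothetical stand-in for the
ANSI conversion, not a Windows API, and it simplifies by collapsing the
whole name to the default character when Python's own 'cp932' codec
rejects the bytes.

```python
# Hypothetical simulation of a lossy ANSI decode; not a Windows API.
DEFAULT_CHAR = "\u30fb"  # KATAKANA MIDDLE DOT, as in the session above

def lossy_decode(raw, codepage="cp932"):
    """Decode raw bytes, falling back to the default character when
    the bytes are not valid in the codepage."""
    try:
        return raw.decode(codepage)
    except UnicodeDecodeError:
        # Simplification: the whole name becomes one default character.
        return DEFAULT_CHAR

def roundtrips(raw, codepage="cp932"):
    """True if decoding and re-encoding recovers the original bytes."""
    return lossy_decode(raw, codepage).encode(codepage) == raw

valid = "\u30c6\u30b9\u30c8".encode("cp932")  # a name CP932 can represent
invalid = b"\x81\xad"                         # undefined in CP932

print(roundtrips(valid))    # valid names survive
print(roundtrips(invalid))  # the invalid sequence does not
```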

This isn't a problem for single-byte codepages, since every byte value
uniquely maps to a Unicode code point, even if it's simply b'\x81' =>
u"\x81". Obviously there's still the general problem of dealing with
arbitrary Unicode filenames created by other programs, since the ANSI
API can only return a best-fit encoding of the filename, which is
useless for actually accessing the file.
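
The one-to-one property is easy to check for a codec whose table is
total. A minimal sketch, using Python's 'latin-1' codec as a stand-in
for a single-byte codepage. That choice is an assumption made for
illustration: Python's own 'cp1252' table leaves a few bytes such as
0x81 undefined, so it doesn't behave like the Windows conversion
described above.

```python
# Every byte value decodes to a distinct code point, so encoding the
# result recovers the original bytes exactly.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

distinct = len(set(text))                       # number of code points
restored = text.encode("latin-1")               # back to bytes
passthrough = b"\x81".decode("latin-1")         # b'\x81' => '\x81'

print(distinct == 256, restored == all_bytes, passthrough == "\x81")
```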

>> It probably also entails opening the file descriptor in bytes mode,
>> which might break programs that pass the fd directly to CRT functions.
>> Personally I wish they wouldn't, but it's too late to stop them now.
>
> The only thing O_TEXT does rather than O_BINARY is convert CRLF line
> endings (and maybe end on ^Z), and I don't think we even expose the
> constants for the CRT's unicode modes.

Python 3 uses O_BINARY when opening files, unless you call os.open
directly and omit the flag. Specifically, FileIO.__init__ adds O_BINARY
to the open flags if the platform defines it.
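
Code that calls os.open itself can follow the same "add the flag if the
platform defines it" pattern. A minimal sketch; open_fd_binary is a
hypothetical helper name, not something Python provides:

```python
import os
import tempfile

def open_fd_binary(path, flags=os.O_RDONLY):
    """Open a raw file descriptor without CRT text-mode translation."""
    # os.O_BINARY exists only on Windows; elsewhere this adds nothing.
    return os.open(path, flags | getattr(os, "O_BINARY", 0))

# Bytes that text mode would alter: a CRLF pair and a trailing ^Z.
data = b"line1\r\nline2\x1a"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(data)
    fd = open_fd_binary(path)
    try:
        read_back = os.read(fd, len(data) + 16)
    finally:
        os.close(fd)

print(read_back == data)
```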

The Windows CRT reads the BOM for the Unicode modes O_WTEXT,
O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires
opening the file twice, the first time with read access. See
configure_text_mode() in "Windows
Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".

Python doesn't expose or use these Unicode text-mode constants. That's
for the best because in Unicode mode the CRT invokes the invalid
parameter handler when a buffer doesn't have an even number of bytes,
i.e. a multiple of sizeof(wchar_t). Python could copy how
configure_text_mode() handles the BOM, except it shouldn't write a BOM
for new UTF-8 files.
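
As a rough illustration of that last point, here is a sketch of BOM
handling in Python. It is not a transcription of
configure_text_mode(): the helper names, the set of BOMs checked, and
the choice to write a BOM for new UTF-16 files are all assumptions made
for the example.

```python
import codecs

# BOMs recognized by this sketch (an assumption, not the CRT's list).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

def sniff_bom(head, default="utf-8"):
    """Return (encoding, bom_length) for the first bytes of a file."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return default, 0

def initial_bytes_for_new_file(encoding):
    """Bytes to write first when creating a file: no BOM for UTF-8."""
    if encoding == "utf-16-le":
        return codecs.BOM_UTF16_LE
    return b""

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"plain ascii"))
```

An existing file would be sniffed from its first few bytes and the BOM
skipped before decoding; a new UTF-8 file starts with no BOM at all.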