
On Wed, Aug 10, 2016 at 8:09 PM, Random832 <random832@fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>> Allowing library developers who support POSIX and Windows to just use bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?
Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932. If a sequence is invalid, it gets passed to the filesystem as the default Unicode character, so it won't successfully roundtrip. In the following example b"\x81\xad", which isn't defined in CP932, gets mapped to the codepage's default Unicode character, Katakana middle dot, which encodes back as b"\x81E":
    >>> import locale, os, unicodedata
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\x81\xad', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> unicodedata.name(os.listdir('.')[0])
    'KATAKANA MIDDLE DOT'
    >>> '・'.encode('932')
    b'\x81E'
This isn't a problem for single-byte codepages, since every byte value uniquely maps to a Unicode code point, even if it's simply b'\x81' => u"\x81". Obviously there's still the general problem of dealing with arbitrary Unicode filenames created by other programs, since the ANSI API can only return a best-fit encoding of the filename, which is useless for actually accessing the file.
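The roundtrip property can be checked off Windows with a rough sketch like the following. It relies on an assumption worth stating: Python's own cp932 codec is strict and raises on an undefined sequence, whereas the ANSI API substitutes the default character, so this only approximates the failure rather than reproducing it. The helper name roundtrips() is made up for illustration, and latin-1 stands in for a single-byte codepage in which every byte value is defined.

```python
import unicodedata


def roundtrips(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes under encoding and encodes back unchanged."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # Python's codec is strict here; the Windows ANSI API would instead
        # substitute the codepage's default character and carry on.
        return False
    try:
        return text.encode(encoding) == raw
    except UnicodeEncodeError:
        return False


# The default character from the cp932 example above.
middle_dot = "\u30fb"
print(unicodedata.name(middle_dot))
print(middle_dot.encode("cp932"))

# An undefined DBCS sequence versus every byte of a fully defined
# single-byte encoding.
print(roundtrips(b"\x81\xad", "cp932"))
print(all(roundtrips(bytes([b]), "latin-1") for b in range(256)))
```

A library that wants to accept arbitrary bytes paths would need the first check to succeed for every input, which a DBCS codepage cannot guarantee.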
>> It probably also entails opening the file descriptor in bytes mode, which might break programs that pass the fd directly to CRT functions. Personally I wish they wouldn't, but it's too late to stop them now.
>
> The only thing O_TEXT does rather than O_BINARY is convert CRLF line endings (and maybe end on ^Z), and I don't think we even expose the constants for the CRT's unicode modes.
Python 3 uses O_BINARY when opening files, unless you explicitly call os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags if the platform defines it.
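For code that does call os.open directly, a minimal sketch of the same precaution is below. It is not the FileIO implementation, just the equivalent idea at the Python level: OR in O_BINARY when the platform defines it. The helper names read_raw() and demo() are hypothetical.

```python
import os
import tempfile

# os.O_BINARY exists only on Windows; elsewhere fall back to 0 so the
# flag expression is a no-op.
O_BINARY = getattr(os, "O_BINARY", 0)


def read_raw(path):
    """Read a whole file through a raw descriptor opened in binary mode."""
    fd = os.open(path, os.O_RDONLY | O_BINARY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def demo():
    """Write bytes containing CRLF and ^Z, then confirm they read back intact."""
    data = b"line1\r\nline2\x1a tail"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(data)
        return read_raw(path) == data


print(demo())
```

Without the flag, a descriptor from os.open on Windows is in text mode, where the CRLF and ^Z handling mentioned above would apply to reads.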
The Windows CRT reads the BOM for the Unicode modes O_WTEXT, O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires opening the file twice, the first time with read access. See configure_text_mode() in "Windows Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".
Python doesn't expose or use these Unicode text-mode constants. That's for the best because in Unicode mode the CRT invokes the invalid parameter handler when a buffer doesn't have an even number of bytes, i.e. a multiple of sizeof(wchar_t). Python could copy how configure_text_mode() handles the BOM, except it shouldn't write a BOM for new UTF-8 files.
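As a rough illustration of that last suggestion, here is a sketch of BOM sniffing for existing files plus a BOM policy for new ones. It is only loosely modeled on the idea and is not a port of configure_text_mode(); the function names sniff_bom() and bom_for_new_file() are invented for this example, and UTF-32 BOMs are ignored for brevity.

```python
import codecs

# Checked in order; names on the right are Python codec names.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head: bytes):
    """Return (encoding, bom_length) for the leading bytes of an existing file.

    Returns (None, 0) when no known BOM is present, leaving the choice of
    encoding to the caller.
    """
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0


def bom_for_new_file(encoding: str) -> bytes:
    """Return the BOM to write at the start of a newly created file.

    UTF-16 files get a BOM; new UTF-8 files deliberately get none.
    """
    for bom, name in _BOMS:
        if name == encoding and name != "utf-8":
            return bom
    return b""


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"no bom here"))
print(bom_for_new_file("utf-8"))
print(bom_for_new_file("utf-16-le"))
```

A caller would read the first few bytes of an existing file, skip bom_length bytes, and decode the rest with the returned encoding. For an append-only open, that sniffing step is what forces the extra read-access open described above.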