On Mon, Feb 8, 2016 at 2:41 PM, Chris Barker wrote:
> Just to clarify -- what does it currently do for bytes? IIUC, Windows uses UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is it assuming some Windows ANSI-compatible encoding? (and what does it return?)
UTF-16 is used in the [W]ide-character API. Bytes paths use the [A]NSI codepage. For a single-byte codepage, the ANSI API roundtrips, i.e. a bytes path that's passed to CreateFileA matches the listing from FindFirstFileA. But for a DBCS codepage, arbitrary bytes paths do not roundtrip. Invalid byte sequences map to the default character. Note that an ASCII question mark is not always the default character; it depends on the codepage.

For example, in codepage 932 (Japanese), it's an error if a lead byte (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not uncommon). In this case the ANSI API substitutes the default character for Japanese, '・' (U+30FB, Katakana middle dot).

    >>> import locale, os
    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']

All invalid sequences get mapped to '・', which roundtrips as b'\x81\x45', so you can't reliably create and open files with arbitrary bytes paths in this locale.
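
If it helps, here is a rough way to check ahead of time whether a given bytes path would survive the trip through a codepage. This is only a sketch: the helper name roundtrips() is made up for illustration, and it uses Python's own codecs rather than the Windows ANSI API, so it approximates the conversion and won't reproduce the default-character substitution or best-fit mappings exactly.

```python
def roundtrips(path: bytes, encoding: str) -> bool:
    """Return True if `path` decodes strictly in `encoding` and
    re-encodes to exactly the same bytes.

    Hypothetical helper: it uses Python's codecs, not the Windows
    ANSI API, so treat the result as an approximation.
    """
    try:
        return path.decode(encoding).encode(encoding) == path
    except UnicodeError:
        # Invalid sequence for this codepage (or unencodable result).
        return False


# Lead byte 0xE0 followed by ASCII '5' (0x35) is invalid in cp932.
print(roundtrips(b'\xe05', 'cp932'))

# The same two bytes are fine in a single-byte codepage such as cp1252.
print(roundtrips(b'\xe05', 'cp1252'))

# U+30FB (Katakana middle dot) is the two-byte sequence 0x81 0x45 in cp932.
print('\u30fb'.encode('cp932') == b'\x81E')
```

A path that fails this check is one you'd probably want to pass as str (and so through the wide-character API) instead of as bytes.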