On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower@python.org> wrote:
> (Frankly I don't mind what encoding we use, and I'd be quite happy to
> force bytes paths to be UTF-16-LE encoded, which would also round-trip
> invalid surrogate pairs. But that would prevent basic manipulation
> which seems to be a higher priority.)
The CRT manually decodes and encodes paths using the private functions __acrt_copy_path_to_wide_string and __acrt_copy_to_char, which use either the ANSI or the OEM codepage depending on the value returned by the WinAPI function AreFileApisANSI. CPython could follow suit. Doing its own encoding and decoding would let it use filesystem functions that will never get an [A]NSI version (e.g. GetFileInformationByHandleEx), while still retaining backward compatibility.

Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning when lpUsedDefaultChar is true. Filesystem decoding could use MB_ERR_INVALID_CHARS and, on ERROR_NO_UNICODE_TRANSLATION (e.g. for an invalid DBCS sequence), raise a warning and retry without the flag. This could be implemented with a new "warning" error handler for PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose between CP_ACP and CP_OEMCP.
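To make the warn-and-retry behavior concrete, here is a portable Python sketch of the idea. It approximates MultiByteToWideChar with MB_ERR_INVALID_CHARS (and the permissive retry) and WideCharToMultiByte with WC_NO_BEST_FIT_CHARS using codec error handlers; cp932 merely stands in for whatever AreFileApisANSI would select, and the function names are illustrative, not a proposed API:

```python
import warnings

def decode_fs_bytes(raw: bytes, codepage: str = "cp932") -> str:
    # Strict decode first, analogous to MB_ERR_INVALID_CHARS; on failure
    # (the ERROR_NO_UNICODE_TRANSLATION case, e.g. an invalid DBCS
    # sequence), warn and retry permissively, analogous to calling
    # MultiByteToWideChar again without the flag.
    try:
        return raw.decode(codepage, errors="strict")
    except UnicodeDecodeError:
        warnings.warn(f"invalid byte sequence for codepage {codepage}; "
                      "retrying with permissive decoding")
        return raw.decode(codepage, errors="replace")

def encode_fs_str(text: str, codepage: str = "cp932") -> bytes:
    # Analogous to WideCharToMultiByte with WC_NO_BEST_FIT_CHARS: warn
    # when a character cannot be represented (lpUsedDefaultChar would be
    # true), then substitute the default character.
    try:
        return text.encode(codepage, errors="strict")
    except UnicodeEncodeError:
        warnings.warn(f"characters not representable in codepage {codepage}; "
                      "substituting the default character")
        return text.encode(codepage, errors="replace")
```

The real implementation would of course call the Win32 conversion functions directly inside the proposed "warning" handler rather than round-tripping through Python codecs; the sketch only shows the warn-then-fall-back control flow.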