On 15Aug2016 1819, eryk sun wrote:
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower wrote:
(Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate pairs. But that would prevent basic manipulation which seems to be a higher priority.)
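To make that trade-off concrete, here is a minimal Python sketch. It is not from the thread: the sample path is made up, and the 'surrogatepass' error handler is assumed as the mechanism that lets unpaired surrogates through. It shows UTF-16-LE bytes round-tripping an invalid surrogate while resisting ordinary bytes manipulation.

```python
# Hypothetical path containing a lone (unpaired) surrogate.
path = "C:\\data\\bad\udc80name.txt"

# UTF-16-LE with 'surrogatepass' round-trips the lone surrogate unchanged.
raw = path.encode("utf-16-le", "surrogatepass")
assert raw.decode("utf-16-le", "surrogatepass") == path

# Every ASCII character now carries a NUL byte, which breaks C-string handling.
assert b"\x00" in raw

# Splitting on a one-byte separator cuts through the two-byte code units,
# so the pieces are no longer valid UTF-16-LE on their own.
parts = raw.split(b"\\")
misaligned = [p for p in parts if len(p) % 2 == 1]
```

Here `parts[1]` comes out as `b"\x00d\x00a\x00t\x00a\x00"`, an odd number of bytes, which is the kind of breakage "basic manipulation" refers to.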
The CRT manually decodes and encodes using the private functions __acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use either the ANSI or OEM codepage, depending on the value returned by WinAPI AreFileApisANSI. CPython could follow suit. Doing its own encoding and decoding would enable using filesystem functions that will never get an [A]NSI version (e.g. GetFileInformationByHandleEx), while still retaining backward compatibility.
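A rough sketch of the code page selection described above, written with ctypes rather than in C. The function name `file_api_codepage` is hypothetical; `AreFileApisANSI`, `CP_ACP` (0) and `CP_OEMCP` (1) are the real WinAPI names and values. Outside Windows the sketch simply returns None.

```python
import sys

CP_ACP = 0    # WinAPI identifier for the system ANSI code page
CP_OEMCP = 1  # WinAPI identifier for the system OEM code page


def file_api_codepage():
    """Return CP_ACP or CP_OEMCP per AreFileApisANSI, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; WinDLL only exists on Windows

    kernel32 = ctypes.WinDLL("kernel32")
    # AreFileApisANSI takes no arguments and returns a nonzero BOOL
    # when the file APIs are using the ANSI code page.
    return CP_ACP if kernel32.AreFileApisANSI() else CP_OEMCP
```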
Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning when lpUsedDefaultChar is true. Filesystem decoding could use MB_ERR_INVALID_CHARS and raise a warning and retry without this flag for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This could be implemented with a new "warning" handler for PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose between CP_ACP and CP_OEMCP.
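The "warn and carry on" behaviour proposed above can be approximated in pure Python. This is only an illustration under stated assumptions: `encode_fs_with_warning` is a hypothetical helper, cp1252 stands in for whatever CP_ACP happens to be, and Python's 'replace' handler plays the role of the default character that lpUsedDefaultChar reports. It is not the proposed C-level "warning" handler.

```python
import warnings


def encode_fs_with_warning(path, encoding="cp1252"):
    """Encode path; warn instead of failing when characters are unrepresentable."""
    try:
        # Strict encoding: no substitution of any kind is allowed.
        return path.encode(encoding)
    except UnicodeEncodeError:
        warnings.warn(
            f"{path!r} cannot be represented in {encoding}; data was lost",
            UnicodeWarning,
            stacklevel=2,
        )
        # Fall back to the replacement character, which loses information.
        return path.encode(encoding, "replace")
```

With these assumptions, `encode_fs_with_warning("café.txt")` encodes cleanly, while a name such as `"日本.txt"` triggers the warning and comes back as `b"??.txt"`.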
None of that makes it less complicated or more reliable. Warnings based on values are bad (they should be based on types) and using the *W APIs exclusively is the right way to go.

The question then is whether we allow file system functions to return bytes, and if so, which encoding to use. This then directly informs what the functions accept, for the purposes of round-tripping.

*Any* encoding that may silently lose data is a problem, which basically leaves utf-16 as the only option. However, as that causes other problems, maybe we can accept the tradeoff of returning utf-8 and failing when a path contains invalid surrogate pairs (which is extremely rare by comparison to characters outside of CP_ACP)? If utf-8 is unacceptable, we're back to the current situation and should be removing the support for bytes that was deprecated three versions ago.

Cheers,
Steve
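For readers who want to see the utf-8 trade-off in action, a small Python sketch follows. The paths are invented, and cp1252 is only a stand-in for CP_ACP. Strict utf-8 fails loudly on an invalid surrogate, whereas a code page with a replacement character loses data without complaint.

```python
valid = "C:\\\u65e5\u672c\\file.txt"  # characters outside a Western code page
broken = "C:\\bad\udc80name.txt"      # contains a lone surrogate

# utf-8 round-trips every valid path without loss.
assert valid.encode("utf-8").decode("utf-8") == valid

# Strict utf-8 refuses the lone surrogate, so the failure is visible.
try:
    broken.encode("utf-8")
    utf8_failed = False
except UnicodeEncodeError:
    utf8_failed = True

# A code page with replacement silently turns the name into something else.
lossy = valid.encode("cp1252", "replace")
assert lossy.decode("cp1252") != valid
```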