On 10Aug2016 1630, Random832 wrote:
> On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
>> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in locales that use a DBCS codepage such as 932.
>
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either, unless you intend to use surrogateescape (which you could also do with mbcs).
>
> Is there any particular reason to expect all bytes paths in this scenario to be valid UTF-8?
On Windows, all paths are effectively UCS-2 (they are defined as UTF-16, but surrogate pairs don't seem to be validated, which IIUC means it's really UCS-2), so while the majority can be encoded as valid UTF-8, there are some paths, namely those containing unpaired surrogates, which cannot. (These paths are going to break many other tools though, such as PowerShell, so we won't be in bad company if we can't handle them properly in edge cases.)
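To make the edge case concrete, here is a minimal sketch in Python. The file name is made up, and the 'surrogatepass' handler is only shown as one possible way to carry such a name through bytes, not as the behaviour I'm proposing here:

```python
# Hypothetical file name containing an unpaired high surrogate, which
# Windows will accept because it does not validate surrogate pairs.
name = "bad\ud800name.txt"

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Assumption for illustration: the 'surrogatepass' error handler emits
# the surrogate as a three-byte sequence, so the name round-trips.
raw = name.encode("utf-8", "surrogatepass")
restored = raw.decode("utf-8", "surrogatepass")
```

The resulting bytes are not valid UTF-8 in the strict sense, so other tools reading them may still reject the name.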
surrogateescape is irrelevant because it only exists to round-trip undecodable bytes when decoding from bytes, whereas here the paths start out as Unicode coming from Windows. An alternative approach would be to replace mbcs with a UCS-2 encoding that is basically just a blob of the path that was returned from Windows (using the Unicode APIs). None of the manipulation functions would work on this though, since nearly every second byte would be \x00, but it's the only way (besides using str) to maintain full fidelity for every possible path name.
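A rough sketch of both points, using a made-up path and Python's utf-16-le codec as a stand-in for the raw blob (no such "ucs-2" codec is assumed to exist):

```python
# Hypothetical path, encoded the way a raw UTF-16 blob would look.
path = "C:\\Users\\test"
blob = path.encode("utf-16-le")

# For ASCII-only text, every second byte of the blob is NUL.
nul_count = blob.count(b"\x00")

# Byte-oriented splitting on the separator cuts code units in half,
# leaving odd-length pieces that are no longer valid UTF-16.
parts = blob.split(b"\\")

# surrogateescape only maps U+DC80..U+DCFF back to bytes, so it cannot
# encode an arbitrary lone surrogate that came from Windows.
lone = "\ud800"
try:
    lone.encode("utf-8", "surrogateescape")
    escaped_ok = True
except UnicodeEncodeError:
    escaped_ok = False

# The raw 16-bit form keeps the lone surrogate intact.
blob2 = lone.encode("utf-16-le", "surrogatepass")
```

So the blob keeps full fidelity, but anything that treats it as a sequence of single-byte characters will mangle it.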
Compromising on UTF-8 is going to increase consistency across platforms and across different Windows installations without increasing the rate of errors above what we currently see (given that invalid characters are currently replaced with '?'). It's not a 100% solution, but it's a 99% solution where the 1% is not handled well by anyone.
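As an illustration of the '?' replacement, here is a small sketch. It uses cp1252 with errors="replace" as a stand-in for the active code page, since the 'mbcs' codec only exists on Windows; the file name is made up:

```python
# Hypothetical file name: three Japanese characters plus an extension.
name = "\u65e5\u672c\u8a9e.txt"

# Stand-in for the current behaviour: characters outside the code page
# are replaced with '?', so the original name cannot be recovered.
lossy = name.encode("cp1252", errors="replace")

# UTF-8 can represent the same name without loss.
utf8 = name.encode("utf-8")
round_tripped = utf8.decode("utf-8")
```

Here lossy comes out as b"???.txt", while the UTF-8 bytes decode back to the original name.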