On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve.dower@python.org> wrote:
and using the *W APIs exclusively is the right way to go.
My proposal was to use the wide-character APIs, but to transcode CP_ACP without best-fit characters and raise a warning whenever the default character is used (e.g. when the Katakana middle dot is substituted while creating a file from a bytes path that contains an invalid CP932 sequence). This proposal was in response to the case made by Stephen Turnbull. If using UTF-8 is getting such heavy pushback, I thought half a solution was better than nothing, and it also sets up the infrastructure to switch to UTF-8 easily if that idea eventually gains acceptance. It could raise exceptions instead of warnings if that's preferred, since bytes paths on Windows are already deprecated.
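To make the idea concrete, here is a rough pure-Python sketch of the behaviour I have in mind. It is only a stand-in: the real change would live in the C layer around the Win32 conversion calls, and the function name decode_bytes_path and the hard-coded "cp932" default are hypothetical, chosen for illustration. It uses Python's own cp932 codec rather than the system's CP_ACP conversion, so the details of what counts as invalid may differ from Windows.

```python
import warnings

# Katakana middle dot, the default (substitution) character for CP932.
DEFAULT_CHAR = "\u30fb"


def decode_bytes_path(path: bytes, encoding: str = "cp932") -> str:
    """Decode a bytes path, warning if a default character had to be used.

    Hypothetical helper: approximates "transcode the ANSI code page and
    warn on substitution" using Python's codec, not the Win32 API.
    """
    try:
        # The common case: every byte sequence is valid in the code page.
        return path.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        warnings.warn(
            "bytes path contains a sequence that is invalid in %s; "
            "substituting the default character" % encoding,
            UnicodeWarning,
            stacklevel=2,
        )
        # Python marks undecodable input with U+FFFD; map that to the
        # code page's default character instead.
        replaced = path.decode(encoding, errors="replace")
        return replaced.replace("\ufffd", DEFAULT_CHAR)
```

Switching the warnings.warn call to a raise would give the stricter, exception-based variant.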
*Any* encoding that may silently lose data is a problem, which basically leaves utf-16 as the only option. However, as that causes other problems, maybe we can accept the tradeoff of returning utf-8 and failing when a path contains invalid surrogate pairs.
Are there any common sources of lone (unpaired) UTF-16 surrogates in Windows filenames? I see that WTF-8 (Wobbly Transformation Format) was developed to handle this problem. A WTF-8 path would round-trip back to the filesystem, but it should only be used internally in a program.
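For reference, the two behaviours can be compared from Python today. This is a sketch, not a WTF-8 implementation: the "surrogatepass" error handler of the utf-8 codec encodes a lone surrogate as a three-byte sequence, which matches what WTF-8 does for unpaired surrogates, but it does not enforce WTF-8's other rules (e.g. that a proper surrogate pair must never appear as two three-byte sequences). The helper names are hypothetical.

```python
def to_wobbly(name: str) -> bytes:
    """Encode a filename, letting lone surrogates through (WTF-8-like)."""
    return name.encode("utf-8", errors="surrogatepass")


def from_wobbly(data: bytes) -> str:
    """Decode bytes produced by to_wobbly() back to the original str."""
    return data.decode("utf-8", errors="surrogatepass")


def to_strict_utf8(name: str) -> bytes:
    """Encode a filename as real UTF-8, failing on lone surrogates."""
    return name.encode("utf-8", errors="strict")


# A name containing an unpaired high surrogate, as NTFS permits.
name = "bad\ud800name"

# The wobbly form round-trips losslessly.
assert from_wobbly(to_wobbly(name)) == name

# Strict UTF-8 refuses the same name instead of losing data silently.
try:
    to_strict_utf8(name)
except UnicodeEncodeError as exc:
    print("strict UTF-8 rejected the path:", exc.reason)
```

The strict variant is the "return utf-8 and fail" tradeoff; the wobbly variant round-trips, but its output is not valid UTF-8 and so should stay internal to the program.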