[Python-Dev] File system path encoding on Windows

Steve Dower steve.dower at python.org
Fri Aug 19 15:33:58 EDT 2016


On 19Aug2016 1225, Daniel Holth wrote:
> #1 sounds like a great idea. I suppose surrogatepass solves
> approximately the same problem of Rust's WTF-8, which is a way to
> round-trip bad UCS-2? https://simonsapin.github.io/wtf-8/

Yep.
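
For anyone unfamiliar with the handler, here is a minimal sketch of the
round-trip being discussed. It assumes nothing beyond the standard codecs;
the variable names are only illustrative.

```python
# A lone (unpaired) high surrogate, as can appear in a Windows file name
# that is not well-formed UTF-16.
lone = "\ud800"

# The strict UTF-8 codec refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With 'surrogatepass' the surrogate is encoded as a three-byte sequence
# and decodes back to the same string.
raw = lone.encode("utf-8", "surrogatepass")
restored = raw.decode("utf-8", "surrogatepass")

print(strict_ok, raw, restored == lone)
```

The resulting bytes are not valid UTF-8 in the strict sense, which is the
same trade-off WTF-8 makes.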

> #2 sounds like it would leave several problems, since mbcs is not the
> same as a normal text encoding, IIUC it depends on the active code page.
> So if your active code page is Russian you might not be able to encode
> Japanese characters into MBCS.

That's correct. In 99% (or more) of cases, mbcs is going to behave the
same as what we currently have. The difference is that when we
encode/decode in CPython we can use an error handler other than
'replace' and at least prevent the _silent_ data loss.
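
A rough illustration of the difference between handlers. The 'mbcs' codec
only exists on Windows and depends on the active code page, so this sketch
uses cp1251 as a stand-in for a Russian active code page; that substitution
and the sample file name are assumptions made for the example.

```python
# A file name mixing Cyrillic and Japanese characters (hypothetical).
name = "файл_日本語.txt"

# With 'replace', characters outside the code page silently become '?'.
lossy = name.encode("cp1251", "replace")
round_tripped = lossy.decode("cp1251")

# With 'strict', the same operation fails loudly instead.
try:
    name.encode("cp1251", "strict")
    raised = False
except UnicodeEncodeError:
    raised = True

print(round_tripped, raised)
```

Either way the Japanese characters cannot be represented, but only the
second case tells the caller that something went wrong.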

> Solution #2a Modify Windows so utf-8 is a valid value for the current
> MBCS code page.

Presumably a joke, but it won't happen: too many applications assume
that the active code page uses one byte per character. That assumption
is wrong (some code pages are multi-byte), but it holds often enough
that most of the time you never notice. (Incidentally, this is also the
problem with utf-16, since many applications assume that it's always
one wchar_t per character and get away with it. At least with utf-8 you
encounter multi-byte sequences often enough that you are basically
forced to deal with them.)
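
To make the "one unit per character" assumption concrete, here is a small
sketch that counts code units in each encoding. cp932 is used as an example
of a multi-byte code page; the sample strings are arbitrary.

```python
# One ASCII letter followed by one character outside the BMP.
s = "a\U0001F600"

code_points = len(s)
# UTF-16 code units (what a 16-bit wchar_t buffer would hold).
utf16_units = len(s.encode("utf-16-le")) // 2
utf8_bytes = len(s.encode("utf-8"))

# Two kanji in a Japanese code page: two bytes each, not one.
kanji = "日本"
cp932_bytes = len(kanji.encode("cp932"))

print(code_points, utf16_units, utf8_bytes, cp932_bytes)
```

Code that indexes or truncates by code unit gives wrong results for such
strings, though in utf-16 and in single-byte-dominant code pages the
failing inputs are rare enough to go unnoticed.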

Cheers,
Steve

