[Python-ideas] Fix default encodings on Windows

Steve Dower steve.dower at python.org
Thu Aug 18 13:31:43 EDT 2016


On 18Aug2016 1018, MRAB wrote:
> Could we still call it 'mbcs', but use 'surrogateescape'?

surrogateescape is used for escaping undecodable values when you want to 
represent arbitrary bytes in Unicode.
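
To make that concrete, here is a minimal sketch of the direction 
surrogateescape is designed for (bytes->Unicode and back). The sample 
byte string is made up for illustration.

```python
# Bytes that are not valid UTF-8 (0xE9 is Latin-1 for "é").
raw = b"caf\xe9.txt"

# On decode, surrogateescape turns each undecodable byte 0xNN into the
# lone surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9.txt"

# On encode, the same handler turns those surrogates back into the
# original bytes, so the round trip is lossless.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw
```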

It's the wrong direction for this situation - we are starting with valid 
Unicode and encoding to bytes (for the convenience of the Python 
developer who wants to use bytes everywhere). Bytes correctly encoded 
under mbcs can always be correctly decoded to Unicode ('correctly' 
implies that they were encoded with the same configuration as the 
machine doing the decoding - mbcs changes from machine to machine).
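
A rough sketch of that caveat, assuming cp1252 and cp1251 as stand-ins 
for the ANSI code pages of two different machines (the real 'mbcs' codec 
only exists on Windows and follows the machine's configuration):

```python
text = "caf\u00e9"  # "café"

# Encode on a machine whose code page is (by assumption) cp1252.
data = text.encode("cp1252")
assert data == b"caf\xe9"

# Decoding with the same code page always gets the text back.
same_machine = data.decode("cp1252")
assert same_machine == text

# Decoding with a different code page succeeds, but silently yields
# different text: 0xE9 is U+0439 (Cyrillic "й") in cp1251.
other_machine = data.decode("cp1251")
assert other_machine == "caf\u0439"
```

Neither decode raises, which is why there is nothing for an error 
handler to escape in this direction.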

So there's nothing to escape from mbcs->Unicode, and we don't control 
the definition of Unicode->mbcs well enough to be able to invent an 
escaping scheme while remaining compatible with the operating system's 
interpretation of mbcs (CP_ACP).
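
As a sketch of why the encoding direction is the hard one, again 
assuming cp1252 as a stand-in for CP_ACP and a made-up file name:

```python
ACP = "cp1252"  # assumption: stand-in for the machine's ANSI code page

# Valid Unicode, but the two CJK characters have no cp1252 encoding.
name = "r\u00e9sum\u00e9_\u65e5\u672c.txt"

# surrogateescape does not help here: on encode it only handles the lone
# surrogates U+DC80..U+DCFF, so real characters still raise.
try:
    name.encode(ACP, errors="surrogateescape")
    escaped = True
except UnicodeEncodeError:
    escaped = False
assert not escaped

# The 'replace' handler produces bytes, but the information is gone.
lossy = name.encode(ACP, errors="replace")
assert lossy == b"r\xe9sum\xe9_??.txt"

round_tripped = lossy.decode(ACP)
assert round_tripped != name
```

Any scheme that preserved those characters would have to produce bytes 
that the operating system does not interpret as cp1252 text.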

(One way to look at the utf-8 proposal is as saying "we will escape 
arbitrary Unicode characters within Python bytes strings and decode them 
at the Python-OS boundary". The main concern is backwards compatibility: 
people take arbitrarily encoded bytes and share them without including 
the encoding. Previously that would work on a subset of machines without 
Unicode support, but this change would only make it work within Python 
3.6 and later. Hence the discussion about whether this whole thing was 
deprecated already or not.)

Cheers,
Steve

