
On 18Aug2016 1018, MRAB wrote:
> Could we still call it 'mbcs', but use 'surrogateescape'?

surrogateescape is used for escaping undecodable values when you want to represent arbitrary bytes in Unicode. It's the wrong direction for this situation - we are starting with valid Unicode and encoding to bytes (for the convenience of the Python developer who wants to use bytes everywhere).

Bytes correctly encoded under mbcs can always be correctly decoded to Unicode ('correctly' implies that they were encoded with the same configuration as the machine doing the decoding - mbcs changes from machine to machine). So there's nothing to escape from mbcs->Unicode, and we don't control the definition of Unicode->mbcs well enough to be able to invent an escaping scheme while remaining compatible with the operating system's interpretation of mbcs (CP_ACP).

(One way to look at the utf-8 proposal is saying "we will escape arbitrary Unicode characters within Python bytes strings and decode them at the Python-OS boundary". The main concern about this is the backwards compatibility issues around people taking arbitrarily encoded bytes and sharing them without including the encoding. Previously that would work on a subset of machines without Unicode support, but this change would only make it work within Python 3.6 and later. Hence the discussion about whether this whole thing was deprecated already or not.)

Cheers,
Steve
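
To illustrate the direction argument, here is a minimal sketch rather than anything from the proposal itself. It uses 'utf-8' and 'ascii' as stand-ins, since 'mbcs' only exists on Windows and varies by machine; the helper names are hypothetical. It shows that surrogateescape smuggles undecodable bytes into a str and restores them on encode, but does nothing for valid Unicode that the target encoding cannot represent.

```python
# Sketch only: 'utf-8' and 'ascii' stand in for 'mbcs', which is
# Windows-only and machine-dependent. Helper names are hypothetical.

def decode_escaped(raw: bytes) -> str:
    # bytes -> str: each undecodable byte 0xNN becomes lone surrogate U+DCNN
    return raw.decode("utf-8", errors="surrogateescape")


def encode_escaped(text: str) -> bytes:
    # str -> bytes: lone surrogates U+DC80..U+DCFF turn back into raw bytes
    return text.encode("utf-8", errors="surrogateescape")


def can_encode(text: str, codec: str) -> bool:
    # Valid characters the codec cannot represent are NOT rescued by
    # surrogateescape; the encode call still raises UnicodeEncodeError.
    try:
        text.encode(codec, errors="surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


raw = b"abc\xff"                      # 0xff is never valid in UTF-8
text = decode_escaped(raw)            # 'abc\udcff'
assert encode_escaped(text) == raw    # lossless round trip of arbitrary bytes

# Starting from valid Unicode (the situation in this thread), there is
# nothing to escape: the euro sign simply cannot be encoded as ASCII.
assert not can_encode("\u20ac", "ascii")
```

The last assertion is the case under discussion: the input is already valid Unicode, so the handler has no escaped bytes to restore.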