[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Thu Aug 18 13:31:43 EDT 2016
On 18Aug2016 1018, MRAB wrote:
> Could we still call it 'mbcs', but use 'surrogateescape'?
surrogateescape is used for escaping undecodable values when you want to
represent arbitrary bytes in Unicode.
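As a minimal sketch of that direction (bytes in, str out, same bytes back), here is surrogateescape carrying an undecodable byte through a str. The file name is made up for illustration:

```python
# Bytes that are not valid UTF-8: 0xE9 is "é" in Latin-1, but here it is
# a lead byte with no continuation bytes after it.
raw = b"caf\xe9.txt"  # hypothetical file name

# Decoding: the undecodable byte 0xE9 is smuggled into the str as the
# lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into 0xE9,
# so the original bytes are recovered exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

Here `text` is `"caf\udce9.txt"` and `restored` equals `raw`. The escaping exists only to preserve bytes that had no Unicode meaning in the first place.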
It's the wrong direction for this situation - we are starting with valid
Unicode and encoding to bytes (for the convenience of the Python
developer who wants to use bytes everywhere). Bytes correctly encoded
under mbcs can always be correctly decoded to Unicode ('correctly'
implies that they were encoded with the same configuration as the
machine doing the decoding - mbcs changes from machine to machine).
So there's nothing to escape from mbcs->Unicode, and we don't control
the definition of Unicode->mbcs well enough to be able to invent an
escaping scheme while remaining compatible with the operating system's
interpretation of mbcs (CP_ACP).
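A rough illustration of the encoding direction follows. It assumes cp1252 as a stand-in for the machine's code page, since 'mbcs' only exists on Windows and varies per machine, and it uses Python's own "replace" handler, which is not necessarily what the operating system does for CP_ACP:

```python
# Assumption: cp1252 stands in for the local ANSI code page (mbcs).
CODE_PAGE = "cp1252"

# Valid Unicode, but only part of it exists in the code page.
name = "résumé_日本語.txt"  # hypothetical file name


def try_encode(text, errors):
    """Return the encoded bytes, or None if the error handler gives up."""
    try:
        return text.encode(CODE_PAGE, errors=errors)
    except UnicodeEncodeError:
        return None


# Strict encoding fails on the characters the code page lacks.
strict_result = try_encode(name, "strict")

# surrogateescape does not help: when encoding it only handles the lone
# surrogates U+DC80..U+DCFF that a previous decode produced.
escaped_result = try_encode(name, "surrogateescape")

# "replace" succeeds, but the missing characters become "?" for good.
lossy = try_encode(name, "replace")

# Decoding those bytes works fine; there is nothing to escape here.
decoded = lossy.decode(CODE_PAGE)
```

Both `strict_result` and `escaped_result` are `None`, and `decoded` is `"résumé_???.txt"`. The decode step never fails; the information was lost on the way into bytes.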
(One way to look at the utf-8 proposal is saying "we will escape
arbitrary Unicode characters within Python bytes strings and decode them
at the Python-OS boundary". The main concern about this is backwards
compatibility for people who take arbitrarily encoded bytes and
share them without including the encoding. Previously that would work
on a subset of machines without Unicode support, but this change would
only make it work within Python 3.6 and later. Hence the discussion
about whether this whole thing was deprecated already or not.)
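A small sketch of both halves of that, using plain str/bytes conversions rather than any real OS call. The latin-1 consumer is hypothetical and just represents someone who received the bytes without being told the encoding:

```python
name = "résumé_日本語.txt"  # hypothetical file name

# Inside Python: every character fits in UTF-8, so the bytes form
# loses nothing.
as_bytes = name.encode("utf-8")

# At the Python-OS boundary: decoding recovers the exact Unicode name.
at_boundary = as_bytes.decode("utf-8")

# Hypothetical consumer that assumes a different encoding: decoding
# succeeds, but the result is not the original name.
misread = as_bytes.decode("latin-1")
```

`at_boundary` equals `name`, while `misread` is a different string. That second case is the compatibility concern: the bytes only mean the right thing to a reader who knows they are UTF-8.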
Cheers,
Steve