[Python-ideas] Fix default encodings on Windows

Wed Aug 10 17:31:17 EDT 2016

On Thu, Aug 11, 2016 at 6:09 AM, Random832 <random832 at fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>> > Why? What's the use case? [byte paths]
>>
>> Allowing library developers who support POSIX and Windows to just use
>> bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?

AIUI, the data flow would be: Python bytes object -> decode to Unicode
text -> encode to UTF-16 -> Windows API.  If you do the first
transformation using mbcs, you're guaranteed *some* result (all
Windows codepages have definitions for all byte values, if I'm not
mistaken), but a hard-to-predict one - and worse, one that can change
based on system settings. Also, if someone naively types
"bytepath.decode()", Python will default to UTF-8, *not* to the system
codepage.
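
To make that concrete, here is a small sketch of the same bytes going
through different first-step decodings. cp1252 and cp437 are only
stand-ins for whatever "mbcs" happens to resolve to on a given machine
(the mbcs codec itself only exists on Windows), and the sample name is
made up.

```python
# A path that some library encoded as UTF-8 bytes.
raw = "café".encode("utf-8")          # b'caf\xc3\xa9'

# Stand-ins for two possible system codepages: both succeed,
# but each produces a different (wrong) name.
as_cp1252 = raw.decode("cp1252")
as_cp437 = raw.decode("cp437")

# With no argument, bytes.decode() uses UTF-8, not the system codepage.
as_utf8 = raw.decode()

# Final step of the data flow: the text is encoded to UTF-16 for the
# wide-character Windows API.
wire = as_utf8.encode("utf-16-le")

print(repr(as_cp1252), repr(as_cp437), repr(as_utf8), wire)
```

Only the UTF-8 decode round-trips to the name that was originally
encoded; the other two depend entirely on which codepage is active.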

I'd rather a single consistent default encoding.

> What about only doing the deprecation warning if non-ascii bytes are
> present in the value?

-1. Data-dependent warnings just serve to strengthen the feeling that
"weird characters" keep breaking your programs, instead of encouraging
you to write your program to cope with all characters equally. It's
like being racist against non-ASCII characters :)

On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve.dower at python.org> wrote:
> To summarise the proposals (remembering that these would only affect Python
> 3.6 on Windows):
>
> * change sys.getfilesystemencoding() to return 'utf-8'
> * automatically decode byte paths assuming they are utf-8
> * remove the deprecation warning on byte paths

+1 on these.

> * make the default open() encoding check for a BOM or else use utf-8

-0.5. Is there any precedent for this kind of data-based detection
being the default? An explicit "utf-sig" could do a full detection,
but even then it's not perfect - how do you distinguish UTF-32LE from
UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
assume UTF-16", or do you say "files starting with U+0000 are rare, so
we'll assume UTF-32"?
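
The ambiguity is easy to show with the BOM constants in the standard
codecs module. This is just an illustration of the byte-level overlap,
not a proposal for how detection should work:

```python
import codecs

# A UTF-16LE file whose first character is U+0000:
data = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")

# Those four bytes are exactly the UTF-32LE BOM.
same_as_utf32_bom = (data == codecs.BOM_UTF32_LE)

# Both readings decode without error, to different text.
as_utf16 = data.decode("utf-16")   # one NUL character
as_utf32 = data.decode("utf-32")   # empty string

print(data, same_as_utf32_bom, repr(as_utf16), repr(as_utf32))
```

Any detector has to pick one of the two readings by policy, since the
bytes alone can't settle it.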

> * [ALTERNATIVE] make the default open() encoding check for a BOM or else use
> sys.getpreferredencoding()

-1. Same concerns as the above, plus I'd rather use the saner default.

> * force the console encoding to UTF-8 on initialize and revert on finalize

-0 for Python itself; +1 for Python's interactive interpreter.
Programs that mess with console settings get annoying when they crash
out and don't revert properly. Unless there is *no way* that you could
externally kill the process without also bringing the terminal down,
there's the distinct possibility of messing everything up.

Would it be possible to have a "sys.setconsoleutf8()" that changes the
console encoding and registers an atexit handler to revert? That would
at least leave it in the hands of the app.
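
Roughly what I have in mind, as a sketch only. The function name is
hypothetical (nothing like it exists in sys today); it goes through
ctypes to the Win32 GetConsoleCP/SetConsoleCP and
GetConsoleOutputCP/SetConsoleOutputCP calls, and does nothing on other
platforms or when no console is attached.

```python
import atexit
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def set_console_utf8():
    """Hypothetical helper: switch the console to UTF-8 and revert at exit.

    Returns True if the code pages were changed, False otherwise.
    """
    if sys.platform != "win32":
        return False

    import ctypes
    kernel32 = ctypes.windll.kernel32

    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if not old_in or not old_out:
        return False  # a result of 0 means the call failed (e.g. no console)

    def _restore():
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)

    if not (kernel32.SetConsoleCP(CP_UTF8)
            and kernel32.SetConsoleOutputCP(CP_UTF8)):
        _restore()  # undo a partial change
        return False

    atexit.register(_restore)
    return True


changed = set_console_utf8()
print("console switched to UTF-8:", changed)
```

Note that atexit handlers don't run if the process is killed
externally, so this has the same failure mode described above; it just
makes the change opt-in.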

Overall I'm +1 on shifting from eight-bit encodings to UTF-8. Don't be
held back by what Notepad does.

ChrisA