[Python-ideas] Fix default encodings on Windows
Random832
random832 at fastmail.com
Wed Aug 10 14:46:25 EDT 2016
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
> To summarise the proposals (remembering that these would only affect
> Python 3.6 on Windows):
>
> * change sys.getfilesystemencoding() to return 'utf-8'
> * automatically decode byte paths assuming they are utf-8
> * remove the deprecation warning on byte paths
Why? What's the use case?
> * make the default open() encoding check for a BOM or else use utf-8
> * [ALTERNATIVE] make the default open() encoding check for a BOM or else
> use sys.getpreferredencoding()
For reading, I assume. When opened for writing, it should probably be
utf-8-sig [if it's not mbcs] to match what Notepad does. What about
files opened for appending or updating? In theory open() could read the
whole existing file to check whether it's valid UTF-8, but that has a
time cost.
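For concreteness, here's roughly the asymmetry I mean, as a purely
hypothetical helper (only the UTF-8 BOM is handled here, and
append/update is left as the open question it is):

import codecs

def default_text_encoding(path, mode):
    # Hypothetical sketch, not the proposal itself: utf-8-sig for
    # brand-new files (so they get a BOM, like Notepad), BOM check with
    # a utf-8 fallback when reading an existing file.
    if 'w' in mode or 'x' in mode:
        return 'utf-8-sig'
    try:
        with open(path, 'rb') as f:
            head = f.read(3)
    except OSError:
        return 'utf-8'
    return 'utf-8-sig' if head.startswith(codecs.BOM_UTF8) else 'utf-8'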
Notepad, if there's no BOM, checks the first 256 bytes of the file to
guess whether it's likely to be utf-16 or mbcs [utf-8 isn't considered
AFAIK], and can get it wrong for certain very short files [e.g. the
infamous "this app can break" case].
What to do on opening a pipe or device? [Is os.fstat able to detect
these cases?]
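To partly answer my own parenthetical: os.fstat does seem able to tell
these apart via the mode bits, something like the sketch below (whether
every Windows device shows up the way one would hope is an assumption
on my part):

import os
import stat
import sys

def stream_kind(stream):
    # Classify an open file object by the mode bits from os.fstat().
    st = os.fstat(stream.fileno())
    if stat.S_ISREG(st.st_mode):
        return 'regular file'      # safe to probe for a BOM and seek back
    if stat.S_ISFIFO(st.st_mode):
        return 'pipe'
    if stat.S_ISCHR(st.st_mode):
        return 'character device'  # console, NUL, a serial port, ...
    return 'other'

print(stream_kind(sys.stdin))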
Maybe the BOM detection phase should be deferred until the first read.
What should the file's encoding attribute report at that point if this
is done? Is there a "utf-any" encoding that can handle all five BOMs?
If not, should there be? How are "utf-16" and "utf-32" files opened for
appending or updating handled today?
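As far as I know there is no single "utf-any" codec in the stdlib, but
all five BOMs are exposed as codecs constants, so the sniffing step
itself is only a small table, something like:

import codecs

# The five BOMs, longest first so that the UTF-16-LE BOM (a prefix of
# the UTF-32-LE BOM) doesn't shadow UTF-32-LE. The mapped codecs each
# consume their own BOM when decoding from the start of the stream.
_BOM_CODECS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
]

def sniff_bom(prefix, fallback='utf-8'):
    # prefix: the first 4 (or more) bytes of the file
    for bom, codec in _BOM_CODECS:
        if prefix.startswith(bom):
            return codec
    return fallback

Deferring detection to the first read would then just mean running
something like this on the first buffered chunk instead of at open()
time.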
> * force the console encoding to UTF-8 on initialize and revert on
> finalize
Why not implement a true Unicode console? And what if sys.stdin/stdout
are pipes (or non-console devices such as a serial port)?
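That second question matters because whatever is done to the console
should presumably be skipped when the standard streams aren't a console
at all. A sketch of the cheap check (the serial-port case is exactly
where it falls down):

import sys

def is_real_console(stream):
    # isatty() is False for pipes and regular files; a character device
    # such as a serial port can still report True, which is the awkward
    # case mentioned above.
    try:
        return stream.isatty()
    except (AttributeError, ValueError):
        return False

if not is_real_console(sys.stdout):
    pass  # presumably: leave the console code page alone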