Re: [Python-ideas] Fix default encodings on Windows

10 Aug 2016

On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower wrote:
...
On 10Aug2016 1431, Chris Angelico wrote:
...
I'd rather a single consistent default encoding.
I'm proposing to make that single consistent default encoding utf-8. It
sounds like we're in agreement?
Yes, we are. I was disagreeing with Random's suggestion that mbcs
would also serve. Defaulting to UTF-8 everywhere is (a) consistent on
all systems, regardless of settings; and (b) consistent with
bytes.decode() and str.encode(), both of which default to UTF-8.
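
For reference, point (b) can be checked directly in Python 3. This is only a
small sketch; the sample string is an arbitrary illustration and not taken
from the thread.

```python
# Arbitrary sample text containing non-ASCII characters (illustration only).
text = "naïve café"

# str.encode() with no arguments uses UTF-8 in Python 3.
encoded = text.encode()
assert encoded == text.encode("utf-8")

# bytes.decode() with no arguments also uses UTF-8.
assert encoded.decode() == text
```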
...
...
-0.5. Is there any precedent for this kind of data-based detection
being the default? An explicit "utf-sig" could do a full detection,
but even then it's not perfect - how do you distinguish UTF-32LE from
UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
assume UTF-16", or do you say "files starting U+0000 are rare, so
we'll assume UTF-32"?
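
The ambiguity described above can be shown with the BOM constants from the
standard `codecs` module. This is a minimal sketch: the four bytes FF FE 00 00
are a valid start for either encoding, so the bytes alone cannot settle the
question.

```python
import codecs

# The UTF-32LE BOM is the UTF-16LE BOM followed by two zero bytes.
assert codecs.BOM_UTF32_LE == codecs.BOM_UTF16_LE + b"\x00\x00"

data = codecs.BOM_UTF32_LE  # the bytes FF FE 00 00

# Interpreted as UTF-32: a BOM followed by an empty document.
as_utf32 = data.decode("utf-32")

# Interpreted as UTF-16: a BOM followed by a single U+0000 character.
as_utf16 = data.decode("utf-16")
```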
The BOM exists solely for data-based detection, and the UTF-8 BOM is
different from the UTF-16 and UTF-32 ones. So we either find an exact BOM
(which IIRC decodes as a no-op spacing character, though I have a feeling
some version of Unicode redefined it exclusively for being the marker) or we
use utf-8.
But the main reason for detecting the BOM is that currently opening files
with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with
changing the default encoding to:
* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)
This provides the best compatibility when reading/writing files without
making any guesses. We could reasonably extend this to read utf-16 and
utf-32 if they have a BOM, but that's an extension and not necessary for the
main change.
AIUI the utf-8-sig encoding is happy to decode something that doesn't
have a signature, right? If so, then yes, I would definitely support
that mild mismatch in defaults. Chew up that UTF-8 aBOMination and
just use UTF-8 as is.
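
A quick check of how the two codecs behave today, as a sketch of the proposed
read/write split (the sample payload is illustrative only):

```python
import codecs

payload = "hello".encode("utf-8")      # no signature
with_bom = codecs.BOM_UTF8 + payload   # EF BB BF prefix

# utf-8-sig strips a leading BOM if present, and accepts input without one.
assert with_bom.decode("utf-8-sig") == "hello"
assert payload.decode("utf-8-sig") == "hello"

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
assert with_bom.decode("utf-8") == "\ufeffhello"

# Writing with plain utf-8 emits no BOM; utf-8-sig would add one.
assert "hello".encode("utf-8") == payload
assert "hello".encode("utf-8-sig") == with_bom
```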

I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
that common compared to UTF-8), so I wouldn't stress too much about
that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
doing, but it could easily be retrofitted (that byte sequence won't
decode as UTF-8).
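
A rough sketch of what such a retrofit might look like. The helper names
`is_valid_utf8` and `decode_with_bom_fallback` are hypothetical, not an
existing or proposed API, and the UTF-32 case is deliberately left out (a
UTF-32LE file would be misread as UTF-16 here).

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The bytes FE and FF never occur in well-formed UTF-8, so neither UTF-16 BOM
# can be mistaken for the start of a UTF-8 document.
assert not is_valid_utf8(codecs.BOM_UTF16_BE)  # FE FF
assert not is_valid_utf8(codecs.BOM_UTF16_LE)  # FF FE


def decode_with_bom_fallback(data: bytes) -> str:
    """Hypothetical helper: UTF-16 if a UTF-16 BOM is present, else UTF-8.

    UTF-32 is intentionally not handled in this sketch.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    # utf-8-sig skips a UTF-8 BOM if there is one.
    return data.decode("utf-8-sig")
```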
...
...
...
* force the console encoding to UTF-8 on initialize and revert on
finalize
-0 for Python itself; +1 for Python's interactive interpreter.
Programs that mess with console settings get annoying when they crash
out and don't revert properly. Unless there is *no way* that you could
externally kill the process without also bringing the terminal down,
there's the distinct possibility of messing everything up.
The main problem here is that if the console is not forced to UTF-8 then it
won't render any of the characters correctly.
Ehh, that's annoying. Is there a way to guarantee, at the process
level, that the console will be returned to "normal state" when Python
exits? If not, there's the risk that people run a Python program and
then the *next* program gets into trouble.

But if that happens only on abnormal termination ("I killed Python
from Task Manager, and it left stuff messed up so I had to close the
console"), it's probably an acceptable risk. And the benefit sounds
well worthwhile. Revising my recommendation to +0.9.
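
To make the "force on initialize, revert on finalize" idea concrete, here is a
user-level sketch using ctypes and the Win32 console code page functions. It
is not how the interpreter would implement it; the name `forced_console_utf8`
is hypothetical, and on non-Windows platforms it does nothing. The `finally`
block runs on normal exit and on exceptions, but not if the process is killed
externally, which is exactly the risk discussed above.

```python
import contextlib
import ctypes
import os

UTF8_CODE_PAGE = 65001  # Windows code page identifier for UTF-8


@contextlib.contextmanager
def forced_console_utf8():
    """Hypothetical helper: switch the Windows console to UTF-8 temporarily.

    Yields True if the code pages were changed, False otherwise
    (non-Windows platform, or no console attached to the process).
    """
    if os.name != "nt":
        yield False
        return

    kernel32 = ctypes.windll.kernel32
    old_input = kernel32.GetConsoleCP()
    old_output = kernel32.GetConsoleOutputCP()
    if not old_input or not old_output:
        # A return value of 0 indicates failure, e.g. no console attached.
        yield False
        return

    kernel32.SetConsoleCP(UTF8_CODE_PAGE)
    kernel32.SetConsoleOutputCP(UTF8_CODE_PAGE)
    try:
        yield True
    finally:
        # Restore the previous code pages. This is skipped if the process
        # is terminated externally (for example from Task Manager).
        kernel32.SetConsoleCP(old_input)
        kernel32.SetConsoleOutputCP(old_output)
```

Usage would be along the lines of `with forced_console_utf8(): main()`, where
`main` stands in for whatever the program does.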

ChrisA