On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower <steve.dower@python.org> wrote:
On 10Aug2016 1431, Chris Angelico wrote:
I'd rather a single consistent default encoding.
I'm proposing to make that single consistent default encoding utf-8. It sounds like we're in agreement?
Yes, we are. I was disagreeing with Random's suggestion that mbcs would also serve. Defaulting to UTF-8 everywhere is (a) consistent on all systems, regardless of settings; and (b) consistent with bytes.decode() and str.encode(), both of which default to UTF-8.
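To make point (b) concrete, here is a small sketch of the existing defaults. The sample string is just an illustration.

```python
# str.encode() and bytes.decode() use UTF-8 when no encoding is given.
text = "naïve café"

data = text.encode()                  # no argument: UTF-8
assert data == text.encode("utf-8")

round_tripped = data.decode()         # no argument: UTF-8
assert round_tripped == text
```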
-0.5. Is there any precedent for this kind of data-based detection being the default? An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
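The ambiguity can be shown with the standard codecs module. This is only an illustration of the byte-level clash, not a proposal for how detection should work.

```python
import codecs

# A UTF-16LE BOM followed by U+0000 is byte-for-byte the UTF-32LE BOM.
data = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")
assert data == codecs.BOM_UTF32_LE    # both are FF FE 00 00

# Each codec consumes its own BOM, so the same four bytes decode differently.
as_utf16 = data.decode("utf-16")      # one NUL character
as_utf32 = data.decode("utf-32")      # empty string
```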
The BOM exists solely for data-based detection, and the UTF-8 BOM is different from the UTF-16 and UTF-32 ones. So we either find an exact BOM (which IIRC decodes as a no-op spacing character, though I have a feeling some version of Unicode redefined it exclusively for being the marker) or we use utf-8.
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)
This provides the best compatibility when reading/writing files without making any guesses. We could reasonably extend this to read utf-16 and utf-32 if they have a BOM, but that's an extension and not necessary for the main change.
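A rough sketch of what that pair of defaults would look like if spelled out explicitly today. The helper names read_text and write_text are hypothetical, used only for illustration.

```python
import codecs
import os
import tempfile

def read_text(path):
    # utf-8-sig skips a leading UTF-8 BOM if one is present.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def write_text(path, text):
    # Plain utf-8 never writes a BOM.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))

    # Reading with plain utf-8 keeps the BOM as U+FEFF in the text.
    with open(path, encoding="utf-8") as f:
        kept_bom = f.read()

    skipped_bom = read_text(path)

    # Writing the text back produces a file with no BOM.
    write_text(path, skipped_bom)
    with open(path, "rb") as f:
        raw = f.read()
```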
AIUI the utf-8-sig encoding is happy to decode something that doesn't have a signature, right? If so, then yes, I would definitely support that mild mismatch in defaults. Chew up that UTF-8 aBOMination and just use UTF-8 as is. I've almost never seen files stored in UTF-32 (even UTF-16 isn't all that common compared to UTF-8), so I wouldn't stress too much about that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth doing, but it could easily be retrofitted (that byte sequence won't decode as UTF-8).
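Both claims are easy to check. In the sketch below, sniff_decode is a hypothetical helper showing how UTF-16 recognition might be retrofitted; it is not an existing API.

```python
import codecs

plain = "hello".encode("utf-8")

# utf-8-sig accepts input with or without the signature.
without_sig = plain.decode("utf-8-sig")
with_sig = (codecs.BOM_UTF8 + plain).decode("utf-8-sig")

# Neither UTF-16 BOM is valid UTF-8, so the two cases cannot collide.
utf16_boms_rejected = []
for bom in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):   # FE FF, FF FE
    try:
        bom.decode("utf-8")
        utf16_boms_rejected.append(False)
    except UnicodeDecodeError:
        utf16_boms_rejected.append(True)

def sniff_decode(data):
    """Hypothetical: decode as UTF-16 on a UTF-16 BOM, else as UTF-8.

    UTF-32LE also begins with FF FE; this sketch ignores UTF-32.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return data.decode("utf-16")      # the codec consumes the BOM
    return data.decode("utf-8-sig")
```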
* force the console encoding to UTF-8 on initialize and revert on finalize
-0 for Python itself; +1 for Python's interactive interpreter. Programs that mess with console settings get annoying when they crash out and don't revert properly. Unless there is *no way* that you could externally kill the process without also bringing the terminal down, there's the distinct possibility of messing everything up.
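The set-then-revert pattern might look something like the sketch below. The get_cp and set_cp callables are hypothetical stand-ins for the platform's console calls, so this is not how CPython does or would do it. 65001 is the Windows code page number for UTF-8. Note that a finally block runs on normal exit and on exceptions, but not when the process is killed externally, which is exactly the concern.

```python
import contextlib

@contextlib.contextmanager
def forced_console_encoding(get_cp, set_cp, wanted=65001):
    """Switch the console code page for the block, then restore it.

    get_cp and set_cp are hypothetical callables supplied by the caller.
    """
    original = get_cp()
    set_cp(wanted)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, not on a hard kill.
        set_cp(original)
```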
The main problem here is that if the console is not forced to UTF-8 then it won't render any of the characters correctly.
Ehh, that's annoying. Is there a way to guarantee, at the process level, that the console will be returned to "normal state" when Python exits? If not, there's the risk that people run a Python program and then the *next* program gets into trouble. But if that happens only on abnormal termination ("I killed Python from Task Manager, and it left stuff messed up so I had to close the console"), it's probably an acceptable risk. And the benefit sounds well worthwhile. Revising my recommendation to +0.9.

ChrisA