[Python-ideas] Fix default encodings on Windows

Steve Dower steve.dower at python.org
Wed Aug 10 19:40:31 EDT 2016


On 10Aug2016 1431, Chris Angelico wrote:
> On Thu, Aug 11, 2016 at 6:09 AM, Random832 <random832 at fastmail.com> wrote:
>> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>>>> Why? What's the use case? [byte paths]
>>>
>>> Allowing library developers who support POSIX and Windows to just use
>>> bytes everywhere to represent paths.
>>
>> Okay, how is that use case impacted by it being mbcs instead of utf-8?
>
> AIUI, the data flow would be: Python bytes object -> decode to Unicode
> text -> encode to UTF-16 -> Windows API.  If you do the first
> transformation using mbcs, you're guaranteed *some* result (all
> Windows codepages have definitions for all byte values, if I'm not
> mistaken), but a hard-to-predict one - and worse, one that can change
> based on system settings. Also, if someone naively types
> "bytepath.decode()", Python will default to UTF-8, *not* to the system
> codepage.
>
> I'd rather a single consistent default encoding.

I'm proposing to make that single consistent default encoding utf-8. It 
sounds like we're in agreement?
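
To make the data flow Chris describes concrete, here is a small sketch. The
'mbcs' codec only exists on Windows, so cp1252 and cp1251 are used as
stand-ins for two possible system code pages, and the file name is made up
for illustration:

```python
# Hypothetical byte path produced as UTF-8 by a library that uses bytes
# for paths on every platform ("caf\u00e9.txt" is an illustrative name).
bytepath = "caf\u00e9.txt".encode("utf-8")

# Stand-ins for decoding with "mbcs" under two different system code pages.
as_western = bytepath.decode("cp1252")
as_cyrillic = bytepath.decode("cp1251")

# What a naive bytepath.decode() does: it defaults to UTF-8.
as_default = bytepath.decode()

# ascii() keeps the output printable whatever the console encoding is.
print(ascii(as_western), ascii(as_cyrillic), ascii(as_default))
```

Both code-page decodes succeed, but they give different names from each
other and from the UTF-8 result, which is the "hard-to-predict" part.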

>> What about only doing the deprecation warning if non-ascii bytes are
>> present in the value?
>
> -1. Data-dependent warnings just serve to strengthen the feeling that
> "weird characters" keep breaking your programs, instead of writing
> your program to cope with all characters equally. It's like being
> racist against non-ASCII characters :)

Agreed. This won't happen.

> On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve.dower at python.org> wrote:
>> To summarise the proposals (remembering that these would only affect Python
>> 3.6 on Windows):
>>
>> * change sys.getfilesystemencoding() to return 'utf-8'
>> * automatically decode byte paths assuming they are utf-8
>> * remove the deprecation warning on byte paths
>
> +1 on these.
>
>> * make the default open() encoding check for a BOM or else use utf-8
>
> -0.5. Is there any precedent for this kind of data-based detection
> being the default? An explicit "utf-sig" could do a full detection,
> but even then it's not perfect - how do you distinguish UTF-32LE from
> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
> assume UTF-16", or do you say "files starting U+0000 are rare, so
> we'll assume UTF-32"?

The BOM exists solely for data-based detection, and the UTF-8 BOM is
different from the UTF-16 and UTF-32 ones. So we either find an exact
BOM (U+FEFF, which decodes as a zero-width no-break space, though later
versions of Unicode deprecated that use and kept the character as the
marker) or we use utf-8.
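
As a quick check of both points, the following sketch uses only the BOM
constants from the standard codecs module. It shows the UTF-16LE/UTF-32LE
ambiguity Chris raises, and that the UTF-8 BOM cannot be confused with
either:

```python
import codecs
import unicodedata

# A UTF-16LE file whose first character is U+0000 starts with exactly
# the same four bytes as the UTF-32LE BOM.
utf16_le_nul = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")
ambiguous = utf16_le_nul == codecs.BOM_UTF32_LE

# The UTF-8 BOM starts with a different byte from every UTF-16/32 BOM.
other_boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
              codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
utf8_is_distinct = all(bom[0] != codecs.BOM_UTF8[0] for bom in other_boms)

# Decoded as plain utf-8, the BOM becomes the character U+FEFF.
bom_char = codecs.BOM_UTF8.decode("utf-8")
bom_name = unicodedata.name(bom_char)

print(ambiguous, utf8_is_distinct, bom_name)
```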

But the main reason for detecting the BOM is that currently opening 
files with 'utf-8' does not skip the BOM if it exists. I'd be quite 
happy with changing the default encoding to:

* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)

This provides the best compatibility when reading/writing files without 
making any guesses. We could reasonably extend this to read utf-16 and 
utf-32 if they have a BOM, but that's an extension and not necessary for 
the main change.
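
A minimal sketch of that asymmetric default, using the existing 'utf-8-sig'
and 'utf-8' codecs. The helper names read_text and write_text are
hypothetical and only illustrate the behaviour; they are not a proposed API:

```python
import os
import tempfile


def read_text(path):
    # utf-8-sig skips a leading UTF-8 BOM if present, and otherwise
    # behaves like plain utf-8.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


def write_text(path, text):
    # Plain utf-8 never writes a BOM.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfhello")      # file saved with a UTF-8 BOM

    with open(path, encoding="utf-8") as f:
        plain = f.read()                    # BOM leaks through as U+FEFF
    stripped = read_text(path)              # BOM is skipped

    write_text(path, stripped)
    with open(path, "rb") as f:
        raw = f.read()                      # no BOM was written back
    roundtrip = read_text(path)             # BOM-less file still reads fine

print(ascii(plain), ascii(stripped), raw, ascii(roundtrip))
```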

>> * force the console encoding to UTF-8 on initialize and revert on finalize
>
> -0 for Python itself; +1 for Python's interactive interpreter.
> Programs that mess with console settings get annoying when they crash
> out and don't revert properly. Unless there is *no way* that you could
> externally kill the process without also bringing the terminal down,
> there's the distinct possibility of messing everything up.

The main problem here is that if the console is not forced to UTF-8 then
it won't render non-ASCII characters correctly.

> Would it be possible to have a "sys.setconsoleutf8()" that changes the
> console encoding and slaps in an atexit() to revert? That would at
> least leave it in the hands of the app.

Yes, but if the app is going to opt-in then I'd suggest the 
win_unicode_console package, which won't require any particular changes.
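
For reference, an opt-in helper along the lines Chris suggests might look
roughly like the sketch below. The name set_console_utf8 is hypothetical;
the sketch assumes the Win32 console code page functions are reached
through ctypes, and it does nothing on other platforms:

```python
import atexit
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def set_console_utf8():
    """Hypothetical helper: switch the console to UTF-8, revert at exit.

    Returns True if the code pages were changed, False otherwise.
    """
    if sys.platform != "win32":
        return False

    import ctypes
    kernel32 = ctypes.windll.kernel32

    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if not old_in or not old_out:
        return False  # no console attached to this process

    def restore():
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)

    if not (kernel32.SetConsoleCP(CP_UTF8)
            and kernel32.SetConsoleOutputCP(CP_UTF8)):
        restore()  # undo a partial change
        return False

    atexit.register(restore)
    return True
```

Note that atexit handlers do not run if the process is killed externally,
so this does not address the concern about a crashed program leaving the
console in the wrong state.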

It sounds like we'll have to look into effectively merging that package 
into the core. I'm afraid that'll come with a much longer tail of bugs 
(and will quite likely break code that expects to use file descriptors 
to access stdin/out), but it's the least impactful way to do it.

Cheers,
Steve


