On Sat, Jan 23, 2021 at 9:22 PM Inada Naoki <songofacandy@gmail.com> wrote:

> On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum <guido@python.org> wrote:
> >
> > I have definitely seen BOMs written by Notepad on Windows 10.
> >
> > Why can’t the future be that open() in text mode guesses the encoding?
>
> I don't like guessing. As a Japanese, I have seen many mojibake caused
> by the wrong guess.
> I don't think guessing the encoding is a good part of reliable software.

I agree that guessing encodings in general is a bad idea and an avenue for subtle localization issues. Bad things will happen when the guess is wrong, and it will lead to code that works properly on the developer's machine and fails for end users. It makes sense for a text editor to guess, because showing the user something is better than showing nothing, and if the guess is wrong the user can easily see that and perhaps take some manual action to correct it. It does not make sense for a programming language to guess, because the user cannot easily detect or correct an incorrect guess, and mistakes will tend to be propagated rather than caught.

> On the other hand, if we add `open_utf8()`, it's easy to ignore BOM:

Rather than introducing a new `open_utf8` function, I'd suggest the following:

1. Deprecate calling `open` in text mode (the default) without an explicit `encoding=` argument. Three years after the deprecation, change the default encoding for `open` to "utf-8-sig" for reading and "utf-8" for writing, so that a BOM is ignored if one exists when reading but none is created when writing.

2. At the same time as the deprecation is announced, introduce a new __future__ import named "utf8_open" or something like that, to opt into the future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a file in text mode and no explicit encoding is specified.
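
To make the asymmetric defaults in point 1 concrete, here is a minimal sketch that emulates them with today's `open`. The names `open_utf8_default` and `roundtrip_demo` are made up for illustration and are not proposed APIs; only the "utf-8-sig" and "utf-8" codecs are real. Treating "r+" as a writing mode is my assumption, since the proposal above does not cover it.

```python
import os
import tempfile


def open_utf8_default(path, mode="r", **kwargs):
    """Hypothetical stand-in for the proposed default; not a real API."""
    if "b" in mode:
        # Binary mode takes no encoding at all.
        return open(path, mode, **kwargs)
    if "encoding" not in kwargs:
        # Pure read modes skip a leading BOM; any mode that can write
        # uses plain utf-8 so that no BOM is ever emitted (assumption
        # for "r+", which the proposal does not spell out).
        reading_only = "r" in mode and "+" not in mode
        kwargs["encoding"] = "utf-8-sig" if reading_only else "utf-8"
    return open(path, mode, **kwargs)


def roundtrip_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bom.txt")
        # Simulate a file saved by an editor that writes a UTF-8 BOM.
        with open(path, "wb") as f:
            f.write(b"\xef\xbb\xbfhello")

        with open_utf8_default(path) as f:
            text_sig = f.read()      # BOM is dropped
        with open(path, encoding="utf-8") as f:
            text_plain = f.read()    # BOM survives as U+FEFF

        out = os.path.join(tmp, "out.txt")
        with open_utf8_default(out, "w") as f:
            f.write("hello")
        with open(out, "rb") as f:
            raw = f.read()           # no BOM written
    return text_sig, text_plain, raw
```

Reading with "utf-8-sig" also works on files that have no BOM, which is why it is a safe default for reading; using it for writing would prepend a BOM to every new file, which is what the split avoids.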

I think a __future__ import solves the problem better than introducing a new function would. Users who already have a UTF-8 locale (the majority of users on the majority of platforms) could simply turn on the new __future__ import in any file that calls open(); behavior would not change, and the deprecation warning would be suppressed. Users who have a non-UTF-8 locale and want to keep opening text files in that locale's encoding by default can add encoding=locale.getpreferredencoding(False) to retain the old behavior, which also suppresses the deprecation warning. And perhaps we could make a shortcut for that, like encoding="locale".
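
For the second group of users, opting out would look roughly like the sketch below. `locale.getpreferredencoding(False)` is the existing stdlib call; the wrapper `open_with_locale_encoding` and the demo function are hypothetical names used only to show the pattern.

```python
import locale
import os
import tempfile


def open_with_locale_encoding(path, mode="r", **kwargs):
    """Hypothetical helper: make the locale dependence explicit."""
    # Passing False avoids calling setlocale() as a side effect.
    kwargs.setdefault("encoding", locale.getpreferredencoding(False))
    return open(path, mode, **kwargs)


def locale_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "legacy.txt")
        with open_with_locale_encoding(path, "w") as f:
            f.write("plain ascii")
        with open_with_locale_encoding(path) as f:
            # Report which encoding was used along with the contents.
            return f.encoding, f.read()
```

The point is only that the encoding is spelled out at the call site, so a reader of the code can see that the file is deliberately locale-dependent.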

~Matt