[Python-Dev] Re: PEP 597: Add optional EncodingWarning

Feb. 9, 2021

On Tue, 9 Feb 2021 at 16:54, Inada Naoki <songofacandy@gmail.com> wrote:
...
On Tue, Feb 9, 2021 at 9:31 PM Paul Moore <p.f.moore@gmail.com> wrote:
...
Personally, I'm not at all keen on the idea of making users always
specify encoding in the first place, even if it's "just for the
transition".

I agree with you. But as I wrote in the PEP, omitting the encoding has
already caused a lot of trouble.
Windows users cannot just `pip install somepkg` because some library
authors write `long_description=open("README.md").read()` in setup.py.
I am trying to fix this situation with two parallel approaches:
* (This PEP) Provide a tool for finding this type of bug, and
recommend `encoding="utf-8"` for cross-platform library authors.
* (Another thread) Make UTF-8 mode more usable for Windows users,
especially students.

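For illustration, here is a minimal sketch of the setup.py failure mode described above. The README contents, the `read_long_description` helper, and the use of cp1252 as a stand-in for a Windows ANSI code page are all assumptions made for the example, not taken from any real package.

```python
import tempfile
from pathlib import Path


def read_long_description(path):
    """Hypothetical helper: read a README with an explicit encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    original = "A “quoted” word\n"
    # Write the file as UTF-8 bytes, as most README files are stored.
    readme.write_bytes(original.encode("utf-8"))

    # Explicit encoding: the same result on every platform.
    explicit = read_long_description(readme)

    # Stand-in for open(path).read() on a machine whose default
    # encoding is cp1252 (assumed here; the real default varies).
    try:
        with open(readme, encoding="cp1252") as f:
            f.read()
        legacy_failed = False
    except UnicodeDecodeError:
        legacy_failed = True

print(explicit == original, legacy_failed)
```

The closing curly quote encodes to a byte that cp1252 does not define, so the second read fails, while the read with `encoding="utf-8"` does not depend on the machine's default.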
Thanks for explaining (again). There's so much debate, across multiple
proposals, that I can barely follow it. I'm impressed that you're
managing to keep things straight at all :-)

I guess my views on this PEP come down to:

* I see no harm in having a tool that helps developers spot
platform-specific assumptions about encoding.
* Realistically, I'd be surprised if developers actually use such a
tool. If they were likely to do so, they could probably just as easily
locate all the uses of open() in their code, and check that way. So
I'm not sure this proposal is actually worth it, even if the end
result would be very beneficial.
* In the setup.py case, why don't those same Windows users complain
that the library fails to install? A quick bug report, followed by a
simple fix, seems more likely to happen than the developer suddenly
deciding to scan their code for encoding issues.
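As a rough sketch of what "locate all the uses of open()" could look like, the following uses the standard `ast` module to flag text-mode `open()` calls that do not pass an encoding. The function name `find_open_without_encoding` and the sample source are hypothetical. This is a static approximation only: it misses aliases and other APIs such as `pathlib.Path.read_text()`, which is the kind of gap the runtime warning proposed by the PEP is meant to cover.

```python
import ast


def find_open_without_encoding(source, filename="<string>"):
    """Return line numbers of open(...) calls that may use the default encoding."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only the bare builtin name is recognised in this sketch.
        if not (isinstance(func, ast.Name) and func.id == "open"):
            continue
        # Skip calls with encoding=..., or **kwargs that cannot be inspected.
        if any(kw.arg == "encoding" or kw.arg is None for kw in node.keywords):
            continue
        # open(file, mode, buffering, encoding): encoding passed positionally.
        if len(node.args) >= 4:
            continue
        # Find the mode, given positionally or as mode=...
        mode = node.args[1] if len(node.args) >= 2 else None
        for kw in node.keywords:
            if kw.arg == "mode":
                mode = kw.value
        # Binary mode given as a string literal needs no encoding.
        if (
            isinstance(mode, ast.Constant)
            and isinstance(mode.value, str)
            and "b" in mode.value
        ):
            continue
        hits.append(node.lineno)
    return sorted(hits)


sample = '''\
long_description = open("README.md").read()
data = open("logo.png", "rb").read()
text = open("notes.txt", encoding="utf-8").read()
'''
flagged = find_open_without_encoding(sample)
print(flagged)
```

Only the first line of the sample is reported; the binary read and the call with an explicit encoding are skipped.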

Regarding the wider question of UTF-8 as default, my views can probably
be summarised as follows:

* If you want to write correct code to deal with encodings, there is
no substitute for carefully considering every bytes/string conversion,
deciding how you are going to identify the encoding to use, and then
specifying that encoding explicitly. Default values for encodings have
no place in such code.
* In reality, though, that's far too much work for many situations.
Default encodings are a necessary convenience, particularly for simple
scripts, or for people who can't, or don't want to, do the analysis
that the "correct" approach implies.
* Picking the right default is *hard*. Changing the default is even
harder, unfortunately.
* I feel that we already have a number of mechanisms (PEPs 538 and
540) trying to tackle this issue. Adding yet more suggests to me that
we'd be better off pausing and working out why we still have an issue.
We should be moving towards *fewer* mechanisms, not more.
* We have UTF-8 mode, and users can set it per-process (via flag or
environment variable), per-user, or per-site (by environment variable).
I don't honestly believe that a user (whatever OS they work on) who is
capable of writing Python code can't be shown how to set an
environment variable. I see no reason to suggest we need yet another
way to set UTF-8 mode, or that a per-interpreter or per-virtualenv
setting is particularly crucial (suggestions that have been made in
the Python-Ideas threads).
* UTF-8 is likely to be the most appropriate default encoding for
Python in the longer term, and I agree that Windows is fast
approaching the point where a UTF-8 encoding is more appropriate than
the ANSI codepage for "new stuff". But there are a lot of legacy files
and applications around, and I suspect that a UTF-8 default will
inconvenience a lot of people working with such data. But equally,
such people may not be in a huge rush to switch to the latest Python
version. Whichever way we go, though, some people will be
inconvenienced.
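For reference, the two per-process switches mentioned in the list above are the `-X utf8` command-line option and the `PYTHONUTF8` environment variable (Python 3.7 and later). The sketch below starts child interpreters each way and reads back `sys.flags.utf8_mode`; the helper name `utf8_mode_in_child` is made up for the example.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report whether UTF-8 mode is on.
CHILD = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_in_child(extra_args=(), extra_env=None):
    """Run a child interpreter and return its sys.flags.utf8_mode as an int."""
    env = os.environ.copy()
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


via_flag = utf8_mode_in_child(extra_args=["-X", "utf8"])
via_env = utf8_mode_in_child(extra_env={"PYTHONUTF8": "1"})
print(via_flag, via_env)
```

Setting the same variable in a user's or machine's persistent environment (for example with `setx PYTHONUTF8 1` on Windows) gives the per-user or per-site behaviour.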

I'm also somewhat bemused by the rather negative view of "Windows
beginners" that lies behind a lot of these discussions. People's
experiences may well differ, but the people I see using (and learning)
Python on Windows are often experienced computer users, maybe
developers with significant experience in Java or other "enterprise
languages", or data scientists who have a lot of knowledge of
computers, but are relatively new to programming. Or systems admins,
or database specialists, who want to use Python to write scripts on
Windows. None of those people fit the picture of people who wouldn't
know how to set an environment variable, or configure their
environment. On the other hand, (in my experience) they often don't
really have much knowledge of character encodings, and tend to just
use whatever default their PC uses, and expect it to work. They *can*,
however, understand when an encoding problem is explained to them, and
can set an explicit encoding once they know they need to.

Paul