On Sun, Jan 24, 2021 at 6:33 PM Inada Naoki <songofacandy@gmail.com> wrote:
> My previous thread is hijacked about "auto guessing" idea,

yes -- I'm a bit confused by that -- are folks advocating for making some sort of encoding detection the default? or available as an option in the stdlib? -- in any case, I think that could be an independent proposal.

First: I really want to see this get pushed forward and get done, one way or another -- using a system setting as a default is a really bad idea in this day of interconnected computers.

But back to PEP 597, and how to get there:

1) We need to start with a consensus about where we want Python to be in N versions. That is not specifically laid out in the PEP, but it does imply that at some point in the (possibly distant) future:

- TextIOWrapper will have utf-8 as the default, rather than `locale.getpreferredencoding(False)`
  this behaviour will then be inherited by:
  - `open()` without a binary flag in the mode
  - `Path.read_text`
  (and any other utility functions that use TextIOWrapper)
- there will be a string that can be passed to encoding that will indicate that the system default should be used.
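To make the difference concrete, here is a minimal sketch of the two behaviours on the same bytes. It spells the system default out by hand via `locale.getpreferredencoding(False)`, since the special string for `encoding` is what the PEP proposes; the variable names are only illustrative.

```python
import io
import locale

# What text-mode I/O falls back to today when no encoding is passed.
system_default = locale.getpreferredencoding(False)

data = "naïve café".encode("utf-8")

# Explicit encoding: decodes the same way on every machine.
explicit = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").read()

# Spelling out the system default by hand. errors="replace" keeps this
# sketch from raising on systems whose default cannot decode these bytes.
via_system = io.TextIOWrapper(
    io.BytesIO(data), encoding=system_default, errors="replace"
).read()

# True on a UTF-8 system; typically False on a legacy code page.
print(system_default, explicit == via_system)
```

Whether the two reads agree depends entirely on the machine, which is the problem with a system setting as the default.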

Forgive me if there is already a consensus on this -- but this discussion has brought up some thoughts.

1) As TextIOWrapper is an "implementation detail" for most Python developers, maybe it shouldn't have a default encoding at all, and should leave the choice of default up to the helper functions, like open() and Path.read_text() -- that would mean changes in more places, but would allow different utility functions to make different choices.

2) Inada proposed that an open_text() function be introduced as a stepping stone, with the new behaviour. This led to one person asking if that would imply an open_binary() function as well. The answer to that was no -- as no one is suggesting any changes to open()'s behavior for binary files.
However, I kind of like the idea. We now have (at least) two different file objects potentially returned by open(): TextIOWrapper, and BufferedReader/Writer. And the TextIOWrapper has some pretty different behavior. I *think* that in virtually all cases, when the code is written, the author knows whether they want a binary or text file, so it may make sense to have two different open() functions, rather than having the type returned be a function of what mode flags are passed.

This would make it easier for people (and tools) to reason about the code with static analysis:


open_text().read() would return a string
open_binary().read() would return bytes

This would also make the path to a future with different defaults smoother -- plain "open" gets deprecated -- any new code uses one of the open_* functions, and that new code will never need to be changed again.
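A rough sketch of what the two functions could look like as thin wrappers around today's open(). Neither function exists in the stdlib; the signatures, the utf-8 default, and the mode checks are all my assumptions, not anything from the PEP.

```python
import os
import tempfile
from typing import BinaryIO, TextIO


def open_text(path, mode="r", encoding="utf-8", **kwargs) -> TextIO:
    """Hypothetical helper: always returns a text file object."""
    if "b" in mode:
        raise ValueError("open_text() does not accept binary mode")
    return open(path, mode, encoding=encoding, **kwargs)


def open_binary(path, mode="rb", **kwargs) -> BinaryIO:
    """Hypothetical helper: always returns a binary file object."""
    if "t" in mode:
        raise ValueError("open_binary() does not accept text mode")
    if "b" not in mode:
        mode += "b"
    return open(path, mode, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open_binary(path, "wb") as f:
        f.write("héllo".encode("utf-8"))
    with open_text(path) as f:
        text = f.read()   # always str
    with open_binary(path) as f:
        raw = f.read()    # always bytes
```

With this split, the return type is fixed by which function is called, so a type checker does not have to inspect the mode string.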

Back in the day, a single open() function made more sense. After all, the only difference in the result for binary mode was that linefeed translation was turned off (and the C legacy of course). In fact, this did lead to errors, when folks accidentally left off the 'b', and tested only on *nix systems. That, at least, is less of an issue now; as the text and binary objects are more different, you are far more likely to get errors right away -- but still at run time -- static analysis is still tricky.

On to:

> Path.open() was added in Python 3.4. Path.read_text() and
> Path.write_text() was added in Python 3.5.
> Their history is shorter than built-in open(). Changing its default
> encoding should be easier than built-in open and TextIOWrapper.
> New default encodings are:
>
> * read_text() default encoding is "utf-8-sig"
> * write_text() default encoding is "utf-8"
> * open() default encoding is "utf-8-sig" when mode is "r" or None,
> "utf-8" otherwise.
>
> How do you think this idea?

+1 there is a lot less legacy with Path -- we can move faster. And I honestly still wonder if making utf-8 the default will cause or fix more bugs :-)
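For anyone wondering why the proposal uses "utf-8-sig" for reading but plain "utf-8" for writing, here is a small sketch of how the two codecs treat a byte order mark (BOM). The sample string is arbitrary.

```python
import codecs

with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
without_bom = "héllo".encode("utf-8")

# Decoding with "utf-8-sig" drops a leading BOM if there is one,
# and is otherwise the same as "utf-8".
read_sig_bom = with_bom.decode("utf-8-sig")
read_sig_plain = without_bom.decode("utf-8-sig")

# Decoding with plain "utf-8" keeps the BOM as a U+FEFF character.
read_utf8_bom = with_bom.decode("utf-8")

# Encoding with "utf-8-sig" always writes a BOM, which is
# presumably why it is not proposed as the default for writing.
written_sig = "héllo".encode("utf-8-sig")
```

So reading with "utf-8-sig" accepts files with or without a BOM, while writing with "utf-8" never adds one.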

A thought on that -- there are currently two kinds of code "in the wild":
 (A) code that uses the default, when they really want utf-8 -- currently a bug, won't be a bug in the future.
 (B) code that uses the default when it really does want the system encoding -- currently correct, will become a bug in the future.

It's anyone's guess which of these is more common, but one thing to consider is that (A) is a hidden bug that might reveal itself in the hands of end users at some unknown point in the future, whereas (B) is a bug that is likely to reveal itself fairly quickly (though perhaps also in the (confused) hands of end users).
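A small sketch of why the two failure modes tend to surface differently. It uses cp1252 as a stand-in for a legacy system encoding (my choice for illustration; other code pages behave differently in detail), and pure-ASCII data would hide both problems.

```python
text = "café"

# (A) today: a utf-8 file read with a legacy system default.
# Often no exception, just silently wrong text.
mojibake = text.encode("utf-8").decode("cp1252")

# (B) after the change: a legacy-encoded file read as utf-8.
# Non-ASCII bytes are usually invalid utf-8, so this fails loudly.
try:
    text.encode("cp1252").decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True

print(repr(mojibake), failed_loudly)
```

In this case (A) quietly produces garbage that may go unnoticed until it reaches a user, while (B) raises at the first non-ASCII character.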

-Chris B

Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython