Changing the default text encoding of pathlib

My previous thread was hijacked by the "auto guessing" idea, so I am splitting off this thread for pathlib.
Path.open() was added in Python 3.4. Path.read_text() and Path.write_text() were added in Python 3.5. Their history is shorter than that of the built-in open(), so changing their default encoding should be easier than changing it for the built-in open() and TextIOWrapper.
The new default encodings would be:
* read_text() default encoding is "utf-8-sig"
* write_text() default encoding is "utf-8"
* open() default encoding is "utf-8-sig" when mode is "r" or None, "utf-8" otherwise.
Of course, we need a regular deprecation period. When encoding is omitted, they would emit DeprecationWarning (or EncodingWarning, which is a subclass of DeprecationWarning) for three versions (Python 3.10 to 3.12).
What do you think of this idea? Should we "change all at once" rather than "step-by-step"?
Regards,
--
Inada Naoki <songofacandy@gmail.com>
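To make the proposed defaults concrete, here is a minimal sketch that emulates them by passing the encodings explicitly. The helper names `read_text_proposed` and `write_text_proposed` are hypothetical and exist only for this illustration; they are not part of pathlib.

```python
import tempfile
from pathlib import Path


def read_text_proposed(path: Path) -> str:
    # Hypothetical helper: emulates the proposed read_text() default.
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    return path.read_text(encoding="utf-8-sig")


def write_text_proposed(path: Path, data: str) -> int:
    # Hypothetical helper: emulates the proposed write_text() default.
    # Plain "utf-8" never writes a BOM.
    return path.write_text(data, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain.txt"
    with_bom = Path(tmp) / "bom.txt"

    write_text_proposed(plain, "こんにちは")
    # Simulate a file saved by an editor that prepends a UTF-8 BOM.
    with_bom.write_bytes(b"\xef\xbb\xbf" + "こんにちは".encode("utf-8"))

    text_plain = read_text_proposed(plain)
    text_bom = read_text_proposed(with_bom)
    # Reading the same file as plain "utf-8" keeps the BOM as U+FEFF.
    text_bom_raw = with_bom.read_text(encoding="utf-8")
```

Reading with "utf-8-sig" gives the same string for both files, which is presumably why it is proposed for reading only, while writing stays BOM-free.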

On Sun, Jan 24, 2021 at 6:33 PM Inada Naoki <songofacandy@gmail.com> wrote:
My previous thread was hijacked by the "auto guessing" idea,
yes -- I'm a bit confused by that -- are folks advocating for making some sort of encoding detection the default? or available as an option in the stdlib? -- in any case, I think that could be an independent proposal.
First: I really want to see this get pushed forward and get done, one way or another -- using a system setting as a default is a really bad idea in this day of interconnected computers.
But back to PEP 597, and how to get there:
1) We need to start with a consensus about where we want Python to be in N versions. That is not specifically laid out in the PEP but it does imply that in the sometime-long-in-the-future:
- TextIOWrapper will have utf-8 as the default, rather than `locale.getpreferredencoding(False)`. This behaviour will then be inherited by:
  - `open()` without a binary flag in the mode
  - `Path.read_text`
  (and any other utility functions that use TextIOWrapper)
- there will be a string that can be passed to encoding that will indicate that the system default should be used.
Forgive me if there is already a consensus on this -- but this discussion has brought up some thoughts.
1) As TextIOWrapper is an "implementation detail" for most Python developers, maybe it shouldn't have a default encoding at all, and leave the default implementation(s) up to the helper functions, like open() and Path.read_text() -- that would mean changes in more places, but would allow different utility functions to make different choices.
2) Inada proposed an open_text() function be introduced as a stepping stone, with the new behaviour. This led to one person asking if that would imply a open_binary() function as well. An answer to that was no -- as no one is suggesting any changes to open()'s behavior for binary files.
However, I kind of like the idea. We now have two (at least) different file objects potentially returned by open(): TextIOWrapper, and BufferedReader/Writer. And the TextIOWrapper has some pretty different behavior.
I *think* that in virtually all cases, when the code is written, the author knows whether they want a binary or text file, so it may make sense to have two different open() functions, rather than having the Type returned be a function of what mode flags are passed. This would make it easier for people (and tools) to reason about the code with static analysis: e.g.:
open_text().read() would return a string
open_binary().read() would return bytes
This would also make the path to a future with different defaults smoother -- plain "open" gets deprecated -- any new code uses one of the open_* functions, and that new code will never need to be changed again.
Back in the day, a single open() function made more sense. After all, the only difference in the result for binary mode was that linefeed translation was turned off (and the C legacy of course). In fact, this did lead to errors, when folks accidentally left off the 'b', and tested only on *nix systems. That, at least, is less of an issue now; as the text and binary objects are more different, you are far more likely to get errors right away -- but still at run time -- static analysis is still tricky.
On to:
Path.open() was added in Python 3.4. Path.read_text() and
What do you think of this idea?
+1 there is a lot less legacy with Path -- we can move faster.
And I honestly still wonder if making utf-8 the default will cause or fix more bugs :-)
A thought on that -- there are currently two kinds of code "in the wild":
(A) code that uses the default, when they really want utf-8 -- currently a bug, won't be a bug in the future.
(B) code that uses the default when it really does want the system encoding -- currently correct, will become a bug in the future.
It's anyone's guess which of these is more common, but one thing to consider is that (A) is a hidden bug that might reveal itself in the hands of end users who knows when in the future. Whereas (B) will be a bug that is likely to reveal itself fairly quickly (though perhaps also in the (confused) hands of end users as well).
-Chris B
--
Christopher Barker, PhD (Chris)
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
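The open_text()/open_binary() split discussed in this message could look roughly like the sketch below. Both functions are hypothetical (nothing like them exists in the stdlib today), and the UTF-8 default for `open_text` is an assumption taken from the direction of the proposal. The point is only that each function has one fixed return type, so a reader or a type checker does not have to inspect the mode string.

```python
import os
import tempfile
from typing import BinaryIO, TextIO


def open_text(path: str, mode: str = "r", encoding: str = "utf-8") -> TextIO:
    # Hypothetical: always returns a text-mode file object.
    if "b" in mode:
        raise ValueError("open_text() does not accept binary modes")
    return open(path, mode, encoding=encoding)


def open_binary(path: str, mode: str = "r") -> BinaryIO:
    # Hypothetical: always returns a binary-mode file object.
    if "t" in mode:
        raise ValueError("open_binary() does not accept text modes")
    if "b" not in mode:
        mode += "b"
    return open(path, mode)


with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "demo.txt")
    with open_text(name, "w") as f:
        f.write("naïve")
    with open_text(name) as f:
        as_text = f.read()    # str
    with open_binary(name) as f:
        as_bytes = f.read()   # bytes
```

With the single built-in open(), the same distinction depends on whether "b" appears in the mode argument, which is harder to see statically when the mode is not a literal.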

On Mon, 25 Jan 2021 at 20:02, Christopher Barker <pythonchb@gmail.com> wrote:
using a system setting as a default is a really bad idea in this day of interconnected computers.
I'd mildly dispute this. There are (significant) downsides with the default behaviour being system-dependent, yes, but there are *also* disadvantages in having Python not behave consistently with other tools/programs on the same system. However, on POSIX, things are generally consistent, and *already* default to UTF-8. So the proposal is mostly going to affect Windows. And on Windows, there's not much consistency even on a single machine at the moment. Between OEM and ANSI codepages, and other tools that default to UTF-8 "because that's the future", there's not much platform consistency for Python to conform to anyway...
There's a fundamental assumption here that I think needs to be made explicit. Which is that we're assuming that whatever N happens to be, we anticipate that `locale.getpreferredencoding(False)` will still be something other than UTF-8. That's *already* false on most POSIX systems, and TBH I get the impression that Microsoft is pushing quite hard to move Windows 10 to a UTF-8 by default position (although "fast" in Microsoft terms may still be slow to the rest of us ;-))
So I think that the real question here is "do we want to move Python to 'UTF8-by-default' faster than the OS vendors are going?" And I think that the answer to that is much less obvious. It probably also depends heavily on your locale - I doubt it's an accident that Inada-san¹ is proposing this, and he's from Japan :-)
Personally, as an English speaker based in the UK, I'll be happy when UTF-8 is the default everywhere, but I can live with the status quo until that happens. But I'm not the main target for this change.
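For anyone who wants to check which side of that assumption their own machine is on, a small sketch follows. The function name `locale_default_is_utf8` is made up for this example; the result depends entirely on the system and the way Python was started.

```python
import codecs
import locale


def locale_default_is_utf8() -> bool:
    # Hypothetical helper: reports whether the locale encoding that text-mode
    # open() falls back to is UTF-8 on this machine.
    name = locale.getpreferredencoding(False)
    try:
        # codecs.lookup() normalizes aliases such as "UTF-8" or "utf8".
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Python does not know the reported encoding name.
        return False


locale_encoding = locale.getpreferredencoding(False)
is_utf8 = locale_default_is_utf8()
print(f"locale encoding: {locale_encoding!r}, UTF-8 default: {is_utf8}")
```

If this prints True, code that omits `encoding=` already behaves as it would under a UTF-8 default.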
1) As TextIOWrapper is an "implementation detail" for most Python developers, maybe it shouldn't have a default encoding at all, and leave the default implementation(s) up to the helper functions, like open() and Path.read_text() -- that would mean changes in more places, but would allow different utility functions to make different choices.
*shrug*. That sounds plausible, but it's a backward compatibility break that doesn't offer any significant benefits, so I suspect it's not worth doing in practice.
These are good arguments for having explicit open_text and open_binary functions. I don't *like* the idea, because they feel unnecessarily verbose to me, but I can accept that this might just be because I'm used to open(). I do think that having open_text, but *not* having open_binary, would be a bit confusing. Particularly as pathlib has read_text and read_bytes, so it would be inconsistent as well.
This would also make the path to a future with different defaults smoother -- plain "open" gets deprecated -- any new code uses one of the open_* functions, and that new code will never need to be changed again.
Back in the day, a single open() function made more sense. After all, the only difference in the result for binary mode was that linefeed translation was turned off (and the C legacy of course). In fact, this did lead to errors, when folks accidentally left off the 'b', and tested only on *nix systems. That, at least, is less of an issue now; as the text and binary objects are more different, you are far more likely to get errors right away -- but still at run time -- static analysis is still tricky.
This, on the other hand, I'm unequivocally against. The sheer quantity of breakage that would be caused by deprecating open() makes this a complete non-starter. Even if we only "deprecate in documentation", we'd be invalidating huge amounts of advice, books and training materials.
But having open(filename) do something different than Path(filename).open() seems like it's asking for trouble. It would be a source of a lot of unexpected bugs for people migrating from filenames as strings to pathlib, and the *last* thing you want during a migration is having to track down unexpected behavioural differences you hadn't planned for.
There's also (C) code that uses the default, where that default is already UTF-8. Which is probably most non-Windows systems. Those have no bug, and this change will make no difference to them.
Also, (A) is "currently a bug, won't be a bug when the system encoding switches to UTF-8", whereas (B) is "currently correct, will remain correct when the system default becomes UTF-8". So switching Python's default can be seen as:
(A) removes an existing bug a bit sooner.
(B) introduces a bug which will go away again when the system switches to UTF-8 or the user changes their code.
(C) makes no difference.
Frankly, I don't think there's a good answer here, and there will likely be as many opinions as there are participants in the discussion.
Paul
¹ I'm not 100% clear on what the polite form of address is for Japanese names, please let me know if I should be using a different form :-)
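Cases (A) and (B) from the thread can be reproduced on any machine by passing the encodings explicitly. In this sketch "cp1252" is only an assumed stand-in for a non-UTF-8 locale encoding, and the "today" and "future" variable names refer to the current locale-based default and the proposed UTF-8 default respectively.

```python
import tempfile
from pathlib import Path

LEGACY = "cp1252"  # assumption: stands in for a non-UTF-8 locale encoding

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Case (A): the file is really UTF-8, but the code relies on the default.
    utf8_file = tmp_path / "utf8.txt"
    utf8_file.write_text("café", encoding="utf-8")
    case_a_today = utf8_file.read_text(encoding=LEGACY)    # wrong text, no error
    case_a_future = utf8_file.read_text(encoding="utf-8")  # correct text

    # Case (B): the file really is in the legacy encoding.
    legacy_file = tmp_path / "legacy.txt"
    legacy_file.write_text("café", encoding=LEGACY)
    case_b_today = legacy_file.read_text(encoding=LEGACY)  # correct text
    try:
        legacy_file.read_text(encoding="utf-8")
        case_b_future_fails = False
    except UnicodeDecodeError:
        case_b_future_fails = True
```

In this particular example (A) yields garbled text without raising, while (B) raises as soon as the file is read under UTF-8, which matches the point made earlier in the thread that (A) tends to stay hidden and (B) tends to show up quickly. Other byte sequences may behave differently.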

participants (3): Christopher Barker, Inada Naoki, Paul Moore