Re: PEP 597: Add optional EncodingWarning

Victor Stinner [mailto:vstinner@python.org] wrote:
The warning can explicitly suggest to use encoding="utf8", it should work in almost all cases.
The warning should also explain how to get backwards-compatible behaviour, i.e. suggest encoding="locale".
Inada Naoki <songofacandy@gmail.com> wrote:
This warning is opt-in warning like BytesWarning.
What use is a warning that no-one sees? When the default is switched to encoding="utf8", it will break software, and people need to be warned of that. UnicodeDecodeErrors will abound when files that used to be read in a single-byte encoding fail to decode as UTF-8. All it takes is a single é. If the default encoding is ever to change, there's no way around a noisy warning.
How about swapping around "locale" and None? That is, make "locale" the new default that emits a warning, and encoding=None emits no warning. That has the advantage that old code can be updated to say encoding=None, and then it will work on both old and new Pythons without warning.
regards, Anders
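A minimal sketch of the failure mode described above, assuming text that was written under a Latin-1 locale (the sample string and the helper name `decode_as` are hypothetical). The same bytes that decode cleanly in a single-byte encoding raise UnicodeDecodeError when decoded as UTF-8:

```python
# Bytes as a Latin-1 locale would have written them; "é" is the single byte 0xE9.
data = "café".encode("latin-1")


def decode_as(raw, encoding):
    """Return the decoded text, or the exception class name if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


print(decode_as(data, "latin-1"))  # decodes cleanly
print(decode_as(data, "utf-8"))    # a lone 0xE9 byte is not valid UTF-8
```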

On Tue, 9 Feb 2021 at 16:52, Anders Munch <ajm@flonidan.dk> wrote:
How about swapping around "locale" and None? That is, make "locale" the new default that emits a warning, and encoding=None emits no warning. That has the advantage that old code can be updated to say encoding=None, and then it will work on both old and new Pythons without warning.
I don't understand why working code should have to change *twice*. I'm fine with the idea that people *actually* relying on the current default will need to switch when the default changes, but making them change once to silence the warning and then again to explicitly select the old default is pretty annoying.
If we don't want people to use the default encoding, we should just make encoding a required argument and stop pretending. If omitting the encoding and using the default is intended to be a supported usage, then we should *not* penalise people doing that.
Changing the default is a backward-incompatible change, that's enough of an inconvenience. Changing the (behaviour of the) default *twice* is just making things worse.
Paul

On Wed, Feb 10, 2021 at 1:46 AM Anders Munch <ajm@flonidan.dk> wrote:
At least, I see. We can fix stdlib and tests first, and fix some major tools too. After that, `encoding="locale"` becomes backward/forward compatible at some point.
Please read the PEP and some of my posts in this thread. We are not discussing changing the default encoding for now. This PEP provides a tool to find missing `encoding="utf-8"` bugs. The goal of the PEP is to encourage `encoding="utf-8"` when the user assumes the encoding is UTF-8. If we decide to change the default encoding, EncodingWarning can be used to discourage omitting the `encoding` option, but that is out of scope for this PEP. We don't discourage omitting the encoding option in Python 3.10.
How about swapping around "locale" and None? That is, make "locale" the new default that emits a warning, and encoding=None emits no warning. That has the advantage that old code can be updated to say encoding=None, and then it will work on both old and new Pythons without warning.
I thought about it, but it may not work. Consider a function like this:
```
def read_text(self, encoding=None):
    with open(self._filename, encoding=encoding) as f:
        return f.read()
```
If `encoding=None` suppressed the warning, functions like this would never warn. So I think the current PEP is better. If users want to use the locale encoding, they don't need to fix the warning anytime soon. They can wait until they drop Python 3.9 support. If they want to fix all warnings soon, they can use `encoding=locale.getpreferredencoding(False)`.
Regards,
--
Inada Naoki <songofacandy@gmail.com>
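A sketch of the last suggestion, assuming a hypothetical helper named `read_text_locale`. Passing the locale encoding explicitly keeps the behaviour that a bare `open()` had at the time of this thread, and it also runs on Python versions before 3.10:

```python
import locale
import os
import tempfile


def read_text_locale(path):
    """Read a text file using the locale encoding, spelled out explicitly."""
    # getpreferredencoding(False) reports the locale encoding without
    # calling setlocale().
    enc = locale.getpreferredencoding(False)
    with open(path, encoding=enc) as f:
        return f.read()


# Usage: write and read back with the same explicit encoding.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding=locale.getpreferredencoding(False)) as f:
        f.write("plain ASCII text\n")
    text = read_text_locale(sample)
```

Because the `encoding` argument is no longer omitted, a call like this would not be a candidate for the proposed warning.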

On Tue, Feb 9, 2021 at 5:51 PM Anders Munch <ajm@flonidan.dk> wrote:
encoding="utf8" is backward compatible and is likely to fix encoding bugs when the locale encoding is not UTF-8. It is likely what the developer expected, without knowing that open(filename) does not always use UTF-8. See PEP 597 rationale.
Victor
--
Night gathers, and now my watch begins. It shall not end until my death.

I just reread PEP 597, then re-reread the Rationale.
The PEP helps when the locale is ASCII or C, but that isn't enforced in actual files. I am confident that this is a frequent problem for packages downloaded from mostly-English sites, including many software repositories.
It does not seem to be a win when the locale is something incompatible with utf-8, such as Latin-1, or whatever is still common in Japan. The surrogate-escape mechanism allows a proper round-trip, but python itself will stop processing the characters correctly.
For interactive use, when talking to another program (such as a terminal) instead of an already existing file, the backwards compatibility problem seems worse.
Changing the default to utf-8 (after a deprecation period showing how to make locale an explicit default) may be reasonable, but claiming that it is backwards compatible ... I didn't get that impression from the PEP.
-jJ

On Thu, Feb 11, 2021 at 4:44 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
I just reread PEP 597, then re-reread the Rationale.
Did you read the current PEP 597, or the old PEP 597 on discuss.python.org?
--
Inada Naoki <songofacandy@gmail.com>

On Thu, Feb 11, 2021 at 4:44 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
The PEP helps when the locale is ASCII or C, but that isn't enforced in actual files. I am confident that this is a frequent problem for packages downloaded from mostly-English sites, including many software repositories.
The PEP helps developers who work in a UTF-8 locale find missing `encoding="utf-8"` bugs. This type of bug is very common, and many Windows users suffer from it when reading JSON, YAML, TOML, Markdown, or any other UTF-8 files.
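The class of bug being described can be sketched as follows, assuming a hypothetical `config.json` that is known to be saved as UTF-8 (the helper name `load_json_utf8` is also an assumption). Naming the encoding makes the read independent of the machine's locale:

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file that is known to be UTF-8, regardless of locale."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    # ensure_ascii=False keeps "é" as real UTF-8 bytes instead of a \u escape.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"name": "café"}, f, ensure_ascii=False)
    loaded = load_json_utf8(path)
```

With a bare `open(path)`, the same file would be decoded with the locale encoding, which on many Windows installations is a legacy code page rather than UTF-8.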
It does not seem to be a win when the locale is something incompatible with utf-8, such as Latin-1, or whatever is still common in Japan. The surrogate-escape mechanism allows a proper round-trip, but python itself will stop processing the characters correctly.
The surrogate-escape mechanism is not related to this PEP.
For interactive use, when talking to another program (such as a terminal) instead of an already existing file, the backwards compatibility problem seems worse.
This PEP is 100% backward compatible.
Changing the default to utf-8 (after a deprecation period showing how to make locale an explicit default) may be reasonable, but claiming that it is backwards compatible ... I didn't get that impression from the PEP.
This PEP doesn't propose to change the default encoding. *If* we decide to change the default encoding in the future (maybe 2025 or later) and start emitting DeprecationWarning where the `encoding` option is omitted, this PEP helps in two ways:
* The `encoding="locale"` option can be used since Python 3.10, and
* The number of DeprecationWarnings shown is reduced, because we can add `encoding="utf-8"` in many places before that time. At least, we can fix all EncodingWarnings in the stdlib.
Maybe the "Prepare to change the default encoding to UTF-8" section is misleading. I will try to fix the section or remove it.
--
Inada Naoki <songofacandy@gmail.com>
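A sketch of how the `encoding="locale"` spelling could be used while older Pythons are still supported. The helper name `locale_encoding_arg` is hypothetical, and the version check relies only on the statement above that the option is available from Python 3.10:

```python
import os
import sys
import tempfile


def locale_encoding_arg():
    """Return the value to pass as `encoding` to request the locale encoding."""
    # "locale" is accepted from Python 3.10 on; older versions only
    # understand None as "use the locale encoding".
    return "locale" if sys.version_info >= (3, 10) else None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    with open(path, "w", encoding=locale_encoding_arg()) as f:
        f.write("hello\n")
    with open(path, encoding=locale_encoding_arg()) as f:
        content = f.read()
```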

On Thu, Feb 11, 2021 at 7:35 PM Inada Naoki <songofacandy@gmail.com> wrote:
I think this is where we have been talking past each other. You seem to be assuming that the programmer knows the correct encoding, presumably because they (or their program) wrote it. You then assume that they neglected to mention the encoding out of forgetfulness, perhaps because on their system, everything is always UTF-8. This clearly does happen, but the people who would make this mistake most often -- they probably wouldn't think to test their code under a special mode that catches only this. (They might run a linter that looked for all sorts of problems, including this.) I instead assume that the programmer really doesn't know the encoding, because the file is supplied by the user. (The user may not know either, since it is really supplied by some other program, but ... neither python nor the programmer knows for sure.) In this case, the warning is not just a false alarm, but is actively misleading. -jJ

On Fri, Feb 12, 2021 at 12:45 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Not always, but many times.
Some Python experts can write `export PYTHONWARNENCODING=1` in their .bashrc. They can find such mistakes not only in their own code but also in the libraries they use. Since they are experts, they can understand the warning and report it to the library author correctly. So this option helps library authors even if they don't use it themselves.
This option is opt-in. People who don't understand what this warning means should not opt in to it.
Regards,
--
Inada Naoki <songofacandy@gmail.com>
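A sketch of what opting in looks like in practice. The thread uses the draft name `PYTHONWARNENCODING`; the example below assumes the spelling that shipped in Python 3.10, namely the `-X warn_default_encoding` command-line option (with `PYTHONWARNDEFAULTENCODING` as the environment variable). The function name `child_emits_encoding_warning` is hypothetical:

```python
import os
import subprocess
import sys
import tempfile


def child_emits_encoding_warning():
    """Run a child interpreter with the opt-in flag and check its stderr.

    Returns None on Python versions that predate the option.
    """
    if sys.version_info < (3, 10):
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("hello\n")
        # The child opens the file without an encoding argument.
        code = "import sys; open(sys.argv[1]).close()"
        result = subprocess.run(
            [sys.executable, "-X", "warn_default_encoding", "-c", code, path],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
        )
    return "EncodingWarning" in result.stderr


print(child_emits_encoding_warning())
```

Without the flag the same child process prints no such warning, which is the opt-in behaviour being described.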

participants (5): Anders Munch, Inada Naoki, Jim J. Jewett, Paul Moore, Victor Stinner