Re: PEP 597: Add optional EncodingWarning
On Wed, Feb 10, 2021 at 1:46 AM Anders Munch <ajm@flonidan.dk> wrote:
How about swapping around "locale" and None?
Inada Naoki <songofacandy@gmail.com> wrote:
I thought about it, but it may not work. Consider a function like this:
```
def read_text(self, encoding=None):
    with open(self._filename, encoding=encoding) as f:
        return f.read()
```
If `encoding=None` suppresses the warning, functions like this would never warn.
I don't see why they should be. The author clearly knew about the encoding argument to open, they clearly intended for a None value to be given in some cases, and at the time of writing None meant to use a locale-dependent encoding.
We are not discussing changing the default encoding for now.
The section "Prepare to change the default encoding to UTF-8" gave me the impression that this was meant as a stepping stone on the way to doing just that. If that was not the intention, my apologies for the misread.

regards, Anders
On Wed, Feb 10, 2021 at 11:58 PM Anders Munch <ajm@flonidan.dk> wrote:
On Wed, Feb 10, 2021 at 1:46 AM Anders Munch <ajm@flonidan.dk> wrote:
How about swapping around "locale" and None?
Inada Naoki <songofacandy@gmail.com> wrote:
I thought about it, but it may not work. Consider a function like this:
```
def read_text(self, encoding=None):
    with open(self._filename, encoding=encoding) as f:
        return f.read()
```
If `encoding=None` suppresses the warning, functions like this would never warn.
I don't see why they should be. The author clearly knew about the encoding argument to open, they clearly intended for a None value to be given in some cases, and at the time of writing None meant to use a locale-dependent encoding.
It is not clear. The author may just want to "use the same default encoding as open()". If so, the caller of the function should be warned. To warn the caller, this function can use `encoding=io.text_encoding(encoding)` as described in the PEP.
We are not discussing changing the default encoding for now.
The section "Prepare to change the default encoding to UTF-8" gave me the impression that this was meant as a stepping stone on the way to doing just that. If that was not the intention, my apologies for the misread.
This *can* be a stepping stone, but it is not the first goal. This PEP doesn't discourage omitting the encoding option anytime soon when the user really needs to use the locale encoding.

Default encoding is used for:

a. Really need to use locale specific encoding
b. UTF-8 (bug. not work on Windows)
c. ASCII (not a bug, but slow on Windows)

I assume most usages are (b) and (c). This PEP can reduce them soon.

If we decide to change the default encoding in the future, we will need to warn about omitting the encoding option. Reducing (b) and (c) will reduce the total number of warnings shown in the future. This is what "Prepare" means.

Additionally, `encoding="locale"` will be a backward/forward compatible way to use the locale-specific encoding when we decide to change the default encoding. So this PEP can be a very important stepping stone.

On the other hand, it is not a problem that we cannot use `encoding="locale"` in backward-compatible code *for now*. Python 3.9 becomes EOL in 2025. We won't emit a warning for the default encoding until then. People can use `encoding="locale"` after they drop Python 3.9 support. No problem.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
Inada Naoki wrote:
Default encoding is used for:
a. Really need to use locale specific encoding
b. UTF-8 (bug. not work on Windows)
c. ASCII (not a bug, but slow on Windows)
I assume most usages are (b) and (c). This PEP can reduce them soon.
Is this just an assumption, based on those times being visible to someone who installs a lot of packages, or has the use of any locale other than UTF-8 and ASCII really gone down a lot? Have browsers stopped using charset sniffing?
Additionally, encoding="locale" will be backward/forward compatible
What would be the problem with changing the default from None to locale? (I think you mentioned that they are the same 99% of the time; is that other 1% likely to be cases where locale is wrong but None is right? Would there be a better way to represent that 1%?)

-jJ
On Fri, Feb 12, 2021 at 5:18 AM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Inada Naoki wrote:
Default encoding is used for:
a. Really need to use locale specific encoding
b. UTF-8 (bug. not work on Windows)
c. ASCII (not a bug, but slow on Windows)
I assume most usages are (b) and (c). This PEP can reduce them soon.
Is this just an assumption, based on those times being visible to someone who installs a lot of packages, or has the use of any locale other than UTF-8 and ASCII really gone down a lot? Have browsers stopped using charset sniffing?
Using "most" was my fault. I am not good at English. I should have used "many" here. You can see many bugs caused by not specifying `encoding="utf-8"` on Q&A sites. I wrote some numbers about these common bugs in the PEP.

UTF-8 is used for 96.3% of web sites [1], although browsers still use charset sniffing. But how does that relate to this PEP?

[1] https://w3techs.com/technologies/details/en-utf8
Additionally, encoding="locale" will be backward/forward compatible
What would be the problem with changing the default from None to locale?
It doesn't work on Python 3.9 and earlier. So using `encoding="locale"` is not recommended until the user drops Python 3.9 support.
(I think you mentioned that they are the same 99% of the time; is that other 1% likely to be cases where locale is wrong but None is right? Would there be a better way to represent that 1%?)
`encoding="locale"` and `encoding=None` have the same behavior, except that `encoding="locale"` doesn't emit EncodingWarning even when the warning is opted in.

There is little difference between `encoding=None` and `encoding=locale.getpreferredencoding(False)`. The difference is:

* When Python is running on Windows, and
* When the file is a console, and
* (for open()) When PYTHONLEGACYWINDOWSSTDIO is set
* (for TextIOWrapper()) When the file is not _WindowsConsoleIO

encoding=None uses the console codepage but encoding=locale.getpreferredencoding(False) uses the ANSI codepage.

Otherwise, encoding=None and encoding=locale.getpreferredencoding(False) are the same.

So `encoding=locale.getpreferredencoding(False)` can be used to specify the locale-specific encoding explicitly. But this PEP doesn't recommend it. This PEP recommends using EncodingWarning just for finding a missing `encoding="utf-8"` (or any other specific encoding).

--
Inada Naoki <songofacandy@gmail.com>
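For an ordinary file (not a console), the equivalence described above can be checked directly. This is only a sketch: the helper name `default_and_explicit_encodings` is made up for illustration, and `codecs.lookup()` is used to normalize spelling differences such as `UTF-8` versus `utf-8`.

```python
import codecs
import locale


def default_and_explicit_encodings(path):
    """Return the normalized encoding names open() picks for a regular file."""
    # encoding=None: let open() choose its default.
    with open(path, encoding=None) as f:
        default_name = f.encoding
    # The explicit spelling discussed in this thread.
    with open(path, encoding=locale.getpreferredencoding(False)) as f:
        explicit_name = f.encoding
    # Normalize aliases so the two names can be compared reliably.
    return codecs.lookup(default_name).name, codecs.lookup(explicit_name).name
```

On a regular file the two names are expected to match; the Windows console cases listed above are where they can differ.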
On 2/11/21, Inada Naoki <songofacandy@gmail.com> wrote:
There is little difference between `encoding=None` and `encoding=locale.getpreferredencoding(False)`. The difference is:
* When Python is running on Windows, and
* When the file is a console, and
* (for open()) When PYTHONLEGACYWINDOWSSTDIO is set
* (for TextIOWrapper()) When the file is not _WindowsConsoleIO
encoding=None uses console codepage but
os.device_encoding() -- i.e. _Py_device_encoding() -- only works for hard-coded file descriptors 0, 1, and 2, instead of detecting a console file. So opening "CON", "CONIN$", or "CONOUT$" has never used the console input or output code page, nor has opening a duped standard I/O fd such as open(os.dup(0)). It would be easy to generalize _Py_device_encoding() to detect console files, but it's new behavior.

Python 3.8+ introduced a bug (issue 42261) in which, even with legacy standard I/O enabled and file descriptors 0-2, the console input and output code pages are ignored. For example:

    C:\>chcp 437
    Active code page: 437

    C:\>set PYTHONLEGACYWINDOWSSTDIO=1

    C:\>py -3.9 -c "import sys; print(sys.stdout.encoding)"
    cp1252

Regarding the last bullet point, io.TextIOWrapper doesn't know anything about io._WindowsConsoleIO. The decision to use UTF-8 is in io.open(). So manually wrapping a _WindowsConsoleIO file with TextIOWrapper uses the locale preferred encoding instead of UTF-8. For example:

    >>> fb = open('conin$', 'rb')
    >>> fb.raw
    <_io._WindowsConsoleIO mode='rb' closefd=True>
    >>> f = io.TextIOWrapper(fb)
    >>> f.encoding
    'cp1252'

I don't know whether it's worth making TextIOWrapper check for _WindowsConsoleIO in order to make it use UTF-8. It's not common to manually wrap a binary-mode file.
(I apologize if my summaries distort what Inada Naoki <songofacandy@gmail.com> explained.)

He said that some people use the default None when they really want either UTF-8 or ASCII.

My concern is that the warning will be a false alarm if they really do need whatever locale returns, and that case may still be common. (If web browsers had stopped bothering to sniff for other charsets, then maybe that situation really was getting rare.)

I asked when encoding=None is actually different from encoding=locale, currently spelled encoding=locale.getpreferredencoding(False)

They can be different on Windows console, presumably because the environment settings that control locale may differ from the charset actually used by the console. Even then, it only differs for open() when PYTHONLEGACYWINDOWSSTDIO is set, and for TextIOWrapper() when the file is not _WindowsConsoleIO

To me, that sounds narrow enough to be a Windows issue, rather than an issue with open.

Is there some way to write an encoding that sniffs for charsets, particularly on Windows, and to use that as the default instead of assuming that locale will be correct?

-jJ
On Fri, Feb 12, 2021 at 12:28 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
(I apologize if my summaries distort what Inada Naoki <songofacandy@gmail.com> explained.)
He said that some people use the default None when they really want either UTF-8 or ASCII.
Yes. Even Python core developers. For example:
https://bugs.python.org/issue33684
This is just one example. I have seen a lot of code using the default encoding to read JSON, YAML, TOML, Markdown, etc...
My concern is that the warning will be a false alarm if they really do need whatever locale returns, and that case may still be common. (If web browsers had stopped bothering to sniff for other charsets, then maybe that situation really was getting rare.)
That's one of the reasons why this warning is opt-in, like BytesWarning.
I asked when encoding=None is actually different from encoding=locale, currently spelled encoding=locale.getpreferredencoding(False)
I don't understand this sentence. This PEP proposes `encoding="locale"`, which is equal to encoding=None but doesn't emit EncodingWarning. There was a discussion about the difference between `encoding=None` and `encoding=locale.getpreferredencoding(False)` in this thread.
They can be different on Windows console, presumably because the environment settings that control locale may differ from the charset actually used by the console. Even then, it only differs for open() when PYTHONLEGACYWINDOWSSTDIO is set, and for TextIOWrapper() When the file is not _WindowsConsoleIO
To me, that sounds narrow enough to be a windows issue, rather than an issue with open.
Yes. So if users want to specify the locale-specific encoding and don't want to drop Python 3.9 support, they can use encoding=locale.getpreferredencoding(False). But this PEP doesn't recommend it. Third party libraries can use `encoding="locale"` after they drop Python 3.9 support.
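For code that must still run on Python 3.9 and earlier, the two spellings can be selected by version. This is an illustrative sketch only, not something the PEP recommends; `locale_encoding_arg` is a hypothetical helper name, and the version check assumes `encoding="locale"` is accepted from Python 3.10 on.

```python
import locale
import sys


def locale_encoding_arg():
    """Return a value for open(encoding=...) that means "the locale encoding"."""
    if sys.version_info >= (3, 10):
        # Explicit opt-in to the locale encoding; does not trigger EncodingWarning.
        return "locale"
    # Fallback for older versions, where "locale" is not a recognized value.
    return locale.getpreferredencoding(False)
```

A caller would then write `open(path, encoding=locale_encoding_arg())` and drop the helper once Python 3.9 support is gone.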
Is there some way to write an encoding that sniffs for charsets, particularly on windows, and to use that as the default instead of assuming that locale will be correct?
-jJ
There is no reliable way, AFAIK.
--
Inada Naoki <songofacandy@gmail.com>
Offering encoding="locale" (or open.locale or ... ) instead of a long function call using False (locale.getpreferredencoding(False)) seems like a win for Explicit is Better Than Implicit. It would then be possible to say "yeah, locale really is what I meant".

Err... unless the charset determination is so tricky that it ends up just adding another not-quite-right near-but-not-exact-synonym.

Adding a new Warning subclass, and maybe a new warning type, and maybe a new environment variable, and maybe a new launch flag ... these all seem to risk just making things more complicated without sufficient gain.

Would a recipe for site-packages be sufficient, or does this need to run too early in the bootstrapping process?

-jJ
On Sat, Feb 13, 2021 at 4:53 AM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Offering encoding="locale" (or open.locale or ... ) instead of a long function call using False (locale.getpreferredencoding(False)) seems like a win for Explicit is Better Than Implicit. It would then be possible to say "yeah, locale really is what I meant".
Err... unless the charset determination is so tricky that it ends up just adding another not-quite-right near-but-not-exact-synonym.
Adding a new Warning subclass, and maybe a new warning type, and maybe a new environment variable, and maybe a new launch flag ... these all seem to risk just making things more complicated without sufficient gain.
Would a recipe for site-packages be sufficient, or does this need to run too early in the bootstrapping process?
-jJ
What does "a recipe for site-packages" mean?
--
Inada Naoki <songofacandy@gmail.com>
In the documentation (not sure whether it should be the documentation for "open" or for encoding), include at least a link to instructions on how to (try to) verify that your codebase is using the encoding parameter properly. Those instructions would say something like "Add the following lines to the end of Lib\site.py:

    _origopen=open
    def open(...):
        if ...
            warnings.warn(...)
        _origopen(...)
"

-jJ

On Fri, Feb 12, 2021 at 6:28 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sat, Feb 13, 2021 at 4:53 AM Jim J. Jewett <jimjjewett@gmail.com> wrote:
Offering encoding="locale" (or open.locale or ... ) instead of a long function call using False (locale.getpreferredencoding(False)) seems like a win for Explicit is Better Than Implicit. It would then be possible to say "yeah, locale really is what I meant".
Err... unless the charset determination is so tricky that it ends up just adding another not-quite-right near-but-not-exact-synonym.
Adding a new Warning subclass, and maybe a new warning type, and maybe a new environment variable, and maybe a new launch flag ... these all seem to risk just making things more complicated without sufficient gain.
Would a recipe for site-packages be sufficient, or does this need to run too early in the bootstrapping process?
-jJ
What does "a recipe for site-packages" mean?
-- Inada Naoki <songofacandy@gmail.com>
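The site.py recipe suggested above could be fleshed out roughly as follows. This is a sketch of the idea rather than the PEP's mechanism: the names `warning_open` and `install` are hypothetical, the warning category is a plain `UserWarning`, and the check only looks at the mode string and the `encoding` argument.

```python
import builtins
import warnings

_orig_open = builtins.open


def warning_open(file, mode="r", buffering=-1, encoding=None, *args, **kwargs):
    """Wrapper around open() that warns when text mode is used without an encoding."""
    if "b" not in mode and encoding is None:
        # stacklevel=2 attributes the warning to the code that called open().
        warnings.warn(
            "open() called in text mode without an explicit encoding",
            UserWarning,
            stacklevel=2,
        )
    return _orig_open(file, mode, buffering, encoding, *args, **kwargs)


def install():
    """Replace the built-in open(); intended to be called from a site hook."""
    builtins.open = warning_open
```

One limitation of this approach is that it cannot tell a deliberate use of the locale encoding from an accidental one, and it does not cover other entry points such as direct use of `io.TextIOWrapper`.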
participants (4)
- Anders Munch
- Eryk Sun
- Inada Naoki
- Jim J. Jewett