Adding an `open_text()` builtin function (related to PEP 597)
Hi, all.

I am rewriting PEP 597 to introduce a new EncodingWarning, a subclass of DeprecationWarning used to warn about the future change of the default encoding. But I don't think we can change the default encoding of `io.TextIOWrapper` and the built-in `open()` anytime soon. It is a disruptive change; it may take 10 or more years. To ease the pain caused by "the default encoding is not UTF-8 (almost) only on Windows" (*), I came up with another idea. This idea is not mutually exclusive with PEP 597, but I want to include it in the PEP because both ideas use EncodingWarning.

(*) Imagine that a new Python user writes a text file with notepad.exe (whose default encoding is already UTF-8 without BOM) or VS Code, and tries to read it in a Jupyter Notebook. They will see a UnicodeDecodeError. They might not know what an encoding is yet.

## 1. Add `io.open_text()`, builtin `open_text()`, and `pathlib.Path.open_text()`.

All functions are the same as `io.open()` or `Path.open()`, except:

* The default encoding is "utf-8".
* "b" is not allowed in the mode option.

These functions have two benefits:

* `open_text(filename)` is shorter than `open(filename, encoding="utf-8")`. It is easy to type, especially with autocompletion.
* The type annotation for the returned value is simpler than for `open`: it is always TextIOWrapper.

## 2. Change the default encoding of `pathlib.Path.read_text()`.

For convenience and consistency with `Path.open_text()`, change the default encoding of `Path.read_text()` to "utf-8" with a regular deprecation period:

* Python 3.10: `Path.read_text()` emits EncodingWarning when the encoding option is omitted.
* Python 3.13: `Path.read_text()` changes the default encoding to "utf-8".

If PEP 597 is accepted, users can pass `encoding="locale"` instead of `encoding=locale.getpreferredencoding(False)` when they need to use the locale encoding.

We might change more places where the default encoding is used, but that should be done slowly and carefully.

---

What do you think about this idea? Is it worth adding a new built-in function?

Regards,
-- Inada Naoki <songofacandy@gmail.com>
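(For concreteness, here is a minimal sketch of what the proposed wrapper could look like; the name, signature, and behavior are only illustrative, not a final design.)

```python
import io


def open_text(file, mode="r", *, errors=None, newline=None):
    """Illustrative sketch of the proposed builtin: always UTF-8, text only."""
    if "b" in mode:
        raise ValueError("binary mode is not supported by open_text()")
    return io.open(file, mode, encoding="utf-8", errors=errors, newline=newline)


# Equivalent to open("notes.txt", encoding="utf-8"):
# with open_text("notes.txt") as f:
#     print(f.read())
```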
On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki <songofacandy@gmail.com> wrote:
## 1. Add `io.open_text()`, builtin `open_text()`, and `pathlib.Path.open_text()`.
All functions are the same as `io.open()` or `Path.open()`, except:
* Default encoding is "utf-8".
* "b" is not allowed in the mode option.
I *really* don't like this, because it implies that open() will open in binary mode.
What do you think about this idea? Is it worth adding a new built-in function?
Highly dubious. I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.

What exactly are the blockers on making open(fn) use UTF-8 by default? Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?

ChrisA
On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico <rosuav@gmail.com> wrote:
Highly dubious. I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward compatibility. That's what PEP 597 tries to solve.

1. Add an optional warning for `open()` calls that don't specify the `encoding` option. (PEP 597)
2. (Several years later) Make the warning enabled by default.
3. (Several years later) Change the default encoding.

When (2) happens, users are forced to write `encoding="utf-8"` to suppress the warning. But note that the default encoding is already "utf-8" on (most) Linux including WSL, macOS, iOS, and Android. And Windows users can read ASCII text files without specifying `encoding`, regardless of whether the default encoding is a legacy codec or "utf-8". So adding `, encoding="utf-8"` everywhere `open()` is used might be a tedious job.

On the other hand, if we add `open_text()`:

* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without waiting 10 years.
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The ultimate goal is to make "utf-8" the default. But I don't know when we can change it, so I focus on what we can do in the near future (< 5 years, I hope).

Regards,
-- Inada Naoki <songofacandy@gmail.com>
On Sat, Jan 23, 2021 at 9:04 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico <rosuav@gmail.com> wrote:
Highly dubious. I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward compatibility. That's what PEP 597 tries to solve.
1. Add an optional warning for `open()` calls that don't specify the `encoding` option. (PEP 597)
2. (Several years later) Make the warning enabled by default.
3. (Several years later) Change the default encoding.
When (2) happens, users are forced to write `encoding="utf-8"` to suppress the warning.
But note that the default encoding is already "utf-8" on (most) Linux including WSL, macOS, iOS, and Android. And Windows users can read ASCII text files without specifying `encoding`, regardless of whether the default encoding is a legacy codec or "utf-8". So adding `, encoding="utf-8"` everywhere `open()` is used might be a tedious job.
Okay, but this (a) has a good end goal, and (b) is only backward-incompatible with its default - adding the encoding parameter makes your code compatible with all versions of Python.
On the other hand, if we add `open_text()`:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without waiting 10 years.
But this has a far worse end goal - two open functions with subtly incompatible defaults, and a big question of "why should I choose this over that". And if you start using open_text, suddenly your code won't work on older Pythons.
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The ultimate goal is to make "utf-8" the default. But I don't know when we can change it, so I focus on what we can do in the near future (< 5 years, I hope).
Okay. If the goal is to make UTF-8 the default, may I request that PEP 597 say so, please? With a heading of "deprecation", it's not really clear what its actual goal is.

From the sound of things - and it's still possible I'm misreading PEP 597, my apologies if so - this open_text function wouldn't really solve anything much, and the original goal of "change the default encoding to UTF-8" is better served by 597.

ChrisA
On Sat, Jan 23, 2021 at 7:13 PM Chris Angelico <rosuav@gmail.com> wrote:
On the other hand, if we add `open_text()`:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without waiting 10 years.
But this has a far worse end goal - two open functions with subtly incompatible defaults, and a big question of "why should I choose this over that". And if you start using open_text, suddenly your code won't work on older Pythons.
Yes, there are cons too. That's why I posted this thread before including the idea in the PEP. Thank you for your feedback.
The ultimate goal is to make "utf-8" the default. But I don't know when we can change it, so I focus on what we can do in the near future (< 5 years, I hope).
Okay. If the goal is to make UTF-8 the default, may I request that PEP 597 say so, please? With a heading of "deprecation", it's not really clear what its actual goal is.
No, I avoid it intentionally. I am making the PEP useful even if we cannot change the default encoding. The PEP can be discussed without deciding whether we can change the default encoding or not. Please read the first motivation section in the PEP:

https://www.python.org/dev/peps/pep-0597/#using-the-default-encoding-is-a-co...

Regards,
-- Inada Naoki <songofacandy@gmail.com>
On 2021-01-23 10:11, Chris Angelico wrote: [snip]
Okay. If the goal is to make UTF-8 the default, may I request that PEP 597 say so, please? With a heading of "deprecation", it's not really clear what its actual goal is.
From the sound of things - and it's still possible I'm misreading PEP 597, my apologies if so - this open_text function wouldn't really solve anything much, and the original goal of "change the default encoding to UTF-8" is better served by 597.
I use Windows and I switched to UTF-8 years ago. However, the standard on Windows is 'utf-8-sig', so I'd probably prefer it if the default when _reading_ was 'utf-8-sig'. (I'm not bothered about writing; I can still be explicit if I want 'utf-8-sig' for Windows-specific UTF-8 files.)
On Sat, Jan 23, 2021 at 09:11:27PM +1100, Chris Angelico wrote:
On the other hand, if we add `open_text()`:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without waiting 10 years.
But this has a far worse end goal - two open functions with subtly incompatible defaults, and a big question of "why should I choose this over that".
It has an easy answer:

- Are you opening a text file and you don't know about or want to deal with encodings? Use `open_text`.
- Otherwise, use `open`.

I think that if we moved to an open_text() builtin, it should have the simplest possible signature:

    open_text(filename, mode='r')

If you care about anything beyond that, use `open`.
And if you start using open_text, suddenly your code won't work on older Pythons.
"Using older Pythons" is mostly a concern for library maintainers, not beginners. A few years from now, Python 3.10 will be the oldest version the great majority of beginners will care about, and 3.9 will be as irrelevant to them as 3.4 is to us today. Library maintainers always have to deal with the issue of not being able to use the newest functionality, it doesn't prevent us from adding new functionality. -- Steve
On Mon, Jan 25, 2021 at 4:42 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 09:11:27PM +1100, Chris Angelico wrote:
On the other hand, if we add `open_text()`:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.

So `open_text()` can provide a better developer experience, without waiting 10 years.
But this has a far worse end goal - two open functions with subtly incompatible defaults, and a big question of "why should I choose this over that".
It has an easy answer:
- Are you opening a text file and you don't know about or want to deal with encodings? Use `open_text`.
- Otherwise, use `open`.
I think that if we moved to an open_text() builtin, it should have the simplest possible signature:
open_text(filename, mode='r')
If you care about anything beyond that, use `open`.
And if you start using open_text, suddenly your code won't work on older Pythons.
"Using older Pythons" is mostly a concern for library maintainers, not beginners. A few years from now, Python 3.10 will be the oldest version the great majority of beginners will care about, and 3.9 will be as irrelevant to them as 3.4 is to us today.
Library maintainers always have to deal with the issue of not being able to use the newest functionality, it doesn't prevent us from adding new functionality.
Older Pythons may be easy to drop, but I'm not so sure about older unofficial docs. The open() function is very popular and there must be millions of blog posts with examples using it, most of them reading text files (written by bloggers naive in Python but good at SEO). I would be very sad if the official recommendation had to become "[for the most common case] avoid open(filename), use open_text(filename)".

BTW remind me what open_text() would do? How would it differ from open() with the same arguments? That's too many messages back.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum <guido@python.org> wrote:
Older Pythons may be easy to drop, but I'm not so sure about older unofficial docs. The open() function is very popular and there must be millions of blog posts with examples using it, most of them reading text files (written by bloggers naive in Python but good at SEO).
I would be very sad if the official recommendation had to become "[for the most common case] avoid open(filename), use open_text(filename)".
I agree with that. But until we switch the default encoding of open(), we must recommend avoiding `open(filename)` anyway. The default encoding of VS Code, Atom, and Notepad is already UTF-8.

Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.

(*) https://docs.python.org/3.10/tutorial/inputoutput.html#reading-and-writing-f...
BTW remind me what open_text() would do? How would it differ from open() with the same arguments? That's too many messages back.
The current proposal is "open_utf8()". The differences from open() are:

* There is no encoding parameter. It always uses "utf-8". (*)
* "b" is not allowed in the mode. (*)

(*) Another option is to use "utf-8-sig" for reading and "utf-8" for writing. But it has some drawbacks: utf-8-sig has overhead because it is a wrapper implemented in Python, TextIOWrapper has fast paths for utf-8 but not for utf-8-sig, and "utf-8-sig" may not be as well tested as "utf-8".

Regards,
-- Inada Naoki <songofacandy@gmail.com>
On Mon, Jan 25, 2021 at 8:51 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum <guido@python.org> wrote:

Older Pythons may be easy to drop, but I'm not so sure about older unofficial docs. The open() function is very popular and there must be millions of blog posts with examples using it, most of them reading text files (written by bloggers naive in Python but good at SEO).

I would be very sad if the official recommendation had to become "[for the most common case] avoid open(filename), use open_text(filename)".

I agree with that. But until we switch the default encoding of open(), we must recommend avoiding `open(filename)` anyway. The default encoding of VS Code, Atom, and Notepad is already UTF-8.
Maybe we're overthinking this - do we really need to recommend avoiding `open(filename)` in all cases? Isn't it just fine to use if `locale.getpreferredencoding(False)` is UTF-8, since in that case there won't be any change in behavior when `open` switches from the old, locale-specific default to the new, always-UTF-8 default?

If that's the case, then it would be less of a backwards incompatibility issue, since most production environments will already be using UTF-8 as the locale (by virtue of it being the norm on Unix systems and servers). And if that's the case, all we need is a warning that is raised conditionally when open() is called in text mode without an explicit encoding while the system locale is not UTF-8, and that warning can say something like:

    Your system is currently configured to use shift_jis for text files.
    Beginning in Python 3.13, open() will always use utf-8 for text files
    instead. For compatibility with future Python versions, pass open()
    the extra argument: encoding="shift_jis"

~Matt
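(A rough sketch of the conditional check described above; the helper name, warning category, and wording are assumptions for illustration, not anything the PEP specifies.)

```python
import locale
import warnings


def _warn_if_locale_is_not_utf8(stacklevel=3):
    # Hypothetical helper: warn only when the locale encoding differs from
    # the future UTF-8 default, so UTF-8 systems never see the warning.
    current = locale.getpreferredencoding(False)
    if current.lower().replace("-", "").replace("_", "") != "utf8":
        warnings.warn(
            f'Your system is currently configured to use {current} for text '
            f'files. A future Python version will use utf-8 instead. For '
            f'compatibility, pass open() the extra argument: encoding="{current}"',
            DeprecationWarning,
            stacklevel=stacklevel,
        )
```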
On Mon, Jan 25, 2021 at 5:49 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Tue, Jan 26, 2021 at 10:22 AM Guido van Rossum <guido@python.org> wrote:
Older Pythons may be easy to drop, but I'm not so sure about older unofficial docs. The open() function is very popular and there must be millions of blog posts with examples using it, most of them reading text files (written by bloggers naive in Python but good at SEO).

I would be very sad if the official recommendation had to become "[for the most common case] avoid open(filename), use open_text(filename)".
I agree with that. But until we switch the default encoding of open(), we must recommend avoiding `open(filename)` anyway. The default encoding of VS Code, Atom, and Notepad is already UTF-8.

Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.
Telling people to always add `encoding='utf8'` makes much more sense to me than introducing a new function and telling them to do that.

-- 
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Tue, Jan 26, 2021 at 3:07 PM Guido van Rossum <guido@python.org> wrote:
I agree with that. But until we switch the default encoding of open(), we must recommend avoiding `open(filename)` anyway. The default encoding of VS Code, Atom, and Notepad is already UTF-8.

Maybe we need to update the tutorial (*) to use `encoding="utf-8"`.
Telling people to always add `encoding='utf8'` makes much more sense to me than introducing a new function and telling them to do that.
OK, I will not add open_utf8() to PEP 597, and will update the tutorial to recommend `encoding="utf-8"`.

-- Inada Naoki <songofacandy@gmail.com>
Hello, On Sat, 23 Jan 2021 19:04:08 +0900 Inada Naoki <songofacandy@gmail.com> wrote:
On Sat, Jan 23, 2021 at 10:47 AM Chris Angelico <rosuav@gmail.com> wrote:
Highly dubious. I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward compatibility. That's what PEP 597 tries to solve.
1. Add an optional warning for `open()` calls that don't specify the `encoding` option. (PEP 597)
2. (Several years later) Make the warning enabled by default.
3. (Several years later) Change the default encoding.
When (2) happens, users are forced to write `encoding="utf-8"` to suppress the warning.
But note that the default encoding is already "utf-8" on (most) Linux including WSL, macOS, iOS, and Android. And Windows users can read ASCII text files without specifying `encoding`, regardless of whether the default encoding is a legacy codec or "utf-8". So adding `, encoding="utf-8"` everywhere `open()` is used might be a tedious job.
On the other hand, if we add `open_text()`:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
How is it easier, if "open_text" exists only in imagination, while encoding="utf-8" has been there all this time? The only thing easier than adding 'encoding="utf-8"' would be:

1. Just go ahead and switch the default encoding to utf-8 right away.
2. For backward compatibility, add a "python3 --backward-compatibility" switch. Perhaps even tell users to use it straight in the UnicodeDecodeError traceback.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.
Let's also add max_int(), min_int(), max_float(), min_float() builtins. Teachers can teach that if you need to min ints, then to use min_int(), if you need to min floats, then to use min_float(), and otherwise, use min(). Bonus point: max_int(), min_int(), max_float(), min_float() are all easier to annotate.
So `open_text()` can provide a better developer experience, without waiting 10 years.
Except that in 10 years, when the default encoding is finally changed, open_text() is a useless function, which now needs to be deprecated and all the fun process repeated again.

[]

-- 
Best regards,
Paul mailto:pmiscml@gmail.com
On Sat, Jan 23, 2021 at 7:31 PM Paul Sokolovsky <pmiscml@gmail.com> wrote:
* Replacing open with open_text is easier than adding `, encoding="utf-8"`.
How is it easier, if "open_text" exists only in imagination, while encoding="utf-8" has been there all this time?
Note that the warning will not be enabled by default anytime soon. If we decide to change the default encoding and enable the EncodingWarning by default in Python 3.15, users can use `open_text()` for 3.10~3.15. That will be enough backward compatibility for most users.
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.
Let's also add max_int(), min_int(), max_float(), min_float() builtins.
It is off-topic. Please don't compare apples and oranges.
So `open_text()` can provide a better developer experience, without waiting 10 years.
Except that in 10 years, when the default encoding is finally changed, open_text() is a useless function, which now needs to be deprecated and all the fun process repeated again.
Yes, if we can change the default encoding in 2030, having two open functions will become messy. But there is no promise of that change. Without mitigating the pain, we can never change the default encoding.

Anyway, thank you for your feedback. Two people prefer `encoding="utf-8"` to `open_text()`. I will wait for feedback from more people before updating PEP 597.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
On Sat, Jan 23, 2021 at 01:31:28PM +0300, Paul Sokolovsky wrote:
* Teachers can teach students to use `open_text` to open text files. Students can use "utf-8" by default without knowing what an encoding is.
Let's also add max_int(), min_int(), max_float(), min_float() builtins. Teachers can teach that if you need to min ints, then to use min_int(), if you need to min floats, then to use min_float(), and otherwise, use min(). Bonus point: max_int(), min_int(), max_float(), min_float() are all easier to annotate.
Why would we need to do that? The proposed `open_text()` builtin solves an actual problem with opening files on one platform. Is there an equivalent issue with some platform where min() and max() misbehave by default with ints and floats? If not, then your analogy is invalid. If so, please raise a bug on the tracker. Adding this proposed `open_text` function does not require us to add multiple redundant functions that solve no problems.
So `open_text()` can provide a better developer experience, without waiting 10 years.
Except that in 10 years, when the default encoding is finally changed, open_text() is a useless function, which now needs to be deprecated and all the fun process repeated again.
It won't be useless. It will still work as well as it ever did, so it remains useful. It might be redundant, in which case we could deprecate it in the documentation and take no further action until Python 5000.

-- Steve
Chris Angelico writes:
On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki <songofacandy@gmail.com> wrote:
## 1. Add `io.open_text()`, builtin `open_text()`, and `pathlib.Path.open_text()`.
All functions are the same as `io.open()` or `Path.open()`, except:
* Default encoding is "utf-8".
I wonder if it might not be better to remove the encoding parameter for this version. Further comments below.
* "b" is not allowed in the mode option.
I *really* don't like this, because it implies that open() will open in binary mode.
I doubt that will be a common misunderstanding, as long as 'open_text' is documented as a convenience wrapper for 'open' aimed primarily at Windows programmers.
How do you think about this idea? Is this worth enough to add a new built-in function?
Highly dubious.
I won't go so far as "highly", but yeah, dubious to me. In my own environment, while I still see Shift JIS data quite a bit, the rule is that this or that correspondent sends it to me. While a lot of the University infrastructure used to default to Shift JIS, it now defaults to UTF-8. So I don't have a consistent rule by "kind of data", ie, which scripts use 'open_text' and which 'open'. If the script processes data from "JIS users", it needs to accept a command-line flag because other users *will* be sending that kind of data in UTF-8. Naoki's mileage may vary. See below for additional comments.
I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
I expect there are several bodies of users who will experience that as quite obnoxious for a long time to come. I *still* see a ton of stuff that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China gb18030 isn't just a good idea, it's the law. (OK, the precise statement of the law is "must support", not "must use", but my Chinese students all default to GB.)

The problem is that these users use some software that creates text in a national-language encoding by default and other software that uses UTF-8 by default. So I guess Naoki's hope is that "when I'm processing Microsoft/Oracle-generated data, I use 'open_text'; when it's local software, I use 'open'" becomes an easy and natural response in such environments.

We don't see very many Asian-language users on the python-* lists. We see a few more Russian users, I suspect quite a few Hebrew and Indic users, maybe a few Arabic users. So we should listen very carefully to the few we do have, since they come from tiny minorities of python-* subscribers.
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward incompatibility with just about every script in existence?
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The problem is that on Windows there are a lot of installations that continue to use non-UTF-8 encodings, enough that users set their preferred encoding that way. I guess that folks whose native-language alphabet is mostly drawn from ASCII are by now almost all using UTF-8 by default, but this is not so for East Asians (who almost all still use a mixture of several encodings every day, because email still often defaults to national standard encodings). I can't speak to Cyrillic, Hebrew, Arabic, or Indic languages, but I wouldn't be surprised if they're somewhere in the middle.

Naoki can document that "open(..., encoding='...')" is strongly preferred to 'open_text'. Maybe a better name is "open_utf8", to discourage people who want to use non-default encodings, or programmatically chosen encodings, in that function.

As someone who avoids Windows like the plague, I have no real sense of how important this is, and I like your argument from first principles. So on net, I guess I'm +/- 0, only because Naoki thinks it important enough to spend quite a bit of skull sweat and effort on this.

Steve
On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
I expect there are several bodies of users who will experience that as quite obnoxious for a long time to come. I *still* see a ton of stuff that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China gb18030 isn't just a good idea, it's the law. (OK, the precise statement of the law is "must support", not "must use", but my Chinese students all default to GB.)
But "UTF-8 as the default if you don't specify an encoding" doesn't stop you from using all those other encodings. The only change is that, if you don't specify an encoding, you get a cross-platform consistent default that can be easily described, rather than one which depends on system settings.
The problem is that these users use some software that creates text in a national-language encoding by default and other software that uses UTF-8 by default. So I guess Naoki's hope is that "when I'm processing Microsoft/Oracle-generated data, I use 'open_text'; when it's local software, I use 'open'" becomes an easy and natural response in such environments.
Exactly, so no single default will work. Is there an easy way to say open("filename", encoding="use my system default")? Currently encoding=None does that, and maybe that can be retained (just with the default becoming "utf-8"), or maybe some other keyword can be used. But that should cover the situations where you specifically *want* a platform-dependent selection.
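(For reference, the explicit spelling available today versus the shorthand PEP 597 proposes; the `encoding="locale"` form is only proposed here, not something current Python accepts.)

```python
import locale

# Explicit "use my system default" today:
f = open("data.txt", encoding=locale.getpreferredencoding(False))
f.close()

# Proposed shorthand under PEP 597 (not available yet):
# f = open("data.txt", encoding="locale")
```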
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward incompatibility with just about every script in existence?
Or for a large number of them, sudden cross-platform compatibility that they didn't previously have. This is *fixing a bug* for many scripts.
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The problem is that on Windows there are a lot of installations that continue to use non-UTF-8 encodings enough that users set their preferred encoding that way. I guess that folks where the majority of their native-language alphabet is drawn from ASCII are by now almost all using UTF-8 by default, but this is not so for East Asians (who almost all still use a mixture of several encodings every day because email still often defaults to national standard encodings). I can't speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be surprised if they're somewhere in the middle.
So Windows is being a pain in the behind, once again, because it doesn't move forward. File names on Mac OS and most Linux systems will be in UTF-8, regardless of your chosen language. Why stick to other encodings as the default? (I repeat: I am NOT advocating abolishing support for all other encodings. The ONLY thing I want to see is that UTF-8 becomes the default.)
Naoki can document that "open(..., encoding='...')" is strongly preferred to 'open_text'. Maybe a better name is "open_utf8", to discourage people who want to use non-default encodings, or programmatically chosen encodings, in that function.
TBH I don't think a separate built-in is of value here, but perhaps it'd be beneficial as an alternative to the wall-of-text help info that open() has. But I do rather like Random's and Steve's suggestion that the alternate function be specifically documented as magic.

It'd actually tie in very nicely with a change of default: open() does what it's explicitly told, and has cross-platform defaults, but open_sesame() probes the file to try to guess at its encoding, attempting to use a platform-specific eight-bit encoding if applicable. It'd "just work" for reading most text files, regardless of their source, as long as they came from this current computer. (All bets are off anyway if they came from some other system and are in an eight-bit encoding.)

ChrisA
On Sat, Jan 23, 2021 at 11:59:12PM +1100, Chris Angelico wrote:
So Windows is being a pain in the behind, once again, because it doesn't move forward.
*cough* That would be called "backwards compatibility" :-) Microsoft's attitude towards backwards compatibility is probably even stricter than ours.
File names on Mac OS and most Linux systems will be in UTF-8, regardless of your chosen language. Why stick to other encodings as the default?
Aren't we talking about the file *contents*, not the file names? The file name depends on the file system, not the OS.

On Mac OS, the file system used until High Sierra was HFS+, where file names are UTF-16. I expect there are still many Mac systems with HFS+ file systems. After High Sierra, the default file system shifted to APFS, which does use UTF-8.

Linux file systems such as ext4 store bytes. Any UTF-8 support is enforced by the desktop manager or shell, not the file system, and so can be subverted, either deliberately or accidentally (mojibake).

-- Steve
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
It might be worthwhile to be a little more sophisticated than this. Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
On Sat, Jan 23, 2021 at 2:43 PM Random832 <random832@fastmail.com> wrote:
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
it might be worthwhile to be a little more sophisticated than this.
Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
I meant that Notepad (and VS Code) use UTF-8 without a BOM when creating a new text file. Students learning Python cannot read it with `open()`.

-- Inada Naoki <songofacandy@gmail.com>
On Sat, Jan 23, 2021, at 05:06, Inada Naoki wrote:
On Sat, Jan 23, 2021 at 2:43 PM Random832 <random832@fastmail.com> wrote:
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
it might be worthwhile to be a little more sophisticated than this.
Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
I meant that Notepad (and VS Code) use UTF-8 without a BOM when creating a new text file. Students learning Python cannot read it with `open()`.
Right, I was simply suggesting it might be worthwhile to target "be able to open all files that notepad can open" as the goal rather than simply defaulting to UTF8-no-BOM only, which requires a little more sophistication than just a default encoding.
On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
it might be worthwhile to be a little more sophisticated than this.
Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
I like Random's idea. If we add a new "open text file" builtin function, we should seriously consider having it attempt to auto-detect the encoding. It need not be as sophisticated as `chardet`.

That auto-detection behaviour could be enough to differentiate it from the regular open(), thus solving the "but in ten years' time it will be redundant and will need to be deprecated" objection.

Having said that, I can't say I'm very keen on the name "open_text", but I can't think of any other bikeshed colour I prefer.

-- Steve
Steven D'Aprano writes:
On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
it might be worthwhile to be a little more sophisticated than this.
Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
I like Random's idea. If we add a new "open text file" builtin function, we should seriously consider having it attempt to auto-detect the encoding. It need not be as sophisticated as `chardet`.
It definitely should not be as sophisticated as chardet. Detection of ISO 8859, ISO 2022, and EUC family encodings is reliable as long as you know that only one of each family is going to be used. But you cannot easily tell which of the many ISO 8859 (also Windows-12xx) family encodings is present, and similarly for the other families.

I see very little use in detecting the BOMs. I haven't seen a UTF-16 BOM in the wild in a decade (as usual for me, that's Japan-specific, and may be limited to the academic community as well), and the UTF-8 BOM is a no-op if the default is UTF-8 anyway.

I'm definitely leaning toward the suggestion I made elsewhere (if it's adopted at all): force UTF-8, and name it 'open_utf8'.

Steve
On 23Jan2021 22:00, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
I see very little use in detecting the BOMs. I haven't seen a UTF-16 BOM in the wild in a decade (as usual for me, that's Japan-specific, and may be limited to the academic community as well), and the UTF-8 BOM is a no-op if the default is UTF-8 anyway.
I thought I'd seen them on Windows text files within the last year or so (I don't use Windows often, so this is happenstance from receiving some data, not an observation of the Windows ecosystem; my recollection is that it was a UTF-16 CSV file.)

But BOMs may be commonplace. This isn't a text file example, but the ISO 14496 standard (the basis for all MOV and MP4 files) has a text field type which may be UTF-16LE, UTF-16BE, or UTF-8, detected by a BOM of the right flavour for UTF-16, with no BOM implying UTF-8. I'm sure this is to accommodate easy writing by various systems.

I do not consider the BOM dead, and it is so cheap to recognise that not bothering to do so seems almost mean spirited.

Cheers,
Cameron Simpson <cs@cskk.id.au>
Cameron Simpson writes:
I thought I'd seen [UTF-16 BOM] on Windows text files within the last year or so (I don't use Windows often, so this is happenstance from receiving some data, not an observation of the Windows ecosystem; my recollection is that it was a UTF16 CSV file.)
OK; my experience is limited.
But BOMs may be commonplace. This isn't a text file example,
I don't care at all about BOMs in specialized protocols in this thread. This thread is about 'open'.
I do not consider the BOM dead, and it is so cheap to recognise that not bothering to do so seems almost mean spirited.
Not if you view it from the point of view of cognitive burden on casual coders. See my reply to Guido.
On Sat, Jan 23, 2021, at 08:00, Stephen J. Turnbull wrote:
I see very little use in detecting the BOMs. I haven't seen a UTF-16 BOM in the wild in a decade (as usual for me, that's Japan-specific, and may be limited to the academic community as well), and the UTF-8 BOM is a no-op if the default is UTF-8 anyway.
It's not *entirely* a no-op: you'd want the decoder to consume the leading BOM rather than returning '\ufeff' on the first read. And AIUI they're much more common on Windows (being able to detect UTF-16 *without* BOMs might be useful as well, but has historically been a source of problems on Windows) - until recently, all UTF-8 or UTF-16 files saved with Notepad would have them.
I have definitely seen BOMs written by Notepad on Windows 10.

Why can’t the future be that open() in text mode guesses the encoding?

-- 
--Guido (mobile)
On Sun, Jan 24, 2021 at 01:32:28AM +0000, MRAB wrote:
On 2021-01-24 01:14, Guido van Rossum wrote:
I have definitely seen BOMs written by Notepad on Windows 10.
Why can’t the future be that open() in text mode guesses the encoding?
"In the face of ambiguity, refuse the temptation to guess."
"Although practicality beats purity." The Zen is like scripture: there's a koan for any position you wish to take :-) If you want to be pedantic, and I certainly do *wink*, providing any default for the encoding parameter is a guess. The encoding of all text files is ambiguous (the intended encoding is metadata which is not recorded in the file format). Most text files on Linux and Mac OS use UTF-8, and many on Windows too, but not *all* so setting the default to UTF-8 is just a guess. I understand that there are good heuristics for auto-detection of encodings which are reliable and used in many other software. If auto-detection is a "guess", its an *educated* guess and not much different from the status quo, which usually guesses correctly on Linux and Mac but often guesses wrongly on Windows. This proposal is to improve the quality of the guess by inspecting the file's contents. For example, a file opened in text mode where every second character is a NULL is *almost certainly* UTF-16. The chances that somebody actually intended to write: H\0e\0l\0l\0o\O \OW\0o\0r\0l\0d\0 rather than "Hello World" is negligible. Before we consider changing the default encoding to "auto-detect", I would like to see some estimate of how many UTF-8 encoded files will be misclassified as something else. That is, if we make this change, how much software that currently guesses UTF-8 correctly (the default encoding is the actual intended encoding) will break because it guesses something else? That surely won't happen with mostly-ASCII files, but I suppose it could happen with some non-English languages? -- Steve
On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum <guido@python.org> wrote:
I have definitely seen BOMs written by Notepad on Windows 10.
Why can’t the future be that open() in text mode guesses the encoding?
I don't like guessing. As a Japanese user, I have seen a lot of mojibake caused by wrong guesses. I don't think guessing the encoding makes for reliable software.

On the other hand, if we add `open_utf8()`, it's easy to ignore the BOM:

* When reading, use "utf-8-sig". (It can read UTF-8 without a BOM too.)
* When writing, use "utf-8".

Although UTF-8 with BOM is not recommended, and Notepad has used UTF-8 without BOM as its default encoding since Windows 10 version 1903, UTF-8 with BOM is still used in some cases. For example, Excel reads CSV files as UTF-8 only when they have a BOM, otherwise as a legacy encoding, so some CSV files are written with a BOM.

Regards,
-- Inada Naoki <songofacandy@gmail.com>
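(For reference, the codec difference that this read/write split relies on, shown as a quick standard-library check.)

```python
# "utf-8-sig" strips a leading BOM when decoding, and also accepts input
# that has no BOM at all:
assert "\ufeffhi".encode("utf-8").decode("utf-8-sig") == "hi"
assert b"hi".decode("utf-8-sig") == "hi"

# Plain "utf-8" keeps the BOM as a visible \ufeff character when decoding,
# and never adds one when encoding:
assert "\ufeffhi".encode("utf-8").decode("utf-8") == "\ufeffhi"
assert "hi".encode("utf-8") == b"hi"
assert "hi".encode("utf-8-sig") == b"\xef\xbb\xbfhi"
```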
On Sat, Jan 23, 2021 at 9:22 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Sun, Jan 24, 2021 at 10:17 AM Guido van Rossum <guido@python.org> wrote:
I have definitely seen BOMs written by Notepad on Windows 10.
Why can’t the future be that open() in text mode guesses the encoding?
I don't like guessing. As a Japanese user, I have seen a lot of mojibake caused by wrong guesses. I don't think guessing the encoding makes for reliable software.
I agree that guessing encodings in general is a bad idea and is an avenue for subtle localization issues - bad things will happen when it guesses wrong, and it will lead to code that works properly on the developer's machine and fails for end users.

It makes sense for a text editor to try to guess, because showing the user something is better than nothing (and if it guesses wrong the user can easily see that, and perhaps take some manual action to correct it). It does not make sense for a programming language to guess, because the user cannot easily detect or correct an incorrect guess, and mistakes will tend to be propagated rather than caught.

On the other hand, if we add `open_utf8()`, it's easy to ignore the BOM:
Rather than introducing a new `open_utf8` function, I'd suggest the following:

1. Deprecate calling `open` for text mode (the default) unless an `encoding=` is specified, and 3 years after deprecation change the default encoding for `open` to "utf-8-sig" for reading and "utf-8" for writing (to ignore a BOM if one exists when reading, but to not create a BOM when writing).

2. At the same time as the deprecation is announced, introduce a new __future__ import named "utf8_open" or something like that, to opt into the future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a file in text mode and no explicit encoding is specified.

I think a __future__ import solves the problem better than introducing a new function would. Users who already have a UTF-8 locale (the majority of users on the majority of platforms) could simply turn on the new __future__ import in any files where they're calling open() with no change in behavior, suppressing the deprecation warning.

Users who have a non-UTF-8 locale and want to keep opening text files in that non-UTF-8 locale by default can add encoding=locale.getpreferredencoding(False) to retain the old behavior, suppressing the deprecation warning. And perhaps we could make a shortcut for that, like encoding="locale".

~Matt
On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski <godlygeek@gmail.com> wrote:
2. At the same time as the deprecation is announced, introduce a new __future__ import named "utf8_open" or something like that, to opt into the future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a file in text mode and no explicit encoding is specified.
I think a __future__ import solves the problem better than introducing a new function would.
Note that, since this doesn't involve any language or syntax changes, a regular module import would work here - something like "from utf8mode import open", which would then shadow the builtin. Otherwise no change to your proposal - everything else works exactly the same way. ChrisA
On Sat, Jan 23, 2021 at 10:51 PM Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Jan 24, 2021 at 2:46 PM Matt Wozniski <godlygeek@gmail.com> wrote:
2. At the same time as the deprecation is announced, introduce a new __future__ import named "utf8_open" or something like that, to opt into the future behavior of `open` defaulting to utf-8-sig or utf-8 when opening a file in text mode and no explicit encoding is specified.
I think a __future__ import solves the problem better than introducing a new function would.
Note that, since this doesn't involve any language or syntax changes, a regular module import would work here - something like "from utf8mode import open", which would then shadow the builtin. Otherwise no change to your proposal - everything else works exactly the same way.
True - that's an even better idea. That even allows it to be wrapped in a try/except ImportError, allowing someone to write code that's backwards compatible with versions before the new function is introduced. Though it does mean that the new function will need to stick around, even though it will eventually be identical to the builtin open() function.

That would also allow the option of introducing a locale_open as well, which would behave as though encoding=locale.getpreferredencoding(False) were the default encoding for files opened in text mode. I can imagine putting both functions in io, and allowing the user to silence the deprecation warning by either opting into the new behavior:

    from io import utf8_open as open

or explicitly declaring their desire for the legacy behavior:

    from io import locale_open as open

~Matt
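(A sketch of what such an opt-in module could contain; neither `io.utf8_open` nor `io.locale_open` exists today, so the module name and both functions are hypothetical.)

```python
# utf8mode.py - hypothetical opt-in module sketched from this thread
import builtins
import locale


def utf8_open(file, mode="r", *args, encoding=None, **kwargs):
    # Default text-mode opens to UTF-8; binary mode and explicit encodings
    # pass through unchanged.
    if "b" not in mode and encoding is None:
        encoding = "utf-8"
    return builtins.open(file, mode, *args, encoding=encoding, **kwargs)


def locale_open(file, mode="r", *args, encoding=None, **kwargs):
    # Explicitly keep today's locale-dependent behavior.
    if "b" not in mode and encoding is None:
        encoding = locale.getpreferredencoding(False)
    return builtins.open(file, mode, *args, encoding=encoding, **kwargs)


# Opting in would then be one line per module:
#     from utf8mode import utf8_open as open
```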
Matt Wozniski writes:
Rather than introducing a new `open_utf8` function, I'd suggest the following:
1. Deprecate calling `open` for text mode (the default) unless an `encoding=` is specified,
For that, we should have a sentinel for "system default encoding" (as you acknowledge, but I want to foot-stomp it). The current dance to get that is quite annoying.
I think a __future__ import [of 'open_text' by some name] solves the problem better than introducing a new function would.
Only if you redefine the problem. If the problem is casual coders who want a quick-and-dirty ready-to-bake function to read UTF-8 when their default encodings are something else, then it's builtin or Just Don't -- teach them to copy-paste "encoding='utf-8'" FTW.

I'm perfectly happy with "Just Don't" followed by "It's Time to Work on UTF-8 by Default". You'll have to ask Naoki how he feels about that. Your proposal (1. above) is an interesting one for that.

Steve
On Sat, Jan 23, 2021, at 22:43, Matt Wozniski wrote:
1. Deprecate calling `open` for text mode (the default) unless an `encoding=` is specified,
I have a suggestion, if this is going to be done: If the third positional argument to open is a string, accept it as encoding instead of buffering. Maybe even allow the fourth to be errors. It might be worthwhile to consider making the other arguments keyword-only - are they ever used positionally in real-world code?
Guido van Rossum writes:
I have definitely seen BOMs written by Notepad on Windows 10.
I'm not clear under what circumstances we care whether a UTF-8 file has or doesn't have a UTF-8 signature. Most software doesn't care; it just reads it and spits it back out if it's there and hasn't been edited out. If people are seeing UTF-16 BOMs, that may be worth detecting, depending on how often that happens and how much trouble it is to deal with them. I'm just saying that I never see them, and I was pretty careful to say that my sample is quite restricted. However ...
Why can’t the future be that open() in text mode guesses the encoding?
The medium-term future is UTF-8 in all UIs and public APIs, except for archivists. I think we all agree on that.

There are two issues with encoding guessing. The statistically unimportant one (at least for UTFs) is that guessing is guessing: it will get it wrong, and the people who want guessing are mostly the people who will be hurt most by wrong guesses.

Second, and a real issue for design AFAICS: if you introduce detection of other encodings to 'open', the programmer may need to (1) discover that encoding in order to match it on output (open does not return that), or (2) choose the correct encoding on output, which may or may not be the detected one depending on what the next software in the pipeline expects. At that point "in the face of ambiguity" really does bind, "although practicality" notwithstanding. I'm not sure that putting detection into 'open' solves any problems; it just pushes them into other parts of the code.

Remark: As I understand it, Naoki's proposal is about the casual coder in a monolingual environment where either defaulting to getpreferredencoding DTRTs or they need UTF-8 because some engineer decided "UTF-8 is the future, and in my project the future is now!" I don't think it's intended to be more general than that, but you'll have to ask him about that.

Steve
On 23 Jan 2021, at 11:00, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 12:40:55AM -0500, Random832 wrote:
On Fri, Jan 22, 2021, at 20:34, Inada Naoki wrote:
* Default encoding is "utf-8".
it might be worthwhile to be a little more sophisticated than this.
Notepad itself uses character set detection [it might not be reasonable to do this on the whole file as Notepad does, but maybe the first 512 bytes, or the result of read1(512)?] when opening a file of unknown encoding, and msvcrt's "ccs=UTF-8" option to fopen will at least detect the presence of UTF-8 and UTF-16 BOMs [and treat the file as UTF-16 in the latter case].
I like Random's idea. If we add a new "open text file" builtin function, we should seriously consider having it attempt to auto-detect the encoding. It need not be as sophisticated as `chardet`.
I think that you are going to create a bug magnet if you attempt to auto-detect the encoding.

The first problem I see is that the file may be a pipe, and then you will block until you have enough data to do the auto-detection.

The second problem is that the first N bytes may all be ASCII, and only later do you see a Windows code page signature (odd lack of utf-8 signature).
That auto-detection behaviour could be enough to differentiate it from the regular open(), thus solving the "but in ten years time it will be redundant and will need to be deprecated" objection.
Having said that, I can't say I'm very keen on the name "open_text", but I can't think of any other bikeshed colour I prefer.
Given that the function's purpose is to open Unicode text, use a name that reflects that it is the encoding that is set, not the mode (binary vs. text). open_unicode, maybe?

If you are teaching open_text, then do you also need to have open_binary?

Barry
On Sun, Jan 24, 2021 at 2:31 AM Barry Scott <barry@barrys-emacs.org> wrote:
I think that you are going to create a bug magnet if you attempt to auto detect the encoding.
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Second problem is that the first N bytes are all in ASCII and only later do you see Windows code page signature (odd lack of utf-8 signature).
Both can be handled, just as universal newlines can, by remaining in an "uncertain" state. When the file is first opened, we know nothing about its encoding. Once you request that anything be read (eg by pumping the iterator or anything), it reads, as per current status. Then:

1) If it looks like UTF-16, assume UTF-16. Rather than falling for the "Bush hid the facts" issue, this might be restricted to files that start with a BOM.
2) If it's entirely ASCII, decode it as ASCII and stay uncertain.
3) If it can be decoded as UTF-8, remember that this is a UTF-8 file, and from there on, error out if anything isn't UTF-8.
4) Otherwise, use the system encoding.

On subsequent reads, if we're in ASCII mode, repeat steps 2-4. Until it finds a non-ASCII byte value, it doesn't really matter how it decodes it. Unlike chardet, this can be done completely dependably.

I'm not sure what would happen if the system encoding isn't an eight-bit ASCII-compatible one, though. The algorithm might produce some odd results if the file looks like ASCII, but then switches to some incompatible encoding. Can anyone give an example of a current in-use system encoding that would have this issue? How likely is it that you'd get even one line of text that purports to be ASCII?

ChrisA
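(A rough sketch of that decision order applied to a single already-read chunk; a real implementation would have to live inside the incremental decoder and keep the "uncertain" state across reads, so this is only a toy classifier.)

```python
import codecs
import locale


def guess_encoding(chunk: bytes) -> str:
    # Toy version of the steps above, in the same order.
    if chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"                      # 1) UTF-16, but only with a BOM
    try:
        chunk.decode("ascii")
        return "ascii"                       # 2) pure ASCII: stay uncertain
    except UnicodeDecodeError:
        pass
    try:
        chunk.decode("utf-8")
        return "utf-8"                       # 3) non-ASCII but valid UTF-8
    except UnicodeDecodeError:
        return locale.getpreferredencoding(False)  # 4) fall back to system


assert guess_encoding(b"plain ascii") == "ascii"
assert guess_encoding("na\u00efve".encode("utf-8")) == "utf-8"
```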
Chris Angelico writes:
Can anyone give an example of a current in-use system encoding that would have [ASCII bytes in non-ASCII text]?
Shift JIS, Big5. (Both can have bytes < 128 inside multibyte characters.) I don't know if Big5 is still in use as the default encoding anywhere, but Shift JIS is, although it's decreasing.

For both of those, once you encounter a non-ASCII byte you can just switch over, and none of the previous text was mis-decoded. But that's only if you *know* the language was Japanese (respectively Chinese). Remember, there is no encoding that can be distinguished from ISO 8859-1 (and several other Latin encodings) simply based on the bytes found, since it uses all 256 bytes.
How likely is it that you'd get even one line of text that purports to be ASCII?
Program source code where the higher-level functions (likely to contain literal strings) come late in the file is frequently misdetected based on the earlier bytes.
On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Can anyone give an example of a current in-use system encoding that would have [ASCII bytes in non-ASCII text]?
Shift JIS, Big5. (Both can have bytes < 128 inside multibyte characters.) I don't know if Big5 is still in use as the default encoding anywhere, but Shift JIS is, although it's decreasing.
Sorry, let me clarify.

Can anyone give an example of a current system encoding (ie one that is likely to be the default currently used by open()) that can have byte values below 128 which do NOT mean what they would mean in ASCII? In other words, is it possible to read in a section of a file, think that it's ASCII, and then find that you decoded it wrongly?
For both of those once you encounter a non-ASCII byte you can just switch over, and none of the previous text was mis-decoded.
Good to know, so these two won't be a problem. I'm assuming here that there is a *single* default system encoding, meaning that the automatic handler has only three cases to worry about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system encoding.
But that's only if you *know* the language was Japanese (respectively Chinese). Remember, there is no encoding that can be distinguished from ISO 8859-1 (and several other Latin encodings) simply based on the bytes found, since it uses all 256 bytes.
Right, but as long as there's only one system encoding, that's not our problem. If you're on a Greek system and you want to decode ISO-8859-9 text, you have to state that explicitly. For the situations where you want heuristics based on byte distributions, there's always chardet.
How likely is it that you'd get even one line of text that purports to be ASCII?
Program source code where the higher-level functions (likely to contain literal strings) come late in the file is frequently misdetected based on the earlier bytes.
Yup; and the real question is whether anything would have been decoded incorrectly. If you read in a bunch of ASCII-only text and yield it to the app, and then come across something that proves that the file is not UTF-8, then as far as I am aware, you won't have to un-yield any of the previous text - it'll all have been correctly decoded. In theory, UTF-16 without a BOM can consist entirely of byte values below 128, and that's an absolute pain. But there's no solution to that, other than demanding a BOM (or hoping that the first few characters are all ASCII, so you can see "H\0e\0l\0l\0o\0", which I wouldn't call reliable, although your odds probably aren't that bad in real-world cases). ChrisA
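A one-line check of that "nothing needs to be un-yielded" property for a pure-ASCII prefix, just as an illustration:

    # An ASCII-only prefix decodes to the same text under every ASCII-compatible
    # codec, so text already handed to the application was decoded correctly
    # regardless of what the rest of the file turns out to be.
    prefix = b"status: OK\ncount: 42\n"
    assert (prefix.decode("ascii") == prefix.decode("utf-8")
            == prefix.decode("latin-1") == prefix.decode("cp1252"))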
On Sun, Jan 24, 2021 at 10:00:47PM +1100, Chris Angelico wrote:
On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Can anyone give an example of a current in-use system encoding that would have [ASCII bytes in non-ASCII text]?
Shift JIS, Big5. (Both can have bytes < 128 inside multibyte characters.) I don't know if Big5 is still in use as the default encoding anywhere, but Shift JIS is, although it's decreasing.
Sorry, let me clarify.
Can anyone give an example of a current system encoding (ie one that is likely to be the default currently used by open()) that can have byte values below 128 which do NOT mean what they would mean in ASCII? In other words, is it possible to read in a section of a file, think that it's ASCII, and then find that you decoded it wrongly?
I believe that IBM mainframes such as the Z series still use EBCDIC. Python for z/OS has EBCDIC/UTF interoperability as a selling point. I think that just means the codec module :-) https://www.ibm.com/products/open-enterprise-python-zos -- Steve
On 1/24/21 6:00 AM, Chris Angelico wrote:
Sorry, let me clarify.
Can anyone give an example of a current system encoding (ie one that is likely to be the default currently used by open()) that can have byte values below 128 which do NOT mean what they would mean in ASCII? In other words, is it possible to read in a section of a file, think that it's ASCII, and then find that you decoded it wrongly?
EBCDIC is one big option. There are also some national character sets which change a couple of the lower 128 characters to characters that the language needed. (This was the cause of adding Trigraphs to C: to provide a way to enter those characters on systems that didn't have them.) One common example was a Japanese character set that replaced \ with the yen sign (and a few others) and then used some codes above 128 for multi-byte sequences. Users of such systems just got used to using the yen sign as the path separator. The EBCDIC cases would likely be well known on those systems, and planned for. Having a few of the lower 128 characters substituted could be a bigger surprise for a system. -- Richard Damon
Chris Angelico writes:
Can anyone give an example of a current system encoding (ie one that is likely to be the default currently used by open()) that can have byte values below 128 which do NOT mean what they would mean in ASCII? In other words, is it possible to read in a section of a file, think that it's ASCII, and then find that you decoded it wrongly?
Japanese Shift JIS, as mentioned by Richard. The Japanese just redefine the glyph used for Windows paths and character escapes to be the yen sign. So it's a total muddle, because they also use that for the yen sign. They also use a broken vertical bar for the pipe symbol, but the visual similarity there is so strong that you have to know a *lot* of computational Japanese to realize that they're different characters (they are, in JIS, but nobody cares -- there's almost never a reason to use both).
I'm assuming here that there is a *single* default system encoding, meaning that the automatic handler has only three cases to worry about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system encoding.
Sure that handles a lot of cases ... but the vast majority are already handled with just the system encoding and UTF-8. In my experience the UTF-16 cases are not going to be the majority of what's left. YMMV.
Right, but as long as there's only one system encoding, that's not our problem. If you're on a Greek system and you want to decode ISO-8859-9 text, you have to state that explicitly. For the situations where you want heuristics based on byte distributions, there's always chardet.
But that's the big question. If you're just going to fall back to chardet, you might as well start there. No? Consider: if 'open' detects the encoding for you, *you can't find out what it is*. 'open' has no facility to tell you!

As somebody else pointed out, if you're writing a text editor, autodetection makes a lot of sense. You just provide a facility for the user to choose something different and reread the file. But if you're running non-interactively, it's much harder to recover -- and 'open' can't do it for you.
Program source code where the higher-level functions (likely to contain literal strings) come late in the file is frequently misdetected based on the earlier bytes.
Yup; and the real question is whether anything would have been decoded incorrectly.
If I recall correctly there are several Latin-1 characters in UTF-8 which are plausible Windows 125x digraphs. So, yes, it's quite possible.
If you read in a bunch of ASCII-only text and yield it to the app, and then come across something that proves that the file is not UTF-8, then as far as I am aware, you won't have to un-yield any of the previous text - it'll all have been correctly decoded.
Not if it's UTF-16. And again, if you put the detection logic in 'open', once you've yielded anything to the main logic *it's too late to change your mind*.
In theory, UTF-16 without a BOM can consist entirely of byte values below 128,
It's not just theory, it's my life. 62/80 of the Japanese "hiragana" syllabary is composed of 2 printing ASCII characters (including SPC). A large fraction of the Han ideographs satisfy that condition, and I wouldn't be surprised if a majority of the 1000 most common ones do. (Not a good bet because half of the ideographs have a low byte > 127, but the order of characters isn't random, so if you get a couple of popular radicals that have 50 or so characters in a group in that range, you'd be much of the way there.)
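To make that concrete (a quick illustration, not from the original mail): encode a few hiragana in UTF-16-BE and every byte is a printable ASCII character.

    for ch in "あいうえお":
        print(ch, ch.encode("utf-16-be"))
    # あ b'0B'
    # い b'0D'
    # う b'0F'
    # え b'0H'
    # お b'0J'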
But there's no solution to that,
Well, yes, but that's my line. ;-)
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Right, but as long as there's only one system encoding, that's not our problem. If you're on a Greek system and you want to decode ISO-8859-9 text, you have to state that explicitly. For the situations where you want heuristics based on byte distributions, there's always chardet.
But that's the big question. If you're just going to fall back to chardet, you might as well start there. No? Consider: if 'open' detects the encoding for you, *you can't find out what it is*. 'open' has no facility to tell you!
Isn't that what file objects have attributes for? You can find out, for instance, what newlines a file uses, even if it's being autodetected.
In theory, UTF-16 without a BOM can consist entirely of byte values below 128,
It's not just theory, it's my life. 62/80 of the Japanese "hiragana" syllabary is composed of 2 printing ASCII characters (including SPC). A large fraction of the Han ideographs satisfy that condition, and I wouldn't be surprised if a majority of the 1000 most common ones do. (Not a good bet because half of the ideographs have a low byte > 127, but the order of characters isn't random, so if you get a couple of popular radicals that have 50 or so characters in a group in that range, you'd be much of the way there.)
But there's no solution to that,
Well, yes, but that's my line. ;-)
Do you get files that lack the BOM? If so, there's fundamentally no way for the autodetection to recognize them. That's why, in my quickly-whipped-up algorithm above, I basically had it assume that no BOM means not UTF-16. After all, there's no way to know whether it's UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point of it), so IMO it's not unreasonable to assert that all files that don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the ASCII-compatible detection method. (Of course, this is *ONLY* if you don't specify an encoding. That part won't be going away.) ChrisA
On 2021-01-24 17:04, Chris Angelico wrote:
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes:
Right, but as long as there's only one system encoding, that's not our problem. If you're on a Greek system and you want to decode ISO-8859-9 text, you have to state that explicitly. For the situations where you want heuristics based on byte distributions, there's always chardet.
But that's the big question. If you're just going to fall back to chardet, you might as well start there. No? Consider: if 'open' detects the encoding for you, *you can't find out what it is*. 'open' has no facility to tell you!
Isn't that what file objects have attributes for? You can find out, for instance, what newlines a file uses, even if it's being autodetected.
In theory, UTF-16 without a BOM can consist entirely of byte values below 128,
It's not just theory, it's my life. 62/80 of the Japanese "hiragana" syllabary is composed of 2 printing ASCII characters (including SPC). A large fraction of the Han ideographs satisfy that condition, and I wouldn't be surprised if a majority of the 1000 most common ones do. (Not a good bet because half of the ideographs have a low byte > 127, but the order of characters isn't random, so if you get a couple of popular radicals that have 50 or so characters in a group in that range, you'd be much of the way there.)
But there's no solution to that,
Well, yes, but that's my line. ;-)
Do you get files that lack the BOM? If so, there's fundamentally no way for the autodetection to recognize them. That's why, in my quickly-whipped-up algorithm above, I basically had it assume that no BOM means not UTF-16. After all, there's no way to know whether it's UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point of it), so IMO it's not unreasonable to assert that all files that don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the ASCII-compatible detection method.
(Of course, this is *ONLY* if you don't specify an encoding. That part won't be going away.)
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's probably UTF16-BE and if you see patterns like b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE. You could also look for, say, sequences of Latin characters and sequences of Han characters.
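A rough sketch of that NUL-byte heuristic, purely to illustrate the idea (the function name and the thresholds are made up, and text with few Latin characters would simply fall through to None):

    def guess_utf16_endianness(data):
        """Return 'utf-16-be', 'utf-16-le' or None based on where the NUL bytes sit."""
        if len(data) < 4:
            return None
        even_nuls = sum(b == 0 for b in data[0::2])
        odd_nuls = sum(b == 0 for b in data[1::2])
        if even_nuls > 4 * odd_nuls:
            return "utf-16-be"   # e.g. b'\x00H\x00e\x00l\x00l\x00o'
        if odd_nuls > 4 * even_nuls:
            return "utf-16-le"   # e.g. b'H\x00e\x00l\x00l\x00o\x00'
        return None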
On 1/24/21 1:18 PM, MRAB wrote:
On 2021-01-24 17:04, Chris Angelico wrote:
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Chris Angelico writes: > Right, but as long as there's only one system encoding, that's not > our problem. If you're on a Greek system and you want to decode > ISO-8859-9 text, you have to state that explicitly. For the > situations where you want heuristics based on byte distributions, > there's always chardet.
But that's the big question. If you're just going to fall back to chardet, you might as well start there. No? Consider: if 'open' detects the encoding for you, *you can't find out what it is*. 'open' has no facility to tell you!
Isn't that what file objects have attributes for? You can find out, for instance, what newlines a file uses, even if it's being autodetected.
> In theory, UTF-16 without a BOM can consist entirely of byte values > below 128,
It's not just theory, it's my life. 62/80 of the Japanese "hiragana" syllabary is composed of 2 printing ASCII characters (including SPC). A large fraction of the Han ideographs satisfy that condition, and I wouldn't be surprised if a majority of the 1000 most common ones do. (Not a good bet because half of the ideographs have a low byte > 127, but the order of characters isn't random, so if you get a couple of popular radicals that have 50 or so characters in a group in that range, you'd be much of the way there.)
> But there's no solution to that,
Well, yes, but that's my line. ;-)
Do you get files that lack the BOM? If so, there's fundamentally no way for the autodetection to recognize them. That's why, in my quickly-whipped-up algorithm above, I basically had it assume that no BOM means not UTF-16. After all, there's no way to know whether it's UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point of it), so IMO it's not unreasonable to assert that all files that don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the ASCII-compatible detection method.
(Of course, this is *ONLY* if you don't specify an encoding. That part won't be going away.)
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's probably UTF16-BE and if you see patterns like b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
You could also look for, say, sequences of Latin characters and sequences of Han characters.
Yes, if you happen to see that sort of pattern, you could perhaps make a guess, but since part of the goal is to not need to read ahead much of the file, it isn't a very reliable way to confirm a UTF-16 file when it doesn't begin with Latin characters. -- Richard Damon
On Sun, Jan 24, 2021, at 13:18, MRAB wrote:
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's probably UTF16-BE and if you see patterns like b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
You could also look for, say, sequences of Latin characters and sequences of Han characters.
This is dangerous, as Microsoft discovered: a sequence of ASCII Latin characters can look a lot like a sequence of UTF-16 Han characters. On Windows, Notepad always writes UTF-16 with a BOM, even though it now writes UTF-8 without one by default.

Probably the winning combination is: if there is a UTF-16 BOM, it's UTF-16; else if the first few non-ASCII bytes encountered are valid UTF-8, it's UTF-8; otherwise it's the system default 'ANSI' locale.

The one problem with that is what to do if something like a pipe or a socket gets a sequence of bytes that are a valid *partial* UTF-8 character, then doesn't get any more data for a while. It's unacceptable to have to wait for more data before interpreting data that has been read. Notepad has the luxury of only working on ordinary files, and being able to scan the whole file before making a decision about the character set [I believe it mmaps the file rather than using ordinary open/read calls].
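The "valid partial UTF-8 character" situation is easy to reproduce with the standard library's incremental decoder; nothing here is specific to any proposal:

    import codecs

    dec = codecs.getincrementaldecoder("utf-8")()
    print(repr(dec.decode(b"\xe2\x84")))  # '' -- the first two bytes of '™'; nothing can be emitted yet
    print(repr(dec.decode(b"\xa2")))      # '™' -- only now is the character complete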
Chris Angelico writes:
Isn't that what file objects have attributes for?
You're absolutely right. Not sure what I was thinking. (Note: not an excuse for my brain bubble, but Path.read_text and Path.read_bytes do have this problem because they return str and bytes respectively.)
Do you get files that lack the BOM?
As I wrote earlier, I don't get UTF-16 text files at all. You'll have to ask somebody else. I'm just pointing out that it's pretty likely that if they exist, there are languages that are likely to not distinguish ASCII from UTF-16 in some files without a (fragile) statistical analysis of byte frequencies. Do you actually face the problem of receiving data that should be decoded one way but Python does something different by default? Or are you just tired of hearing about the problems of people who can't "just assume UTF-8 and wish Python would, too"?
so IMO it's not unreasonable to assert that all files that don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the ASCII-compatible detection method.
As I've said before, I think Naoki's suggestion is aimed at something different: the user for whom getpreferredencoding normally DTRTs but has streams that they know are UTF-8 and want a simple obvious way to read and write them. That is the usual case in my experience. As of now, Guido and Naoki have agreed to document "encoding='utf-8'" and drop 'open_text', so I think the discussion is moot, unless somebody really wants to push autodetection of encodings. If somebody has a different experience, I'd like to hear about it. But note that my experience (and Naoki's) is special: in Japan we encounter at least three different encodings of Japanese daily in plain text (ISO-2022-JP in mail, UTF-8 and Shift-JIS in local files). So if anybody is likely to experience the need, I believe we are. Steve
On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
I think that you are going to create a bug magnet if you attempt to auto detect the encoding.
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Can you use `open('filename')` to read a pipe? Is blocking a problem in practice? If you try to open a network file, that could block too, if there are network issues. And since you're likely to follow the open with a read, the read is likely to block. So over all I don't think that blocking is an issue.
Second problem is that the first N bytes are all in ASCII and only later do you see Windows code page signature (odd lack of utf-8 signature).
UTF-8 is a strict superset of ASCII, so if the file is actually ASCII, there is no harm in using UTF-8. The bigger issue is if you have N bytes of pure ASCII followed by some non-UTF superset, such as one of the ISO-8859-* encodings. So you end up detecting what you think is ASCII/UTF-8 but is actually some legacy encoding. But if N is large, say 512 bytes, that's unlikely in practice.
That auto-detection behaviour could be enough to differentiate it from the regular open(), thus solving the "but in ten years time it will be redundant and will need to be deprecated" objection.
Having said that, I can't say I'm very keen on the name "open_text", but I can't think of any other bikeshed colour I prefer.
Given the the functions purpose is to open unicode text use a name that reflects that it is the encoding that is set not the mode (binary vs. text).
open_unicode maybe?
I guess that depends on whether the auto-detection is intended to support non-Unicode legacy encodings or not.
If you are teaching open_text then do you also need to have open_binary?
No. There are no frustrating, difficult, platform-specific encoding issues when reading binary files. Bytes are bytes. -- Steve
On Mon, Jan 25, 2021 at 12:33 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
I think that you are going to create a bug magnet if you attempt to auto detect the encoding.
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Can you use `open('filename')` to read a pipe?
Yes. You can even use it with stdin:
open("/proc/self/fd/0").read(1) a 'a'
The second line was me typing something, even though I was otherwise at the REPL.
Is blocking a problem in practice? If you try to open a network file, that could block too, if there are network issues. And since you're likely to follow the open with a read, the read is likely to block. So over all I don't think that blocking is an issue.
Definitely could be a problem if you read too much just for the sake of autodetection. It needs to be possible to do everything with an absolute minimum of reading.
Second problem is that the first N bytes are all in ASCII and only later do you see Windows code page signature (odd lack of utf-8 signature).
UTF-8 is a strict superset of ASCII, so if the file is actually ASCII, there is no harm in using UTF-8.
The bigger issue is if you have N bytes of pure ASCII followed by some non-UTF superset, such as one of the ISO-8859-* encodings. So you end up detecting what you think is ASCII/UTF-8 but is actually some legacy encoding. But if N is large, say 512 bytes, that's unlikely in practice.
There's no problem if you think it's ASCII, so the only problem would be if you start thinking that it's UTF-8 and then discover that it isn't. The scheme used by UTF-8 is designed such that this is highly unlikely with random data or actual text in an eight-bit encoding, so it's more likely to be broken UTF-8 than legit ISO-8859-X. ChrisA
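A tiny demonstration of why eight-bit legacy text rarely passes as UTF-8 (illustrative only):

    # Windows-1252/Latin-1 bytes above 0x7F usually form invalid UTF-8 sequences.
    data = "café résumé".encode("latin-1")    # b'caf\xe9 r\xe9sum\xe9'
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)   # 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte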
On 2021-01-25 at 00:29:41 +1100, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
I think that you are going to create a bug magnet if you attempt to auto detect the encoding.
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Can you use `open('filename')` to read a pipe?
Yes. Named pipes are files, at least on POSIX. And no. Unnamed pipes are identified by OS-level file descriptors, so you can't open them with open('filename'), but you can open them with os.fdopen. Once opened, such data sources "should be" interchangeable.
Is blocking a problem in practice? If you try to open a network file, that could block too, if there are network issues. And since you're likely to follow the open with a read, the read is likely to block. So over all I don't think that blocking is an issue.
If open blocks waiting for too many bytes, then my application never gets to respond unless enough data comes through the pipe. Consider protocols like FTP and SMTP, where commands and responses are often only handfuls of bytes long. OTOH, if I'm opening a file (or a pipe) for such a protocol, then both ends should know the encoding ahead of time and there's no need to guess.
On Sun, Jan 24, 2021 at 9:53 AM <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
On 2021-01-25 at 00:29:41 +1100, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Can you use `open('filename')` to read a pipe?
Yes. Named pipes are files, at least on POSIX.
And no. Unnamed pipes are identified by OS-level file descriptors, so you can't open them with open('filename'),
The `open` function takes either a file path as a string, or a file descriptor as an integer. So you can use `open` to read an unnamed pipe or a socket.
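For example, something like this works today on POSIX (a small sketch, nothing new being proposed):

    import os

    read_fd, write_fd = os.pipe()
    os.write(write_fd, "hello\n".encode("utf-8"))
    os.close(write_fd)                           # close the write end so read() sees EOF

    with open(read_fd, encoding="utf-8") as f:   # an int fd instead of a filename
        print(f.read())                          # hello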
Is blocking a problem in practice? If you try to open a network file,
that could block too, if there are network issues. And since you're likely to follow the open with a read, the read is likely to block. So over all I don't think that blocking is an issue.
If open blocks waiting for too many bytes, then my application never gets to respond unless enough data comes through the pipe.
It's possible to do a `f.read(1)` on a file opened in text mode. If the first two bytes of the file are 0xC2 0x99, that's either the single control character U+0099 if the file is UTF-8, or 슙 if the file is UTF-16BE, or 駂 if the file is UTF-16LE. And `f.read(1)` needs to pick one of those and return it immediately. It can't wait for more information. The contract of `read` is "Read from the underlying buffer until we have n characters or we hit EOF." A call to `read(1)` cannot keep blocking after the first character has been received in order to decide what encoding to decode it as; that would be backwards incompatible, and it might block forever if the sender only sends one character before waiting for a response.

~Matt
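For anyone who wants to check that three-way ambiguity, a quick look with the built-in codecs (purely illustrative):

    data = b"\xc2\x99"
    print(repr(data.decode("utf-8")))       # '\x99'  (U+0099, a C1 control character)
    print(repr(data.decode("utf-16-be")))   # '슙'
    print(repr(data.decode("utf-16-le")))   # '駂'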
On Sun, Jan 24, 2021 at 10:43:54PM -0500, Matt Wozniski wrote:
On Sun, Jan 24, 2021 at 9:53 AM <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
On 2021-01-25 at 00:29:41 +1100, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Jan 23, 2021 at 03:24:12PM +0000, Barry Scott wrote:
First problem I see is that the file may be a pipe and then you will block until you have enough data to do the auto detect.
Can you use `open('filename')` to read a pipe?
Yes. Named pipes are files, at least on POSIX.
And no. Unnamed pipes are identified by OS-level file descriptors, so you can't open them with open('filename'),
The `open` function takes either a file path as a string, or a file descriptor as an integer. So you can use `open` to read an unnamed pipe or a socket.
Okay, but I was asking about using open with a filename string. In any case, the existence of named pipes answers my question. [...]
It's possible to do a `f.read(1)` on a file opened in text mode. If the first two bytes of the file are 0xC2 0x99, that's either the single control character U+0099 if the file is UTF-8, or 슙 if the file is UTF-16BE, or 駂 if the file is UTF-16LE.
Or Â followed by the SGC control code in Latin-1. Or Â™ in Windows-1252, or ¬ô in MacRoman. Etc.
And `f.read(1)` needs to pick one of those and return it immediately. It can't wait for more information. The contract of `read` is "Read from underlying buffer until we have n characters or we hit EOF."
In text mode, reads are always buffered: https://docs.python.org/3/library/functions.html#open so `f.read(1)` will read as much as needed, so long as it only returns a single character. A typical buffer size is 4096 bytes, or more. In any case, I believe the intention of this proposal is for *open*, not read, to perform the detection. -- Steve
On Mon, Jan 25, 2021, 4:25 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Jan 24, 2021 at 10:43:54PM -0500, Matt Wozniski wrote:
And `f.read(1)` needs to pick one of those and return it immediately. It can't wait for more information. The contract of `read` is "Read from underlying buffer until we have n characters or we hit EOF."
In text mode, reads are always buffered:
https://docs.python.org/3/library/functions.html#open
so `f.read(1)` will read as much as needed, so long as it only returns a single character.
Text mode files are always backed by a buffer, yes, but that's not relevant. My point is that `f.read(1)` must immediately return a character if one exists in the buffer. It can't wait for more data to get buffered if there is already a buffered character, as that would be a backwards-incompatible change that would badly break line-based protocols like FTP, SMTP, and POP. Up until now, `f.read(1)` has always read bytes from the underlying file descriptor into the buffer until it has one full character, and immediately returned it. And this is user-facing behavior. Imagine an echo server that reads 1 character at a time and echoes it back, forever. The client will only ever send 1 character at a time, so if an eight-bit locale encoding is in use the client will only send one byte before waiting for a response. As things stand today this works. If encoding detection were added and the server's call to `f.read(1)` could decide it doesn't know how to decode the first byte it gets and to block until more data comes in, that would be a deadlock, since the client isn't sending more.

A typical buffer size is 4096 bytes, or more.

Sure, but that doesn't mean that much data is always available. If something has written less than that, it's not reasonable to block until more data can be buffered in places where up until now no blocking would have occurred. Not least because no more data will necessarily ever come. And if it were to instead make its decisions based on what has been buffered already, without ever blocking, then the behavior becomes nondeterministic: it could return a different character based on how much data the OS returned in the first read syscall.

In any case, I believe the intention of this proposal is for *open*, not read, to perform the detection.
If that's the case, named pipes are a perfect example of why that's impossible. It's perfectly normal to open a named pipe that contains no data, and that won't have any until you trigger some action (say, spawning a child process that will write to it). You can't auto-detect the encoding of an empty pipe, and you can't make open block until data arrives, because it's entirely possible data will never arrive if open blocks.
Thanks Matt for the detailed explanation of why we cannot change `open` to do encoding detection by default. I think that should answer Guido's question. It still leaves open the possibility of:

- a new mode to open() that opts in to encoding detection;
- a new built-in function that is only used for opening text files (not pipes) with encoding detection by default;
- or a new function that attempts the detection:

    enc = io.guess_encoding(FILENAME) or 'UTF-8'
    with open(FILENAME, encoding=enc) as f:
        ...

These may be useful, but I don't think that they are very helpful for solving the problem of naive programmers who don't know anything about encodings trying to open files which are encoded differently from the system encoding. Such users aren't knowledgeable enough to know that they should opt in to encoding detection. If they were, they would probably just set the encoding to "utf-8" in the first place.

-- Steve
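For what it's worth, the third option above could be sketched today with nothing but the standard library. The `guess_encoding` name here is hypothetical and deliberately conservative: it returns None when unsure, and a real version would have to cope with a sniff window that ends in the middle of a multi-byte character.

    import codecs

    def guess_encoding(filename, sniff_bytes=4096):
        """Best-effort guess of a text file's encoding; returns a codec name or None."""
        with open(filename, "rb") as f:
            head = f.read(sniff_bytes)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            return "utf-16"
        try:
            head.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return None

    # Usage, mirroring the snippet above:
    # enc = guess_encoding(FILENAME) or "utf-8"
    # with open(FILENAME, encoding=enc) as f:
    #     ...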
participants (13):

- 2QdxY4RzWzUUiLuE@potatochowder.com
- Barry Scott
- Cameron Simpson
- Chris Angelico
- Guido van Rossum
- Inada Naoki
- Matt Wozniski
- MRAB
- Paul Sokolovsky
- Random832
- Richard Damon
- Stephen J. Turnbull
- Steven D'Aprano