Chris Angelico writes:
On Sat, Jan 23, 2021 at 12:37 PM Inada Naoki <songofacandy@gmail.com> wrote:
## 1. Add `io.open_text()`, builtin `open_text()`, and `pathlib.Path.open_text()`.
All functions are the same as `io.open()` or `Path.open()`, except:
* Default encoding is "utf-8".
I wonder if it might not be better to remove the encoding parameter for this version. Further comments below.
* "b" is not allowed in the mode option.
I *really* don't like this, because it implies that open() will open in binary mode.
I doubt that will be a common misunderstanding, as long as 'open_text' is documented as a convenience wrapper for 'open' aimed primarily at Windows programmers.
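For concreteness, the two quoted bullets could be sketched in pure Python roughly as follows. This only illustrates the proposed semantics and is not Naoki's implementation: `open_text` does not exist in the standard library, and the exception type and message used for the "b" check are assumptions.

```python
import io


def open_text(file, mode="r", buffering=-1, encoding="utf-8",
              errors=None, newline=None):
    """Hypothetical sketch of the proposed open_text(); not a real stdlib API."""
    # Proposal: "b" is not allowed in the mode option.
    # (Raising ValueError here is an assumption, not part of the proposal.)
    if "b" in mode:
        raise ValueError("open_text() does not support binary mode")
    # Proposal: the default encoding is "utf-8" rather than the locale encoding.
    return io.open(file, mode, buffering=buffering, encoding=encoding,
                   errors=errors, newline=newline)
```

Under this sketch, `open_text(path)` behaves like `open(path, encoding="utf-8")`, which is why the wrapper is mostly a convenience.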
What do you think about this idea? Is it worth adding a new built-in function?
Highly dubious.
I won't go so far as "highly", but yeah, dubious to me. In my own environment, while I still see Shift JIS data quite a bit, it depends on the correspondent: this or that person sends it to me. While a lot of the University infrastructure used to default to Shift JIS, it now defaults to UTF-8. So I don't have a consistent rule by "kind of data", i.e., which scripts use 'open_text' and which 'open'. If the script processes data from "JIS users", it needs to accept a command-line flag because other users *will* be sending that kind of data in UTF-8. Naoki's mileage may vary. See below for additional comments.
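A minimal sketch of the command-line-flag approach, assuming a script that takes one input path. The flag name `--encoding` and the helper names `parse_args` and `read_text` are illustrative choices, not anything from the proposal.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process a text file.")
    parser.add_argument("path")
    # Hypothetical flag name: default to UTF-8, let the user override per run.
    parser.add_argument("--encoding", default="utf-8")
    return parser.parse_args(argv)


def read_text(path, encoding):
    # Pass the encoding explicitly so the locale default never matters.
    with open(path, encoding=encoding) as f:
        return f.read()
```

A run against Shift JIS input would then look like `script.py data.txt --encoding shift_jis`, while UTF-8 senders need no flag at all.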
I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
I expect there are several bodies of users who will experience that as quite obnoxious for a long time to come. I *still* see a ton of stuff that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China gb18030 isn't just a good idea, it's the law. (OK, the precise statement of the law is "must support", not "must use", but my Chinese students all default to GB.) The problem is that these users use some software that creates text in a national language encoding by default and other software that uses UTF-8 by default. So I guess Naoki's hope is that "when I'm processing Microsoft/Oracle-generated data, I use 'open_text', when it's local software I use 'open'" becomes an easy and natural response in such environments.

We don't see very many Asian language users on the python-* lists. We see a few more Russian users, I suspect quite a few Hebrew and Indic users, maybe a few Arabic users. So we should listen very carefully to the few we do have when they come from tiny minorities of python-* subscribers.
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward incompatibility with just about every script in existence?
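To illustrate the kind of breakage at stake, here is a small sketch. It assumes a file produced by software that writes Shift JIS, and passes explicit encodings to stand in for "the old locale default" and "a new UTF-8 default". The file name and sample text are made up for the example.

```python
import os
import tempfile

text = "日本語"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")
    # Simulate a file written by software that defaults to Shift JIS.
    with open(path, "w", encoding="shift_jis") as f:
        f.write(text)

    # Reading with the encoding it was written in works.
    with open(path, encoding="shift_jis") as f:
        roundtrip = f.read()

    # Reading the same bytes as UTF-8 raises UnicodeDecodeError.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True
```

A script that relies on a bare `open(path)` picking up a Shift JIS locale default would hit the second case as soon as the default changed.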
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The problem is that on Windows there are a lot of installations that continue to use non-UTF-8 encodings enough that users set their preferred encoding that way. I guess that folks whose native-language alphabet is mostly drawn from ASCII are by now almost all using UTF-8 by default, but this is not so for East Asians (who almost all still use a mixture of several encodings every day because email still often defaults to national standard encodings). I can't speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be surprised if they're somewhere in the middle.

Naoki can document that "open(..., encoding='...')" is strongly preferred to 'open_text'. Maybe a better name is "open_utf8", to discourage people who want to use non-default encodings, or programmatically chosen encodings, in that function.

As someone who avoids Windows like the plague, I have no real sense of how important this is, and I like your argument from first principles. So on net, I guess I'm +/- 0 only because Naoki thinks it important enough to spend quite a bit of skull sweat and effort on this.

Steve