On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull email@example.com wrote:
I'd rather focus on just moving to UTF-8 as the default, rather than bringing in a new function - especially with such a confusing name.
I expect there are several bodies of users who will experience that as quite obnoxious for a long time to come. I *still* see a ton of stuff that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China gb18030 isn't just a good idea, it's the law. (OK, the precise statement of the law is "must support", not "must use", but my Chinese students all default to GB.)
But "UTF-8 as the default if you don't specify an encoding" doesn't stop you from using all those other encodings. The only change is that, if you don't specify an encoding, you get a cross-platform consistent default that can be easily described, rather than one which depends on system settings.
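To make that concrete, here is a minimal sketch. The helper name `roundtrip` and the sample string are made up for illustration; the only point is that an explicit encoding= behaves the same whatever the default happens to be.

```python
import os
import tempfile


def roundtrip(text, encoding):
    """Write text with an explicit encoding; return (raw bytes, text read back)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # The explicit encoding= is used as given; no default is consulted.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, encoding=encoding) as f:
            return raw, f.read()


sample = "日本語のテキスト"
for enc in ("shift_jis", "iso2022_jp", "gb18030", "utf-8"):
    raw, decoded = roundtrip(sample, enc)
    # Each encoding round-trips as long as reader and writer agree on it.
    assert decoded == sample
```

Nothing in this depends on what open() does when encoding is omitted, which is the only part a change of default would touch.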
The problem is that these users use some software that will create text in a national language encoding by default and other software that uses UTF-8 by default. So I guess Naoki's hope is that "when I'm processing Microsoft/Oracle-generated data, I use 'open_text', when it's local software I use 'open'" becomes an easy and natural response in such environments.
Exactly, so no single default will work.
Is there an easy way to say open("filename", encoding="use my system default") ? Currently encoding=None does that, and maybe that can be retained (just with the default becoming "utf-8"), or maybe some other keyword can be used. But that should cover the situations where you specifically *want* a platform-dependent selection.
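For what it's worth, a platform-dependent selection can already be spelled out explicitly with the locale module. A rough sketch, where `open_system_default` is a hypothetical helper name and not an existing or proposed API:

```python
import locale


def open_system_default(path, mode="r", **kwargs):
    """Hypothetical helper: open a text file using the platform's preferred encoding."""
    # getpreferredencoding(False) reports the user's preferred encoding
    # without calling setlocale() as a side effect.
    encoding = locale.getpreferredencoding(False)
    return open(path, mode, encoding=encoding, **kwargs)
```

Because the encoding is passed explicitly, this would keep working the same way even if the default for a bare open(fn) changed.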
What exactly are the blockers on making open(fn) use UTF-8 by default?
Backward incompatibility with just about every script in existence?
Or for a large number of them, sudden cross-platform compatibility that they didn't previously have. This is *fixing a bug* for many scripts.
Can the proposals be written with that as the ultimate goal (even if it's going to take X versions and multiple deprecation phases), rather than aiming for a messy goal where people aren't sure which function to use?
The problem is that on Windows there are a lot of installations that still use non-UTF-8 encodings often enough that users set their preferred encoding accordingly. I guess that folks whose native-language alphabet is mostly drawn from ASCII are by now almost all using UTF-8 by default, but this is not so for East Asians (who almost all still use a mixture of several encodings every day because email still often defaults to national standard encodings). I can't speak to Cyrillic, Hebrew, Arabic, or Indic languages, but I wouldn't be surprised if they're somewhere in the middle.
So Windows is being a pain in the behind, once again, because it doesn't move forward. File names on Mac OS and most Linux systems will be in UTF-8, regardless of your chosen language. Why stick to other encodings as the default?
(I repeat: I am NOT advocating abolishing support for all other encodings. The ONLY thing I want to see is that UTF-8 becomes the default.)
Naoki can document that "open(..., encoding='...')" is strongly preferred to 'open_text'. Maybe a better name is "open_utf8", to discourage people who want to use non-default encodings, or programmatically chosen encodings, in that function.
TBH I don't think a separate built-in is of value here, but perhaps it'd be beneficial as an alternative to the wall-of-text help info that open() has. But I do rather like Random's and Steve's suggestion that the alternate function be specifically documented as magic. It'd actually tie in very nicely with a change of default: open() does what it's explicitly told, and has cross-platform defaults, but open_sesame() probes the file to try to guess at its encoding, attempting to use a platform-specific eight-bit encoding if applicable. It'd "just work" for reading most text files, regardless of their source, as long as they came from this current computer. (All bets are off anyway if they came from some other system and are in an eight-bit encoding.)
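A very rough sketch of what such probing might look like, purely as an illustration. The probing order here (BOM, then strict UTF-8, then the locale's preferred encoding) is my assumption, not a worked-out design, and `guess_encoding` is a made-up helper name.

```python
import codecs
import io
import locale


def guess_encoding(raw):
    """Guess an encoding for raw bytes: BOM, then strict UTF-8, then locale."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
    # starts with the same two bytes as the UTF-16-LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the platform-specific encoding.
        return locale.getpreferredencoding(False)
    return "utf-8"


def open_sesame(path):
    """Read the whole file, guess its encoding, return a text stream (read-only sketch)."""
    with open(path, "rb") as f:
        raw = f.read()
    # errors="replace" keeps the sketch from raising when the guess is wrong.
    return io.StringIO(raw.decode(guess_encoding(raw), errors="replace"))
```

This reads the entire file up front, which is fine for a sketch but not for large files, and the fallback is only as good as the assumption that the file came from this computer.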