[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

23 Jan 2021

On Sat, Jan 23, 2021 at 11:34 PM Stephen J. Turnbull wrote:
...
...
I'd rather focus on just moving to UTF-8 as the default, rather
than bringing in a new function - especially with such a confusing
name.
I expect there are several bodies of users who will experience that as
quite obnoxious for a long time to come.  I *still* see a ton of stuff
that is Shift JIS, a fair amount of email in ISO-2022-JP, and in China
gb18030 isn't just a good idea, it's the law.  (OK, the precise
statement of the law is "must support", not "must use", but my Chinese
students all default to GB.)
But "UTF-8 as the default if you don't specify an encoding" doesn't
stop you from using all those other encodings. The only change is
that, if you don't specify an encoding, you get a cross-platform
consistent default that can be easily described, rather than one which
depends on system settings.
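
To illustrate the point, here is a minimal sketch showing that an explicitly named encoding behaves the same whatever the default happens to be. The `roundtrip` helper and the sample text are made up for this example; only `open()` and its `encoding` parameter come from the discussion.

```python
import os
import tempfile

def roundtrip(text, encoding):
    """Write text with an explicit encoding, then read it back the same way."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)

sample = "日本語"
# Each call names its encoding, so the platform default never comes into play.
results = {enc: roundtrip(sample, enc) for enc in ("shift_jis", "gb18030", "utf-8")}
```

Only the calls that omit `encoding` would be affected by a change of default.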
...
The problem is that these users use some software that will create
text in a national language encoding by default and other software
that uses UTF-8 by default.  So I guess Naoki's hope is that "when I'm
processing Microsoft/Oracle-generated data, I use 'open_text', when
it's local software I use 'open'" becomes an easy and natural response
in such environments.
Exactly, so no single default will work.

Is there an easy way to say open("filename", encoding="use my system
default") ? Currently encoding=None does that, and maybe that can be
retained (just with the default becoming "utf-8"), or maybe some other
keyword can be used. But that should cover the situations where you
specifically *want* a platform-dependent selection.
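
One way to spell "use my system default" explicitly today, as a rough sketch: ask the `locale` module for the preferred encoding and pass it to `open()`. This assumes `locale.getpreferredencoding(False)` is an acceptable stand-in for what `encoding=None` picks; the temporary file is only there to make the example self-contained.

```python
import locale
import os
import tempfile

# Ask the platform for its preferred encoding without touching locale settings.
system_encoding = locale.getpreferredencoding(False)

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # The platform-dependent choice is now visible at the call site.
    with open(path, "w", encoding=system_encoding) as f:
        f.write("plain ASCII text")
    with open(path, encoding=system_encoding) as f:
        content = f.read()
finally:
    os.remove(path)
```

The result is still platform-dependent, but the dependence is stated in the code rather than hidden in a default.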
...
...
What exactly are the blockers on making open(fn) use UTF-8 by
default?
Backward incompatibility with just about every script in existence?
Or for a large number of them, sudden cross-platform compatibility
that they didn't previously have. This is *fixing a bug* for many
scripts.
...
...
Can the proposals be written with that as the ultimate goal (even if
it's going to take X versions and multiple deprecation phases), rather
than aiming for a messy goal where people aren't sure which function
to use?
The problem is that on Windows there are a lot of installations that
continue to use non-UTF-8 encodings enough that users set their
preferred encoding that way.  I guess that folks where the majority of
their native-language alphabet is drawn from ASCII are by now almost
all using UTF-8 by default, but this is not so for East Asians (who
almost all still use a mixture of several encodings every day because
email still often defaults to national standard encodings).  I can't
speak to Cyrillic, Hebrew, Arabic, Indic languages, but I wouldn't be
surprised if they're somewhere in the middle.
So Windows is being a pain in the behind, once again, because it
doesn't move forward. File names on Mac OS and most Linux systems will
be in UTF-8, regardless of your chosen language. Why stick to other
encodings as the default?

(I repeat: I am NOT advocating abolishing support for all other
encodings. The ONLY thing I want to see is that UTF-8 becomes the
default.)
...
Naoki can document that "open(..., encoding='...')" is strongly
preferred to 'open_text'.  Maybe a better name is "open_utf8", to
discourage people who want to use non-default encodings, or
programmatically chosen encodings, in that function.
TBH I don't think a separate built-in is of value here, but perhaps
it'd be beneficial as an alternative to the wall-of-text help info
that open() has. But I do rather like Random's and Steve's suggestion
that the alternate function be specifically documented as magic. It'd
actually tie in very nicely with a change of default: open() does what
it's explicitly told, and has cross-platform defaults, but
open_sesame() probes the file to try to guess at its encoding,
attempting to use a platform-specific eight bit encoding if
applicable. It'd "just work" for reading most text files, regardless
of their source, as long as they came from this current computer. (All
bets are off anyway if they came from some other system and are in an
eight-bit encoding.)
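
A very rough sketch of what such a "magic" reader might do, under loud assumptions: `read_text_guess` and its `fallback` parameter are hypothetical names, not part of any proposal, and the probing here is just "try UTF-8, otherwise use the platform's preferred encoding". Real detection would need to be smarter than this.

```python
import locale

def read_text_guess(path, fallback=None):
    """Hypothetical helper: try UTF-8 first, then a platform-specific encoding."""
    if fallback is None:
        # Assumption: the locale's preferred encoding is the local eight-bit one.
        fallback = locale.getpreferredencoding(False)
    with open(path, "rb") as f:
        data = f.read()
    try:
        # "utf-8-sig" accepts UTF-8 with or without a BOM and strips the BOM.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the file came from this machine.
        return data.decode(fallback)
```

Note that this returns the decoded text rather than a file object and does no newline translation, so it is not a drop-in replacement for `open()`. If the fallback is itself UTF-8, an undecodable file still raises `UnicodeDecodeError`.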

ChrisA

Chris Angelico