Mailman 3 Add a couple of options to open()'s mode parameter to deal with common text encodings - Python-ideas

4 Feb 2021

      There's a long ongoing thread with the subject "Make UTF-8 mode more
accessible for Windows users."

There are two obvious problems with UTF-8 mode. First, it applies to entire
installations, or at least entire running scripts, including all imported
libraries no matter who wrote them, etc., making it a blunt instrument.
Second, the problem on Windows isn't that Python happens to use the wrong
default encoding, it's that multiple encodings coexist, and you really do
have to think each time you en/decode something about which encoding you
ought to use. UTF-8 mode doesn't solve that, it just changes the default.

It seems as though most of those commenting in the other thread don't
actually use Python on Windows. I do, and I can say it's a royal pain to
have to write open(path, encoding='utf-8') all the time. If you could write
open(path, 'r', 'utf-8'), that would be slightly better, but the third
parameter is buffering, not encoding, and open(path, 'r', -1, 'utf-8') is
not very readable.

UTF-8 mode is somehow worse, because you now have to decide between writing
open(path), and having your script be incompatible with non-UTF-8 Windows
installations, or writing open(path, encoding='utf-8'), making your script
more compatible but making UTF-8 mode pointless. There's a constant
temptation to sacrifice portability for convenience - a temptation that
Unix users are familiar with, since they omit encoding='utf-8' all the time.

My proposal is to add a couple of single-character options to open()'s mode
parameter. 'b' and 't' already exist, and the encoding parameter
essentially selects subcategories of 't', but it's annoyingly verbose and
so people often omit it.

If '8' was equivalent to specifying encoding='UTF-8', and 'L' was
equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8
mode), that would go a long way toward making open more convenient in the
common cases on Windows, and I bet it would encourage at least some of
those developing on Unixy platforms to write more portable code also. For
other encodings, you can still use 't' (or '') and the encoding parameter.

Note that I am not suggesting that 'L' be equivalent to PEP 597's
encoding='locale', because that's specified to behave the same as
encoding=None, except that it suppresses the warning. I think that's a
terrible idea, because it means that open's behavior still depends on the
global UTF-8 mode even if you specify the encoding explicitly. This is
really a criticism of PEP 597 and not a part of this proposal as such. I
think UTF-8 mode was a bad idea (just like a global "binary mode" that
interpreted every mode='r' as mode='rb' would have been), and it should be
ignored wherever possible. In particular, encoding='locale' should ignore
UTF-8 mode. Then 'L' could and should mean encoding='locale'.

Obviously the names '8' and 'L' are debatable.

'L' could be argued to be unnecessary if there's a simple way to achieve
the same thing with the encoding parameter (which currently there isn't).

Add a couple of options to open()'s mode parameter to deal with common text encodings

Ben Rudiak-Gould

Chris Angelico

Ben Rudiak-Gould

Chris Angelico

Random832

Eryk Sun

Christopher Barker

Inada Naoki

tags

participants (6)