Add a couple of options to open()'s mode parameter to deal with common text encodings
There's a long ongoing thread with the subject "Make UTF-8 mode more accessible for Windows users." There are two obvious problems with UTF-8 mode. First, it applies to entire installations, or at least entire running scripts, including all imported libraries no matter who wrote them, etc., making it a blunt instrument. Second, the problem on Windows isn't that Python happens to use the wrong default encoding, it's that multiple encodings coexist, and you really do have to think each time you en/decode something about which encoding you ought to use. UTF-8 mode doesn't solve that, it just changes the default.

It seems as though most of those commenting in the other thread don't actually use Python on Windows. I do, and I can say it's a royal pain to have to write open(path, encoding='utf-8') all the time. If you could write open(path, 'r', 'utf-8'), that would be slightly better, but the third parameter is buffering, not encoding, and open(path, 'r', -1, 'utf-8') is not very readable.

UTF-8 mode is somehow worse, because you now have to decide between writing open(path), and having your script be incompatible with non-UTF-8 Windows installations, or writing open(path, encoding='utf-8'), making your script more compatible but making UTF-8 mode pointless. There's a constant temptation to sacrifice portability for convenience - a temptation that Unix users are familiar with, since they omit encoding='utf-8' all the time.

My proposal is to add a couple of single-character options to open()'s mode parameter. 'b' and 't' already exist, and the encoding parameter essentially selects subcategories of 't', but it's annoyingly verbose and so people often omit it. If '8' was equivalent to specifying encoding='UTF-8', and 'L' was equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8 mode), that would go a long way toward making open more convenient in the common cases on Windows, and I bet it would encourage at least some of those developing on Unixy platforms to write more portable code also. For other encodings, you can still use 't' (or '') and the encoding parameter.

Note that I am not suggesting that 'L' be equivalent to PEP 597's encoding='locale', because that's specified to behave the same as encoding=None, except that it suppresses the warning. I think that's a terrible idea, because it means that open's behavior still depends on the global UTF-8 mode even if you specify the encoding explicitly. This is really a criticism of PEP 597 and not a part of this proposal as such. I think UTF-8 mode was a bad idea (just like a global "binary mode" that interpreted every mode='r' as mode='rb' would have been), and it should be ignored wherever possible. In particular, encoding='locale' should ignore UTF-8 mode. Then 'L' could and should mean encoding='locale'.

Obviously the names '8' and 'L' are debatable. 'L' could be argued to be unnecessary if there's a simple way to achieve the same thing with the encoding parameter (which currently there isn't).
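To make the intended effect concrete, here is a rough sketch of a wrapper behaving the way the proposal describes. This is only my illustration, not part of the proposal: the name open_ex is made up, and locale.getpreferredencoding(False) is only an approximation of "the real locale encoding, ignoring UTF-8 mode".

    import builtins
    import locale

    def open_ex(file, mode="r", buffering=-1, **kwargs):
        # Sketch only: '8' selects UTF-8, 'L' selects the locale encoding,
        # everything else passes through to the built-in open().
        if "8" in mode and "L" in mode:
            raise ValueError("cannot combine '8' and 'L'")
        if "8" in mode:
            kwargs.setdefault("encoding", "utf-8")
            mode = mode.replace("8", "")
        elif "L" in mode:
            # Approximation of the locale encoding; a real implementation
            # would have to ignore UTF-8 mode here.
            kwargs.setdefault("encoding", locale.getpreferredencoding(False))
            mode = mode.replace("L", "")
        return builtins.open(file, mode, buffering, **kwargs)

    # open_ex(path, "r8")  would mean  open(path, "r", encoding="utf-8")
    # open_ex(path, "wL")  would mean  open(path, "w", encoding=<locale encoding>)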
On Fri, Feb 5, 2021 at 10:17 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
'L' could be argued to be unnecessary if there's a simple way to achieve the same thing with the encoding parameter (which currently there isn't).
I'd rather work that one out the opposite way, having encoding="locale" (or encoding="system" or something), and then that distinction won't apply. Whether that's PEP 597 or something else, an encoding parameter is the logical way to do this.

The mode parameters have significant impact on the overall behaviour of the resulting file object. If you specify "r", you get something that you can read from; specify "w" and you get something you can write to. With "t", it takes/gives Unicode objects, but with "b" it uses bytes. The encoding parameter controls how it transforms what's on disk into the Unicode strings it returns, but regardless of that value, it will always be returning strings (or accepting strings, when writing). The choice of encoding is, by comparison, a much less significant difference. (Yes, there's the "U" flag for universal newlines, but that's deprecated.)

ChrisA
On Thu, Feb 4, 2021 at 3:29 PM Chris Angelico <rosuav@gmail.com> wrote:
With "t", it takes/gives Unicode objects, but with "b" it uses bytes.
Sure, in Python 3, but not in Python 2, or C. Anyway, moral correctness is beside the point. People in point of fact don't write encoding='utf-8' when they should, because it's so much to type. If you had to write binary=True to enable binary mode, fewer people would have bothered to use it in the Python 2 era, and there would have been more portability (and Python 3 transition) problems. There shouldn't have been, but there would have been. Everything about the mode parameter is a sop to convenience. Really you should write open(mode=io.APPEND) or something.
On Fri, Feb 5, 2021 at 10:46 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
On Thu, Feb 4, 2021 at 3:29 PM Chris Angelico <rosuav@gmail.com> wrote:
With "t", it takes/gives Unicode objects, but with "b" it uses bytes.
Sure, in Python 3, but not in Python 2, or C.
Python 2 isn't changing any more now, and we're not proposing changes to C. We're looking at Python 3 here.
Anyway, moral correctness is beside the point. People in point of fact don't write encoding='utf-8' when they should, because it's so much to type. If you had to write binary=True to enable binary mode, fewer people would have bothered to use it in the Python 2 era, and there would have been more portability (and Python 3 transition) problems. There shouldn't have been, but there would have been. Everything about the mode parameter is a sop to convenience. Really you should write open(mode=io.APPEND) or something.
There WERE problems that resulted from people not specifying "b" or "t" when they needed to. I don't think spelling it binary=True would have made any difference. But it sounds like you're arguing for the complete abolition of the character-flag mode parameter, which I can't strongly dispute, other than that it'd be a massive backward compatibility break. The same line of argument says that we shouldn't be expanding it, especially not with something that can be better spelled in another way (the encoding parameter) and is far more restrictive (only two possible values for that parameter).

ChrisA
On Thu, Feb 4, 2021, at 18:46, Ben Rudiak-Gould wrote:
On Thu, Feb 4, 2021 at 3:29 PM Chris Angelico <rosuav@gmail.com> wrote:
With "t", it takes/gives Unicode objects, but with "b" it uses bytes.
Sure, in Python 3, but not in Python 2, or C.
Anyway, moral correctness is beside the point. People in point of fact don't write encoding='utf-8' when they should, because it's so much to type.
I'll say again, what if it were accepted as a positional argument? The current third positional argument is "buffering", which is an integer, but I doubt there's even much code that uses it intentionally.
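A sketch of how such a shim could disambiguate the third positional argument (my illustration only; the name open_compat is made up):

    import builtins

    def open_compat(file, mode="r", third=-1, **kwargs):
        # Sketch: a string in the third position is taken as the encoding,
        # an integer keeps its current meaning as the buffering size.
        if isinstance(third, str):
            return builtins.open(file, mode, encoding=third, **kwargs)
        return builtins.open(file, mode, buffering=third, **kwargs)

    # open_compat(path, "r", "utf-8")  ~  open(path, "r", encoding="utf-8")
    # open_compat(path, "rb", 0)       ~  open(path, "rb", buffering=0)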
On 2/4/21, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
My proposal is to add a couple of single-character options to open()'s mode parameter. 'b' and 't' already exist, and the encoding parameter essentially selects subcategories of 't', but it's annoyingly verbose and so people often omit it.
If '8' was equivalent to specifying encoding='UTF-8', and 'L' was equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8 mode), that would go a long way toward making open more convenient in the common cases on Windows, and I bet it would encourage at least some of those developing on Unixy platforms to write more portable code also.
A precedent for using the mode parameter is [_w]fopen in MSVC, which supports a "ccs=<encoding>" flag, where "<encoding>" can be "UTF-8", "UTF-16LE", or "UNICODE".

---

In terms of using the 'locale', keep in mind that the implementation in Windows doesn't use the current LC_CTYPE locale. It only uses the default locale, which in turn uses the process's active (ANSI) code page. The latter is a system setting, unless overridden to UTF-8 in the application manifest (e.g. the manifest that's embedded in "python.exe").

I'd like to see support for a -X option and/or environment variable to make Python in Windows actually use the current locale to get the locale encoding (a real shocker, I know). For example, setlocale(LC_CTYPE, "el_GR") would select "cp1253" (Greek) as the locale encoding, while setlocale(LC_CTYPE, "el_GR.utf-8") would select "utf-8" as the locale encoding. (The CRT supports UTF-8 in locales starting with Windows 10, build 17134, released on 2018-04-03.)

At startup, Python 3.8+ calls setlocale(LC_CTYPE, "") to use the default locale, for use with C functions such as mbstowcs(). This allows the default behavior to remain the same, unless the new option also entails attempting locale coercion to UTF-8 via setlocale(LC_CTYPE, ".utf-8").

The following gets the current locale's code page in C:

    #include <locale.h>

    // ...
    _locale_t loc = _get_current_locale();
    __crt_locale_data_public *locinfo =
        (__crt_locale_data_public *)loc->locinfo;
    unsigned int cp = locinfo->_locale_lc_codepage;

The "C" locale uses code page 0. C mbstowcs() and wcstombs() handle this case as Latin-1. locale._get_locale_encoding() could instead map it to the process ANSI code page, GetACP(). Also, the CRT displays CP_UTF8 (65001) as "utf8". _get_locale_encoding() should map it to "utf-8" instead of "cp65001".
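For comparison, a small Python-side sketch of the gap described above (my illustration, Windows-only because of GetACP; the Greek locale name is an example and may not be accepted on every system):

    import ctypes
    import locale

    # What Python currently treats as the locale encoding on Windows:
    # the process's active ANSI code page, independent of setlocale().
    print(ctypes.windll.kernel32.GetACP())       # e.g. 1252
    print(locale.getpreferredencoding(False))    # e.g. 'cp1252'

    # Changing the CRT's LC_CTYPE locale has no effect on the value above,
    # which is what the proposed -X option / environment variable would change.
    try:
        locale.setlocale(locale.LC_CTYPE, "el_GR")   # Greek, cp1253 in the CRT
    except locale.Error:
        pass
    print(locale.getpreferredencoding(False))    # still e.g. 'cp1252'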
On Thu, Feb 4, 2021 at 3:19 PM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
There's a long ongoing thread with the subject "Make UTF-8 mode more accessible for Windows users."
There are two obvious problems with UTF-8 mode.
If you don't think UTF-8 mode is helpful then don't use it -- and maybe join that thread and argue that it should NOT be more accessible.
First, it applies to entire installations, or at least entire running scripts, including all imported libraries no matter who wrote them, etc., making it a blunt instrument.
Yes, indeed, that is the case, which is why that other thread has substantial discussion.
Second, the problem on Windows isn't that Python happens to use the wrong default encoding, it's that multiple encodings coexist, and you really do have to think each time you en/decode something about which encoding you ought to use.
That's the problem with all of Unicode, on all systems -- nothing Windows specific about it.
UTF-8 mode doesn't solve that, it just changes the default.
Not quite -- what UTF-8 mode does is make Python act like it does on virtually every other operating system: the default encoding is utf-8, everywhere, every time, regardless of how the system the code is running on is configured. That solves a substantial problem, and it's why the goal is for Python eventually to default to utf-8 everywhere, on all systems.

Frankly, the idea of Python, which is a programming language / runtime environment, using a system setting for the text file encoding is a really bad idea. In this age of the internet, the idea that a text file is most likely to be encoded in the same encoding as the system default of the machine it happens to run on is just plain wrong. And it leads to real problems, because code that works just fine on one machine may not work right on another -- not just "tested on Linux, broken on Windows", but "tested on one Windows machine, broken on another".

Water under the bridge, but it will take a long time to change the Python defaults, so UTF-8 mode provides a transition: application developers can say that this code will work the same way on all machines if you use UTF-8 mode. Yes, the "right" way to achieve that is to specify an encoding for all text files, but if you, as an application developer, are using packages written by others that may be broken in that way -- you're kind of stuck.
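(For reference, an aside not in the original message: UTF-8 mode already exists and can be enabled per PEP 540, and a script can check for it at runtime; the script name below is just a placeholder.)

    # Two existing ways to enable UTF-8 mode (PEP 540):
    #   PYTHONUTF8=1 python script.py
    #   python -X utf8 script.py
    import sys
    print(sys.flags.utf8_mode)   # 1 when UTF-8 mode is active, else 0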
It seems as though most of those commenting in the other thread don't actually use Python on Windows. I do, and
I'm one of those who "don't use Windows" (or not much) -- but I do write software that I want others to be able to run on Windows.
I can say it's a royal pain to have to write open(path, encoding='utf-8') all the time.
Indeed -- and EVERYONE should be doing that, on all OSes, if you want your code to be cross-platform. And many (most?) don't -- again, that's why UTF-8 mode is useful.
If you could write open(path, 'r', 'utf-8'), that would be slightly better, but the third parameter is buffering, not encoding, and open(path, 'r', -1, 'utf-8') is not very readable.
UTF-8 mode is somehow worse, because you now have to decide between writing open(path), and having your script be incompatible with non-UTF-8 Windows installations,
I personally think that using the "system" encoding is probably never the right choice, but if it is for an application, then what we need is a "system" encoding spelling, as proposed in PEP 597. I think we need that before pushing greater use of UTF-8 mode.

I do agree that making it easier to set the encoding would be good in principle, but the most direct way to solve this problem is to make the default utf-8 everywhere, as it effectively already is in most code -- as wrong as it is, a lot of code is already making that assumption.
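For reference, the spelling PEP 597 proposes (proposed but not yet available at the time of this thread) would make that choice explicit either way; "example.txt" is just a placeholder:

    # explicit and portable: the file is UTF-8 no matter where the code runs
    f = open("example.txt", encoding="utf-8")

    # explicit opt-in to the system/locale encoding, per PEP 597's proposal
    g = open("example.txt", encoding="locale")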
There's a constant temptation to sacrifice portability for convenience - a temptation that Unix users are familiar with, since they omit encoding='utf-8' all the time.
True, but I think many, if not most, folks do not know that they are making that choice, but rather are not thinking about it, and when it works most of the time, they're done. (I'm sure I'm guilty of that!)

Anyway, I think others have said everything I'd say about your specific suggestions, but in short -- yes, it would have been good to make encoding specification easier, but it's too late now, and if we are making any changes, they should be PEP 597 and ultimately making the default utf-8.

- Chris B.

--
Christopher Barker, PhD (Chris)
Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
On Fri, Feb 5, 2021 at 8:20 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
It seems as though most of those commenting in the other thread don't actually use Python on Windows. I do, and I can say it's a royal pain to have to write open(path, encoding='utf-8') all the time. If you could write open(path, 'r', 'utf-8'), that would be slightly better, but the third parameter is buffering, not encoding, and open(path, 'r', -1, 'utf-8') is not very readable.
FWIW, I had another idea with the same motivation: adding an `open_utf8()` function. `open_utf8(filename)` is easier to type than `open(filename, encoding="utf-8")`. But no one supported the idea; everyone thought `encoding="utf-8"` is better than such an alias function. See this thread: https://mail.python.org/archives/list/python-ideas@python.org/thread/PZUYJ5X...

Regards,
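For illustration, such an alias would amount to roughly this (my sketch, not code from that thread):

    def open_utf8(file, mode="r", **kwargs):
        # Thin alias over the built-in open() with the encoding pinned to UTF-8.
        # A real version would also have to reject binary modes, where an
        # encoding argument isn't allowed.
        return open(file, mode, encoding="utf-8", **kwargs)

    # open_utf8("notes.txt")  is equivalent to  open("notes.txt", encoding="utf-8")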
participants (6)
- Ben Rudiak-Gould
- Chris Angelico
- Christopher Barker
- Eryk Sun
- Inada Naoki
- Random832