On 2/4/21, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
> My proposal is to add a couple of single-character options to open()'s mode parameter. 'b' and 't' already exist, and the encoding parameter essentially selects subcategories of 't', but it's annoyingly verbose and so people often omit it.
> If '8' was equivalent to specifying encoding='UTF-8', and 'L' was equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8 mode), that would go a long way toward making open more convenient in the common cases on Windows, and I bet it would encourage at least some of those developing on Unixy platforms to write more portable code also.
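
For illustration only, here is a rough sketch of how the proposed flags might behave if emulated with a wrapper today. The names translate_mode() and open_with_flags() are hypothetical, and the real open() accepts neither '8' nor 'L'. The 'L' branch assumes locale.getencoding() (Python 3.11+) as "the real locale encoding"; the fallback for older versions is only an approximation, since it can report UTF-8 when UTF-8 mode is enabled.

```python
import locale


def translate_mode(mode):
    """Split a mode string into (plain_mode, encoding_or_None).

    '8' and 'L' are the hypothetical flags from the proposal; the real
    open() does not accept them.
    """
    want_utf8 = "8" in mode
    want_locale = "L" in mode
    if want_utf8 and want_locale:
        raise ValueError("mode may contain '8' or 'L', not both")
    if (want_utf8 or want_locale) and "b" in mode:
        raise ValueError("'8' and 'L' only make sense in text mode")

    plain_mode = mode.replace("8", "").replace("L", "")
    if want_utf8:
        return plain_mode, "utf-8"
    if want_locale:
        # locale.getencoding() (Python 3.11+) ignores UTF-8 mode. The
        # fallback may report UTF-8 when UTF-8 mode is on, so it is only
        # an approximation of "the real locale encoding".
        get_encoding = getattr(locale, "getencoding", None)
        if get_encoding is None:
            return plain_mode, locale.getpreferredencoding(False)
        return plain_mode, get_encoding()
    return plain_mode, None


def open_with_flags(file, mode="r", **kwargs):
    """Hypothetical stand-in for open() that understands '8' and 'L'."""
    plain_mode, encoding = translate_mode(mode)
    if encoding is not None:
        if kwargs.get("encoding") is not None:
            raise ValueError("cannot combine an encoding flag with encoding=")
        kwargs["encoding"] = encoding
    return open(file, plain_mode, **kwargs)
```

With this sketch, open_with_flags(path, "w8") would behave like open(path, "w", encoding="utf-8"), and a mode without either flag is passed through to open() unchanged.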
A precedent for using the mode parameter is [_w]fopen in MSVC, which supports a "ccs=<encoding>" flag, where "<encoding>" can be "UTF-8", "UTF-16LE", or "UNICODE".

---

In terms of using the 'locale', keep in mind that the implementation in Windows doesn't use the current LC_CTYPE locale. It only uses the default locale, which in turn uses the process active (ANSI) code page. The latter is a system setting, unless overridden to UTF-8 in the application manifest (e.g. the manifest that's embedded in "python.exe").

I'd like to see support for a -X option and/or environment variable to make Python in Windows actually use the current locale to get the locale encoding (a real shocker, I know). For example, setlocale(LC_CTYPE, "el_GR") would select "cp1253" (Greek) as the locale encoding, while setlocale(LC_CTYPE, "el_GR.utf-8") would select "utf-8" as the locale encoding. (The CRT supports UTF-8 in locales starting with Windows 10, build 17134, released on 2018-04-03.)

At startup, Python 3.8+ calls setlocale(LC_CTYPE, "") to use the default locale, for use with C functions such as mbstowcs(). This allows the default behavior to remain the same, unless the new option also entails attempting locale coercion to UTF-8 via setlocale(LC_CTYPE, ".utf-8").

The following gets the current locale's code page in C:

    #include <locale.h>
    // ...
    loc = _get_current_locale();
    locinfo = (__crt_locale_data_public *)loc->locinfo;
    cp = locinfo->_locale_lc_codepage;

The "C" locale uses code page 0. C mbstowcs() and wcstombs() handle this case as Latin-1. locale._get_locale_encoding() could instead map it to the process ANSI code page, GetACP(). Also, the CRT displays CP_UTF8 (65001) as "utf8". _get_locale_encoding() should map it to "utf-8" instead of "cp65001".
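
To make the suggested mapping concrete, here is a minimal Python sketch of the code-page-to-encoding step. It is not the actual locale._get_locale_encoding() implementation. The function name codepage_to_encoding() is hypothetical, and the ansi_codepage argument is a stand-in for the value of GetACP() so that the logic can be shown without calling into Windows.

```python
CP_UTF8 = 65001


def codepage_to_encoding(codepage, ansi_codepage):
    """Map a CRT locale code page to a Python encoding name.

    codepage: the value of _locale_lc_codepage for the current locale.
    ansi_codepage: stand-in for GetACP(), passed in explicitly here.
    """
    if codepage == 0:
        # The "C" locale reports code page 0; use the process ANSI code
        # page instead of treating it as Latin-1.
        codepage = ansi_codepage
    if codepage == CP_UTF8:
        # Prefer the canonical name over "cp65001".
        return "utf-8"
    return f"cp{codepage}"


# Example values taken from the cases discussed above.
greek = codepage_to_encoding(1253, ansi_codepage=1252)        # "cp1253"
utf8 = codepage_to_encoding(CP_UTF8, ansi_codepage=1252)      # "utf-8"
c_locale = codepage_to_encoding(0, ansi_codepage=1252)        # "cp1252"
```

On Windows, the ANSI code page could presumably be supplied via ctypes.windll.kernel32.GetACP(); that part is left out so the sketch runs on any platform.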