[Python-ideas] Re: Add a couple of options to open()'s mode parameter to deal with common text encodings

4 Feb 2021

On 2/4/21, Ben Rudiak-Gould wrote:
...
My proposal is to add a couple of single-character options to open()'s mode
parameter. 'b' and 't' already exist, and the encoding parameter
essentially selects subcategories of 't', but it's annoyingly verbose and
so people often omit it.
If '8' was equivalent to specifying encoding='UTF-8', and 'L' was
equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8
mode), that would go a long way toward making open more convenient in the
common cases on Windows, and I bet it would encourage at least some of
those developing on Unixy platforms to write more portable code also.
A precedent for using the mode parameter is [_w]fopen in MSVC, which
supports a "ccs=<encoding>" flag, where "<encoding>" can be "UTF-8",
"UTF-16LE", or "UNICODE".
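
As an illustration of the proposal quoted above, here is a minimal sketch of how the two flags could be emulated today with a wrapper around open(). The names open_flagged() and real_locale_encoding() are hypothetical and not part of any proposal or library. locale.getencoding() only exists in Python 3.11+; the fallback for older versions is an approximation, because locale.getpreferredencoding(False) does not ignore UTF-8 mode.

```python
import locale


def real_locale_encoding():
    """Best-effort locale encoding that ignores UTF-8 mode (hypothetical helper)."""
    # locale.getencoding() was added in Python 3.11 and ignores UTF-8 mode.
    getencoding = getattr(locale, "getencoding", None)
    if getencoding is not None:
        return getencoding()
    # Approximation for older versions: this call reports UTF-8 when
    # UTF-8 mode is enabled, so it is not an exact equivalent.
    return locale.getpreferredencoding(False)


def open_flagged(path, mode="r", **kwargs):
    """Hypothetical wrapper emulating the proposed '8' and 'L' mode flags."""
    flags = {c for c in mode if c in "8L"}
    if not flags:
        return open(path, mode, **kwargs)
    if len(flags) > 1:
        raise ValueError("'8' and 'L' are mutually exclusive")
    if "b" in mode or "encoding" in kwargs:
        raise ValueError("'8'/'L' require text mode and no explicit encoding")
    encoding = "utf-8" if "8" in flags else real_locale_encoding()
    # Strip the extra flags before handing the mode to the real open().
    plain_mode = mode.replace("8", "").replace("L", "")
    return open(path, plain_mode, encoding=encoding, **kwargs)
```

With this sketch, open_flagged(path, "r8") behaves like open(path, "r", encoding="utf-8"), and open_flagged(path, "rL") uses whatever real_locale_encoding() reports.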

---

Regarding the 'locale' option, keep in mind that Python's implementation
on Windows doesn't use the current LC_CTYPE locale. It only uses the
default locale, which in turn uses the process active (ANSI) code
page. The latter is a system setting, unless overridden to UTF-8 in
the application manifest (e.g. the manifest that's embedded in
"python.exe").

I'd like to see support for a -X option and/or environment variable to
make Python in Windows actually use the current locale to get the
locale encoding (a real shocker, I know). For example,
setlocale(LC_CTYPE, "el_GR") would select "cp1253" (Greek) as the
locale encoding, while setlocale(LC_CTYPE, "el_GR.utf-8") would select
"utf-8" as the locale encoding.
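
To see the current behavior from Python, a small sketch like the following can help. The helper name encoding_under_locale() is made up for illustration. It sets LC_CTYPE temporarily, reports the locale string and what locale.getpreferredencoding(False) returns, and then restores the previous setting. Based on the description above, on Windows the reported encoding would be expected to stay at the ANSI code page regardless of the locale that was set. Locale names such as "el_GR" may not be available on every system, hence the locale.Error handling.

```python
import locale


def encoding_under_locale(name):
    """Set LC_CTYPE to `name`, report what Python sees, then restore it."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        current = locale.setlocale(locale.LC_CTYPE, name)
        return current, locale.getpreferredencoding(False)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    for name in ("el_GR", "el_GR.utf-8"):
        try:
            print(name, "->", encoding_under_locale(name))
        except locale.Error as exc:
            # The locale is not installed or not supported on this system.
            print(name, "-> unavailable:", exc)
```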

(The CRT supports UTF-8 in locales starting with Windows 10, build
17134, released on 2018-04-30.)

At startup, Python 3.8+ calls setlocale(LC_CTYPE, "") to use the
default locale, for use with C functions such as mbstowcs(). This
allows the default behavior to remain the same, unless the new option
also entails attempting locale coercion to UTF-8 via
setlocale(LC_CTYPE, ".utf-8").
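
A rough Python-level sketch of what such a coercion attempt might look like follows. The function try_coerce_ctype_to_utf8() is hypothetical; the real startup code is in C, and this only mirrors the idea of trying ".utf-8" first and falling back to the default locale if the CRT rejects it.

```python
import locale


def try_coerce_ctype_to_utf8():
    """Hypothetical helper: try ".utf-8", otherwise use the default locale."""
    try:
        # The Windows CRT accepts this name on builds with UTF-8 locale
        # support; other C runtimes may reject it.
        return locale.setlocale(locale.LC_CTYPE, ".utf-8")
    except locale.Error:
        pass
    try:
        # Same call that is made when no coercion is attempted.
        return locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Last resort if the environment names an unusable locale.
        return locale.setlocale(locale.LC_CTYPE, "C")
```

The return value is the locale string that ended up in effect, so a caller can check whether the coercion actually took place.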

The following gets the current locale's code page in C:

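    #include <locale.h>
    // ...
    // Take a reference to the current thread's locale.
    _locale_t loc = _get_current_locale();
    __crt_locale_data_public *locinfo =
        (__crt_locale_data_public *)loc->locinfo;
    unsigned int cp = locinfo->_locale_lc_codepage;
    // Release the reference once the code page has been read.
    _free_locale(loc);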

The "C" locale uses code page 0. C mbstowcs() and wcstombs() handle
this case as Latin-1. locale._get_locale_encoding() could instead map
it to the process ANSI code page, GetACP(). Also, the CRT displays
CP_UTF8 (65001) as "utf8". _get_locale_encoding() should map it to
"utf-8" instead of "cp65001".
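
The mapping suggested above can be sketched in Python as follows. The function codepage_to_encoding() is a hypothetical stand-in, not the actual _get_locale_encoding() implementation, and the ANSI code page is passed in as a parameter (in real code it would come from GetACP()) so that the sketch runs on any platform.

```python
CP_UTF8 = 65001


def codepage_to_encoding(codepage, ansi_codepage):
    """Map a CRT locale code page to a Python codec name (illustrative)."""
    if codepage == 0:
        # "C" locale: use the process ANSI code page instead of Latin-1.
        codepage = ansi_codepage
    if codepage == CP_UTF8:
        # Prefer the standard name over "cp65001".
        return "utf-8"
    return f"cp{codepage}"
```

For example, codepage_to_encoding(1253, 1252) gives "cp1253", while a "C" locale (code page 0) on a system with ANSI code page 1252 gives "cp1252".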

Eryk Sun