On 2/11/21, M.-A. Lemburg firstname.lastname@example.org wrote:
On 11.02.2021 13:49, Eryk Sun wrote:
Currently, locale.getpreferredencoding(False) is implemented as locale._get_locale_encoding(). This ultimately calls _Py_GetLocaleEncoding(), defined in "Python/fileutils.c". TextIOWrapper() calls this C function to get the encoding to use when encoding=None is passed.
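To see the relationship described above on a given interpreter, something like the following can be run. It is only a sketch: the helper names default_text_encoding() and canonical() are made up for illustration, and it inspects the public behavior rather than the private C function.

```python
import codecs
import io
import locale


def default_text_encoding():
    """Return the encoding TextIOWrapper picks when encoding=None."""
    wrapper = io.TextIOWrapper(io.BytesIO())  # encoding defaults to None
    try:
        return wrapper.encoding
    finally:
        wrapper.close()


def canonical(name):
    """Normalize an encoding name via the codec registry (hypothetical helper)."""
    return codecs.lookup(name).name


if __name__ == "__main__":
    print("TextIOWrapper default:", default_text_encoding())
    print("getpreferredencoding(False):", locale.getpreferredencoding(False))
```

The two names are expected to refer to the same codec after normalization; spelling and case may differ between versions and platforms.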
All that seems to be new in Python 3.10. This is not what's happening in Python 3.9. The _get_locale_encoding() function doesn't even exist.
In previous versions, locale.getpreferredencoding(False) is functionally the same. In 3.10, it is implemented in C via locale._get_locale_encoding().
Why an env variable? You could simply open up a ticket to get this fixed, since 3.10 is not released yet.
I thought it would be best to let users/administrators opt in to POSIX behavior. But maybe it should require opting out.
Windows code pages 1252 and 1253 are not the same as ISO-8859-1 and ISO-8859-7. getlocale() is just looking up the encoding of "en_US" and "el_GR" from the mapping in the locale module. That kind of best-guess result isn't right for locale._get_locale_encoding().
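The difference between the Windows code pages and the ISO encodings can be checked with Python's own codecs. This is a minimal sketch; differing_bytes() is a hypothetical helper, and undefined byte values are decoded with errors="replace" so they can be compared.

```python
def differing_bytes(codec_a, codec_b):
    """List the single-byte values that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # Undefined bytes decode to U+FFFD instead of raising.
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    for win, iso in (("cp1252", "latin-1"), ("cp1253", "iso8859_7")):
        diffs = differing_bytes(win, iso)
        print(win, "vs", iso, "differ at", [hex(v) for v in diffs])
```

For cp1252 versus Latin-1 the differences are confined to the 0x80-0x9F range, for example byte 0x80, which is the euro sign in cp1252 and a C1 control character in Latin-1.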
The returned values for the encoding look mostly correct to me, except the one for the 'C' locale which should be 'ascii'.
The "C" locale in the Windows CRT uses Latin-1 for LC_CTYPE. This is implemented for mbstowcs() by casting from char to wchar_t. It's similar for wcstombs(), and limited to Unicode ordinals below 256. However, the "C" locale isn't consistently Latin-1 across other categories. IIRC, LC_TIME in the "C" locale uses the process ANSI code page for time-zone names, and mojibake is common.
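The cast-based conversion is equivalent to Latin-1 decoding, which can be modeled in Python. This is only an illustration of the mapping, not a call into the CRT; the function names are invented here, and raising ValueError stands in for the error return of the real wcstombs().

```python
def c_locale_mbstowcs(data: bytes) -> str:
    """Model of the cast from char to wchar_t: each byte becomes that ordinal."""
    return "".join(chr(b) for b in data)


def c_locale_wcstombs(text: str) -> bytes:
    """Model of the reverse cast, limited to ordinals below 256."""
    if any(ord(ch) > 255 for ch in text):
        raise ValueError("character outside the Latin-1 range")
    return bytes(ord(ch) for ch in text)


if __name__ == "__main__":
    all_bytes = bytes(range(256))
    # The cast-based model agrees with the latin-1 codec for every byte.
    print(c_locale_mbstowcs(all_bytes) == all_bytes.decode("latin-1"))
```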
Anyway, UTF-8 mode is the way to go these days, esp. if you want to write applications which are portable across platforms and behave the same on all.
Globally setting PYTHONUTF8 forces all scripts to use UTF-8 as the default for open(). I'd like to let scripts opt in to using UTF-8 as the default for open() by way of an explicit setlocale() call such as setlocale(LC_CTYPE, (getdefaultlocale()[0], "UTF-8")) or, Windows only, setlocale(LC_CTYPE, ".UTF-8"). In POSIX, Python already tries coercing the "C" and "POSIX" locales (usually ASCII) to use UTF-8.
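A script-level opt-in along these lines might look like the sketch below. The helper try_utf8_ctype() is hypothetical, and the POSIX locale names "C.UTF-8" and "en_US.UTF-8" are assumptions that may not exist on every system. Whether open() then honors the new LC_CTYPE encoding depends on the Python version and platform; on Windows that is the behavior being proposed here, not something to rely on.

```python
import locale
import sys


def try_utf8_ctype():
    """Try to switch LC_CTYPE to a UTF-8 locale.

    Returns the locale string reported by setlocale(), or None if no
    candidate was accepted.
    """
    if sys.platform == "win32":
        candidates = [".UTF-8"]
    else:
        # Assumed names; availability varies between systems.
        candidates = ["C.UTF-8", "en_US.UTF-8"]
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue
    return None


if __name__ == "__main__":
    print("LC_CTYPE now:", try_utf8_ctype())
    print("getpreferredencoding(False):", locale.getpreferredencoding(False))
```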