On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
> I think the main problem here is that open() doesn't use locale.getlocale()[1] as default for the encoding parameter, but instead locale.getpreferredencoding(False).
Currently, locale.getpreferredencoding(False) is implemented as locale._get_locale_encoding(). This ultimately calls _Py_GetLocaleEncoding(), defined in "Python/fileutils.c". TextIOWrapper() calls this C function to get the encoding to use when encoding=None is passed.

In POSIX, _Py_GetLocaleEncoding() calls nl_langinfo(CODESET), which returns the current LC_CTYPE encoding, not the default LC_CTYPE encoding. For example, in Linux:

    >>> setlocale(LC_CTYPE, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> _get_locale_encoding()
    'UTF-8'
    >>> open('test.txt').encoding
    'UTF-8'

    >>> setlocale(LC_CTYPE, 'en_US.ISO-8859-1')
    'en_US.ISO-8859-1'
    >>> _get_locale_encoding()
    'ISO-8859-1'
    >>> open('test.txt').encoding
    'ISO-8859-1'

In Windows, _Py_GetLocaleEncoding() just uses GetACP(), which returns the process ANSI code page. This matches the CRT's default locale, set by setlocale(LC_CTYPE, ""), which combines the user's default locale with the process ANSI code page. I'm not overjoyed about this combination in the default locale, since it's potentially inconsistent (e.g. a Korean user locale with the Latin 1252 process code page), but that ship sailed a long time ago. I'm not arguing to change locale.getdefaultlocale().

The problem is that locale._get_locale_encoding() in Windows does not return the current LC_CTYPE locale encoding, in contrast to how it behaves in POSIX. I'd like an environment variable and/or -X option to fix this flaw. If it's enabled, if the C runtime supports UTF-8 locales (as it has for the past 3 years in Windows 10), and if the application warrants it (e.g. many open() calls across many modules), then convenient use of UTF-8 would be one setlocale() call away.

It's not for packages. Frankly, I don't see why it's a problem for a package developer to use encoding='utf-8' for files that need to use UTF-8. Developing libraries that are designed to work in arbitrary applications on multiple platforms is tedious work. Having to explicitly pass encoding='utf-8' goes with the territory, and it's a minor annoyance in the grand scheme of things.
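To make "one setlocale() call away" concrete, here's a minimal sketch of the current inconsistency, assuming Windows 10 with a recent ucrt and an ANSI code page of 1252; the final comment describes the hypothetical behavior under the proposed option, not what happens today:

    import locale

    # The ucrt has accepted UTF-8 locales since Windows 10 version 1803,
    # so switching the current LC_CTYPE encoding is a single call:
    locale.setlocale(locale.LC_CTYPE, '.UTF-8')

    # But the preferred encoding is still taken from GetACP(), so open()
    # does not follow the current locale:
    print(locale.getpreferredencoding(False))   # e.g. 'cp1252'
    with open('test.txt', 'w') as f:
        print(f.encoding)                       # e.g. 'cp1252', not UTF-8

    # With the proposed environment variable or -X option enabled, both
    # would report the UTF-8 locale encoding instead.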
> That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something.
getlocale() can't be relied on to parse the correct codeset from the locale name, and it can even raise ValueError (more likely in Windows, e.g. with the native locale name "en-US"). The codeset should be queried directly with an API call, such as nl_langinfo(CODESET) in POSIX.

In Windows, the C runtime's POSIX locale implementation doesn't include nl_langinfo(). There's ___lc_codepage_func(), but it's documented as an internal function. A ucrt locale record, however, does expose the code page as a public field, as documented in the public header "corecrt.h". Here's a prototype using ctypes:

    import os
    import ctypes

    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)

    class _crt_locale_data_public(ctypes.Structure):
        _fields_ = (('_locale_pctype', ctypes.POINTER(ctypes.c_ushort)),
                    ('_locale_mb_cur_max', ctypes.c_int),
                    ('_locale_lc_codepage', ctypes.c_uint))

    class _crt_locale_pointers(ctypes.Structure):
        _fields_ = (('locinfo', ctypes.POINTER(_crt_locale_data_public)),
                    ('mbcinfo', ctypes.c_void_p))

    ucrt._get_current_locale.restype = ctypes.POINTER(_crt_locale_pointers)

    CP_UTF8 = 65001

    def _get_locale_encoding():
        locale = ucrt._get_current_locale()
        if not locale:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
        try:
            # Code page of the current LC_CTYPE locale (public field
            # documented in "corecrt.h").
            codepage = locale[0].locinfo[0]._locale_lc_codepage
        finally:
            ucrt._free_locale(locale)
        if codepage == 0:
            return 'latin-1'  # "C" locale
        if codepage == CP_UTF8:
            return 'utf-8'
        return f'cp{codepage}'

Examples with Python 3.9 in Windows 10:

    >>> setlocale(LC_CTYPE, 'C')
    'C'
    >>> _get_locale_encoding()
    'latin-1'

    >>> setlocale(LC_CTYPE, 'en_US')
    'en_US'
    >>> _get_locale_encoding()
    'cp1252'

    >>> setlocale(LC_CTYPE, 'el_GR')
    'el_GR'
    >>> _get_locale_encoding()
    'cp1253'

    >>> setlocale(LC_CTYPE, 'en_US.utf-8')
    'en_US.utf-8'
    >>> _get_locale_encoding()
    'utf-8'
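For completeness, the getlocale() failure mentioned at the top of this reply is easy to reproduce on Windows; a minimal sketch (the exact error message depends on the Python version):

    import locale

    # The ucrt accepts native BCP-47 style names such as 'en-US', but
    # locale.getlocale() can't parse them into a (language, codeset) pair:
    locale.setlocale(locale.LC_CTYPE, 'en-US')
    try:
        print(locale.getlocale(locale.LC_CTYPE))
    except ValueError as exc:
        print(exc)   # e.g. "unknown locale: en-US"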