On 2/11/21, M.-A. Lemburg <mal@egenix.com> wrote:
> I think the main problem here is that open() doesn't use locale.getlocale()[1] as default for the encoding parameter, but instead locale.getpreferredencoding(False).
Currently, locale.getpreferredencoding(False) is implemented as locale._get_locale_encoding(). This ultimately calls _Py_GetLocaleEncoding(), defined in "Python/fileutils.c". TextIOWrapper() calls this C function to get the encoding to use when encoding=None is passed.

In POSIX, _Py_GetLocaleEncoding() calls nl_langinfo(CODESET), which returns the current LC_CTYPE encoding, not the default LC_CTYPE encoding. For example, in Linux:

    >>> setlocale(LC_CTYPE, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> _get_locale_encoding()
    'UTF-8'
    >>> open('test.txt').encoding
    'UTF-8'

    >>> setlocale(LC_CTYPE, 'en_US.ISO-8859-1')
    'en_US.ISO-8859-1'
    >>> _get_locale_encoding()
    'ISO-8859-1'
    >>> open('test.txt').encoding
    'ISO-8859-1'

In Windows, _Py_GetLocaleEncoding() just uses GetACP(), which returns the process ANSI code page. This matches the CRT's default locale, set by setlocale(LC_CTYPE, ""), which combines the user's default locale with the process ANSI code page. I'm not overjoyed about this combination in the default locale, since it's potentially inconsistent (e.g. a Korean user locale with the Latin 1252 process code page), but that ship sailed a long time ago. I'm not arguing to change locale.getdefaultlocale().

The problem is that locale._get_locale_encoding() in Windows does not return the current LC_CTYPE locale encoding, in contrast to how it behaves in POSIX. I'd like an environment variable and/or -X option to fix this flaw. If it's enabled, if the C runtime supports UTF-8 locales (as it has for the past 3 years in Windows 10), and if the application warrants it (e.g. many open() calls across many modules), then convenient use of UTF-8 would be one setlocale() call away.

It's not for packages. Frankly, I don't see why it's a problem for a package developer to use encoding='utf-8' for files that need to use UTF-8. Developing libraries that are designed to work in arbitrary applications on multiple platforms is tedious work. Having to explicitly pass encoding='utf-8' goes with the territory, and it's a minor annoyance in the grand scheme of things.
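To make "one setlocale() call away" concrete, here's a minimal sketch of the current inconsistency, assuming Windows 10 with a recent ucrt and an ANSI code page of 1252; the final comment describes the hypothetical behavior under the proposed option, not what happens today:

    import locale

    # The ucrt has accepted UTF-8 locales since Windows 10 version 1803,
    # so switching the current LC_CTYPE encoding is a single call:
    locale.setlocale(locale.LC_CTYPE, '.UTF-8')

    # But the preferred encoding is still taken from GetACP(), so open()
    # does not follow the current locale:
    print(locale.getpreferredencoding(False))   # e.g. 'cp1252'
    with open('test.txt', 'w') as f:
        print(f.encoding)                       # e.g. 'cp1252', not UTF-8

    # With the proposed environment variable or -X option enabled, both
    # would report the UTF-8 locale encoding instead.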
> That's what getlocale(LC_CTYPE) is intended for, unless I'm missing something.
getlocale() can't be relied on to parse the correct codeset from the locale name, and it can even raise ValueError (more likely in Windows, e.g. with the native locale name "en-US"). The codeset should be queried directly with an API call, such as nl_langinfo(CODESET) in POSIX.

In Windows, the C runtime's POSIX locale implementation doesn't include nl_langinfo(). There's ___lc_codepage_func(), but it's documented as an internal function. A ucrt locale record, however, does expose the code page as a public field, as documented in the public header "corecrt.h". Here's a prototype using ctypes:

    import os
    import ctypes

    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)

    class _crt_locale_data_public(ctypes.Structure):
        _fields_ = (('_locale_pctype', ctypes.POINTER(ctypes.c_ushort)),
                    ('_locale_mb_cur_max', ctypes.c_int),
                    ('_locale_lc_codepage', ctypes.c_uint))

    class _crt_locale_pointers(ctypes.Structure):
        _fields_ = (('locinfo', ctypes.POINTER(_crt_locale_data_public)),
                    ('mbcinfo', ctypes.c_void_p))

    ucrt._get_current_locale.restype = ctypes.POINTER(_crt_locale_pointers)

    CP_UTF8 = 65001

    def _get_locale_encoding():
        locale = ucrt._get_current_locale()
        if not locale:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
        try:
            # Code page of the current LC_CTYPE locale (public field
            # documented in "corecrt.h").
            codepage = locale[0].locinfo[0]._locale_lc_codepage
        finally:
            ucrt._free_locale(locale)
        if codepage == 0:
            return 'latin-1'  # "C" locale
        if codepage == CP_UTF8:
            return 'utf-8'
        return f'cp{codepage}'

Examples with Python 3.9 in Windows 10:

    >>> setlocale(LC_CTYPE, 'C')
    'C'
    >>> _get_locale_encoding()
    'latin-1'

    >>> setlocale(LC_CTYPE, 'en_US')
    'en_US'
    >>> _get_locale_encoding()
    'cp1252'

    >>> setlocale(LC_CTYPE, 'el_GR')
    'el_GR'
    >>> _get_locale_encoding()
    'cp1253'

    >>> setlocale(LC_CTYPE, 'en_US.utf-8')
    'en_US.utf-8'
    >>> _get_locale_encoding()
    'utf-8'
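For completeness, the getlocale() failure mentioned at the top of this reply is easy to reproduce on Windows; a minimal sketch (the exact error message depends on the Python version):

    import locale

    # The ucrt accepts native BCP-47 style names such as 'en-US', but
    # locale.getlocale() can't parse them into a (language, codeset) pair:
    locale.setlocale(locale.LC_CTYPE, 'en-US')
    try:
        print(locale.getlocale(locale.LC_CTYPE))
    except ValueError as exc:
        print(exc)   # e.g. "unknown locale: en-US"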