[Python-Dev] PEP 540: Add a new UTF-8 mode (v3)

Fri Dec 8 05:22:25 EST 2017

Hi,

Oh, locale.getpreferredencoding(), that's a good question :-)

2017-12-08 6:02 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:
> But I want to clarify more about difference/relationship between PEP
> 538 and 540.
>
> If I understand correctly:
>
> Both of PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) shares
> same logic to detect POSIX locale.
>
> When POSIX locale is detected, locale coercion is tried first. And if
> locale coercion succeeds, UTF-8 mode is not used because locale is not
> POSIX anymore.

No, I would like to enable the UTF-8 mode as well in this case.

In short, locale coercion and the UTF-8 mode will both be enabled by
the POSIX locale.

> If locale coercion is disabled or failed, UTF-8 mode is used automatically,
> unless it is disabled explicitly.

The UTF-8 mode (PEP 540) is always enabled if the POSIX locale is
detected. Only PYTHONUTF8=0 or -X utf8=0 disables it in this case.

Disabling locale coercion doesn't disable the UTF-8 mode.
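
A quick way to check these switches, as a rough sketch only: it
assumes a Python 3.7 or later interpreter that exposes the mode as
sys.flags.utf8_mode (an attribute not named in this email), and
utf8_mode_of() is a hypothetical helper written for the illustration.

```python
import os
import subprocess
import sys


def utf8_mode_of(args=(), env_overrides=None):
    """Run a child interpreter and return its sys.flags.utf8_mode value.

    Hypothetical helper; assumes Python 3.7+ where sys.flags.utf8_mode exists.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    cmd = [sys.executable, *args, "-c", "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True).stdout
    return int(out.decode("ascii").strip())


if __name__ == "__main__":
    print("-X utf8      ->", utf8_mode_of(args=["-X", "utf8"]))
    print("-X utf8=0    ->", utf8_mode_of(args=["-X", "utf8=0"]))
    print("PYTHONUTF8=1 ->", utf8_mode_of(env_overrides={"PYTHONUTF8": "1"}))
    print("PYTHONUTF8=0 ->", utf8_mode_of(env_overrides={"PYTHONUTF8": "0"}))
```

The sketch only covers the explicit switches. What the interpreter
picks when neither is given depends on the locale and on the Python
version, so it is deliberately not checked here.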

> UTF-8 mode is similar to C.UTF-8 or other locale coercion target locales.
> But UTF-8 mode is different from C.UTF-8 locale in these ways because
> actual locale is not changed:
>
> * Libraries using locale (e.g. readline) works as in POSIX locale.  So UTF-8
>   cannot be used in such libraries.

My assumption is that very few C libraries rely on the locale encoding.
The wchar_t* type is rarely used. You may only get issues if Python
passes a UTF-8 encoded string to a C library which tries to decode it
from the locale encoding, and that encoding is not UTF-8. For example,
with the POSIX locale, if the locale encoding is ASCII, you can get a
decoding error if a C library tries to decode a UTF-8 encoded string
coming from Python.
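
The same failure can be reproduced in pure Python, as a minimal
sketch: decode_like_consumer() is a hypothetical stand-in for a C
library that decodes from the locale encoding, with ASCII assumed as
that encoding.

```python
def decode_like_consumer(data, locale_encoding="ascii"):
    """Decode bytes the way a strict locale-based consumer would.

    Hypothetical stand-in for a C library; ASCII is an assumed locale encoding.
    """
    return data.decode(locale_encoding)


# Bytes as Python would produce them when encoding text to UTF-8.
payload = "caf\u00e9".encode("utf-8")

try:
    decode_like_consumer(payload)
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("consumer could not decode:", exc)

# Pure ASCII text is unaffected, since ASCII is a subset of UTF-8.
print(decode_like_consumer("cafe".encode("utf-8")))
```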

But the encoding problem is not restricted to the current process. In
the "producer | consumer" model, if the producer is a Python 3.7
application running in UTF-8 mode, and therefore encoding text to UTF-8
on stdout, the consumer may be unable to decode the UTF-8 data. Here we
enter the grey area of encodings. Which applications rely on the locale
encoding? Which applications always use UTF-8? Do some applications
try UTF-8 first and fall back on the locale encoding? (OpenSSL does
that for filenames, for example, as does glib if I recall correctly.)
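
The "try UTF-8 first, fall back on the locale encoding" strategy could
look like the sketch below. This is only an illustration of the idea,
not how OpenSSL or glib implement it; decode_utf8_then_locale() and
its fallback_encoding parameter are hypothetical names.

```python
import locale


def decode_utf8_then_locale(data, fallback_encoding=None):
    """Decode bytes as UTF-8, falling back on another encoding on failure.

    Hypothetical helper: fallback_encoding defaults to the locale encoding.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if fallback_encoding is None:
            fallback_encoding = locale.getpreferredencoding(False)
        return data.decode(fallback_encoding)


# Valid UTF-8 is decoded as UTF-8, whatever the fallback is.
print(decode_utf8_then_locale(b"caf\xc3\xa9", "latin-1"))
# Invalid UTF-8 is decoded with the fallback encoding instead.
print(decode_utf8_then_locale(b"caf\xe9", "latin-1"))
```

The fallback can still fail, or silently produce the wrong text, when
the bytes are in neither encoding, which is part of the grey area.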

Until we know exactly how UTF-8 is used in the "wild", I chose to make
the UTF-8 mode opt-in for locales other than POSIX. I expect a few bug
reports later, which will help us adjust our encodings.

> * locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'.  So
>   libraries depending on locale.getpreferredencoding() may raise
>   UnicodeErrors.

Right.

> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Here is where PEP 538 plays very nicely with PEP 540. On platforms
where locale coercion is supported (Fedora, macOS, FreeBSD, maybe
other Linux distributions), with the POSIX locale,
locale.getpreferredencoding() will return UTF-8 and functions like
mbstowcs() will use the UTF-8 encoding internally.

Currently, in my implementation of PEP 540, I chose to modify open()
to use UTF-8 when the UTF-8 mode is enabled, rather than using
locale.getpreferredencoding().
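
The effect on open() can be observed from the outside, as a sketch: it
assumes a Python 3.7 or later interpreter that accepts -X utf8, and
default_open_encoding() is a hypothetical helper. It only reports
which encoding open() ends up with; it says nothing about how the
interpreter chooses it internally.

```python
import codecs
import subprocess
import sys

# Child program: open a file without an explicit encoding and report
# the encoding that open() selected.
CHILD = "import os; f = open(os.devnull); print(f.encoding); f.close()"


def default_open_encoding(*interpreter_args):
    """Return the normalized default text encoding of open() in a child.

    Hypothetical helper; assumes Python 3.7+ for the -X utf8 option.
    """
    cmd = [sys.executable, *interpreter_args, "-c", CHILD]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    # codecs.lookup() normalizes spellings such as "UTF-8" and "utf8".
    return codecs.lookup(out.decode("ascii").strip()).name


if __name__ == "__main__":
    print(default_open_encoding("-X", "utf8"))
```

Code that must not depend on the mode at all can simply pass
encoding="utf-8" to open() explicitly.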

Victor