[Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale

Wed May 3 22:24:27 EDT 2017

Hi, Nick and all core devs who are interested in this PEP.

I'm reviewing PEP 538 and I want to accept it this month.
It will reduce much of the UnicodeError pain that server-side ops people face.
Thank you Nick for working on this PEP.

If you have any concerns about this PEP, please post a comment soon.
If you don't have enough time to read the entire PEP, feel free to
ask a question about whatever worries you.

Here are my comments:

>
> Relationship with other PEPs
> ============================
>
> This PEP shares a common problem statement with PEP 540 (improving Python 3's
> behaviour in the default C locale), but diverges markedly in the proposed
> solution:
>
> * PEP 540 proposes to entirely decouple CPython's default text encoding from
>   the C locale system in that case, allowing text handling inconsistencies to
>   arise between CPython and other locale-aware components running in the same
>   process and in subprocesses. This approach aims to make CPython behave less
>   like a locale-aware application, and more like locale-independent language
>   runtimes like the JVM, .NET CLR, Go, Node.js, and Rust

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html says:

> Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.

I don't know much about the .NET runtime on Unix (Mono and .NET Core).
"Go, Node.js and Rust" seem like enough examples.

> New build-time configuration options
> ------------------------------------
>
> While both of the above behaviours would be enabled by default, they would
> also have new associated configuration options and preprocessor definitions
> for the benefit of redistributors that want to override those default
> settings.
>
> The locale coercion behaviour would be controlled by the flag
> ``--with[out]-c-locale-coercion``, which would set the
> ``PY_COERCE_C_LOCALE`` preprocessor definition.
>
> The locale warning behaviour would be controlled by the flag
> ``--with[out]-c-locale-warning``, which would set the
> ``PY_WARN_ON_C_LOCALE`` preprocessor definition.

"locale warning" means the warning printed when the C locale is used,
am I right?

As I understand it, the "locale warning" is shown in these cases (all cases
assume that the C locale is in effect and PYTHONUTF8 is not enabled).

a. The C locale is used and locale coercion is disabled by the
   ``--without-c-locale-coercion`` configure option.
b. Locale coercion fails because none of the C.UTF-8, C.utf8,
   or UTF-8 locales is available.
c. Python is embedded.  Locale coercion can't be used in this case.

In case (b), while the warning about the C locale is not shown, the warning
about coercion is still shown.  So when people don't want to see warnings
under the C locale and none of the (C.UTF-8, C.utf8, UTF-8) locales is
available, there are three ways:

* Set PYTHONUTF8=1 (if PEP 540 is accepted).
* Set PYTHONCOERCECLOCALE=0.
* Use both the ``--without-c-locale-coercion`` and
  ``--without-c-locale-warning`` configure options.

Is my understanding right?
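
To see whether case (b) applies on a given machine, something like the
following rough sketch could be used.  It only probes which of the three
candidate locales the C library accepts; the helper name
``available_coercion_targets`` is made up for illustration and is not part
of the PEP or of CPython.

```python
import locale

# Candidate target locales, in the order listed above.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def available_coercion_targets(candidates=COERCION_TARGETS):
    """Return the candidate locale names that setlocale() accepts here."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    found = []
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this locale is not installed
            found.append(name)
    finally:
        # Always restore whatever LC_CTYPE was before probing.
        locale.setlocale(locale.LC_CTYPE, saved)
    return found


if __name__ == "__main__":
    print(available_coercion_targets())
```

An empty list would mean the situation described in (b).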

BTW, I would prefer that PEP 540 provide a ``--with-utf8mode`` option which
enables UTF-8 mode by default.  If it is added, there will be too few use
cases for ``--without-c-locale-warning``.

There are some use cases where people want to use UTF-8 by default
system-wide (e.g. containers, web servers on CentOS, etc.).

On the other hand, most C locale usage is on a "per application" basis
rather than "system wide".
A configure option is not suitable for such per-application settings,
of course.

But I don't propose removing the option from PEP 538.
We can discuss reducing configure options later.

>
> On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
> Windows) these preprocessor variables would always be undefined.
>

Why would ``--with[out]-c-locale-coercion`` have no effect on macOS, iOS
and Android?

On Android, locale coercion fixes readline.  Do you mean that locale
coercion always happens regardless of this configuration option?

On macOS, doesn't ``LC_ALL=C python`` make Python's stdio
``ascii:surrogateescape``?
Even so, locale coercion may fix libraries like readline and curses.
While the C locale is less common on macOS, I don't see any
reason to disable it on macOS.
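
For anyone who wants to check this on their own platform, here is a minimal
sketch that starts a child interpreter with ``LC_ALL=C`` and reports the
encoding and error handler of its stdout.  The helper name
``stdio_settings`` is my own, and the result depends on the platform and
Python version, so I don't state an expected output.

```python
import os
import subprocess
import sys


def stdio_settings(**env_overrides):
    """Return (encoding, errors) of sys.stdout in a child interpreter."""
    env = dict(os.environ)
    # Drop settings that would mask the locale we want to test.
    for name in ("LC_ALL", "LC_CTYPE", "LANG", "PYTHONIOENCODING"):
        env.pop(name, None)
    env.update(env_overrides)
    code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    encoding, errors = proc.stdout.decode("ascii").split()
    return encoding, errors


if __name__ == "__main__":
    print("%s:%s" % stdio_settings(LC_ALL="C"))
```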

I know almost nothing about iOS, but I expect it to be similar to Android
or macOS.

> Improving the handling of the C locale
> --------------------------------------
>
...
> locale settings for locale-aware operations. Both the JVM and the .NET CLR
> use UTF-16-LE as their primary encoding for passing text between applications
> and the underlying platform.

The JVM and .NET examples are misleading again.
They just use UTF-16-LE for syscalls on Windows, like Python.

I don't know much about them, but I believe they don't use UTF-16 as the
system encoding on Linux.

> Defaulting to "surrogateescape" error handling on the standard IO streams
> -------------------------------------------------------------------------
> By coercing the locale away from the legacy C default and its assumption of
> ASCII as the preferred text encoding, this PEP also disables the implicit use
> of the "surrogateescape" error handler on the standard IO streams that was
> introduced in Python 3.5 ([15_]), as well as the automatic use of
> ``surrogateescape`` when operating in PEP 540's UTF-8 mode.
>

I agree that this PEP shouldn't break byte-transparent behavior in the C
locale by coercing.
But I feel the behavior difference between a coerced C.UTF-8 locale and a
usual C.UTF-8 locale can be a pitfall.

I read the following part of the section and I agree that there is no way
to solve every issue.
But how about using the surrogateescape handler in C.* locales, as in the
C locale?
It would at least solve the case of a Python 3.7 subprocess running under
Python 3.7 with a coerced C.UTF-8 locale.

Anyway, I think https://bugs.python.org/issue15216 should be fixed in
Python 3.7 too.
Python applications which require byte-transparent stdio can then use
`set_encoding(errors="surrogateescape")` explicitly.
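
Until something like that exists, an application can get the same effect by
wrapping the binary stream itself.  The following is only a sketch of the
idea using ``io.TextIOWrapper``; the helper name ``make_byte_transparent``
is hypothetical, and in-memory buffers stand in for the real stdin and
stdout so the round trip is easy to see.

```python
import io


def make_byte_transparent(binary_stream, encoding="utf-8"):
    """Wrap a binary stream so undecodable bytes survive a round trip."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="",          # no newline translation in either direction
        write_through=True,  # pass writes straight to the binary stream
    )


# b"\xe9" is "e with acute accent" in Latin-1, but it is not valid UTF-8.
raw_in = io.BytesIO(b"caf\xe9\n")
raw_out = io.BytesIO()

reader = make_byte_transparent(raw_in)
writer = make_byte_transparent(raw_out)

text = reader.read()   # the bad byte becomes a lone surrogate in the str
writer.write(text)     # and is turned back into the original byte here
writer.flush()
result = raw_out.getvalue()
```

In a real program the same wrapper would be applied to ``sys.stdin.buffer``
and ``sys.stdout.buffer``.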

Regards,