Hi, Nick and all core devs who are interested in this PEP.
I'm reviewing PEP 538 and I'd like to accept it this month. It will relieve much of the UnicodeError pain that server-side ops people face. Thank you, Nick, for working on this PEP.
If you have any concerns about this PEP, please post a comment soon. If you don't have enough time to read the entire PEP, feel free to ask about whatever worries you.
Here are my comments:
Relationship with other PEPs
This PEP shares a common problem statement with PEP 540 (improving Python 3's behaviour in the default C locale), but diverges markedly in the proposed solution:
- PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other locale-aware components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
I don't know much about the .NET runtime on Unix (Mono and .NET Core). "Go, Node.js and Rust" seem like enough examples.
New build-time configuration options
While both of the above behaviours would be enabled by default, they would also have new associated configuration options and preprocessor definitions for the benefit of redistributors that want to override those default settings.
The locale coercion behaviour would be controlled by the flag ``--with[out]-c-locale-coercion``, which would set the ``PY_COERCE_C_LOCALE`` preprocessor definition.
The locale warning behaviour would be controlled by the flag ``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE`` preprocessor definition.
Does "locale warning" mean the warning printed when the C locale is used?
As I understand it, the "locale warning" is shown in the following cases (all of them assume the C locale is in effect and PYTHONUTF8 is not enabled):
a. The C locale is used and locale coercion is disabled by the ``--without-c-locale-coercion`` configure option.
b. Locale coercion fails because none of the C.UTF-8, C.utf8, or UTF-8 locales is available.
c. Python is embedded. Locale coercion can't be used in this case.
In case (b), even when the warning about the C locale is not shown, the warning about the failed coercion is still shown. So when people don't want to see any warning under the C locale and none of the (C.UTF-8, C.utf8, UTF-8) locales is available, there are three ways:
* Set PYTHONUTF8=1 (if PEP 540 is accepted).
* Set PYTHONCOERCECLOCALE=0.
* Use both the ``--without-c-locale-coercion`` and ``--without-c-locale-warning`` configure options.
Is my understanding correct?
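To make cases (a) to (c) concrete, here is a rough sketch of that reading as a predicate. It only restates the understanding described above and is not taken from the PEP or the reference implementation; the function name and all parameter names are hypothetical.

```python
def locale_warning_expected(*, in_c_locale, utf8_mode, coercion_built_in,
                            target_locale_found, embedded):
    """Return True when, under the reading above, a locale warning is shown.

    All names here are made up for illustration; this mirrors cases
    (a), (b) and (c) from the list above and nothing more.
    """
    # Every case assumes the C locale is in effect and UTF-8 mode is off.
    if not in_c_locale or utf8_mode:
        return False
    # (a) built with --without-c-locale-coercion
    if not coercion_built_in:
        return True
    # (c) embedded Python: coercion cannot be used
    if embedded:
        return True
    # (b) none of C.UTF-8, C.utf8, UTF-8 is available
    if not target_locale_found:
        return True
    # Coercion succeeded, so no warning about the C locale is expected.
    return False
```

If any of the three branches does not match the PEP's intent, that is exactly the part I would like clarified.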
BTW, I'd prefer that PEP 540 provide a ``--with-utf8mode`` option which enables UTF-8 mode by default. If it is added, there would be very few use cases left for ``--without-c-locale-warning``.
There are some use cases where people want to use UTF-8 by default system-wide (e.g. containers, web servers on CentOS, etc.).
On the other hand, most C locale usage is on a "per application" basis rather than "system wide". A configure option is not suitable for such per-application settings, of course.
But I'm not proposing to remove the option from PEP 538. We can discuss reducing the number of configure options later.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, Windows) these preprocessor variables would always be undefined.
Why does ``--with[out]-c-locale-coercion`` have no effect on macOS, iOS and Android?
On Android, locale coercion fixes readline. Do you mean locale coercion always happens regardless of this configuration option?
On macOS, does ``LC_ALL=C python`` not make Python's stdio ``ascii:surrogateescape``? Even so, locale coercion may fix libraries like readline and curses. While the C locale is less common on macOS, I don't see any reason to disable coercion there.
I know almost nothing about iOS, but I expect it to be similar to Android or macOS.
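One way to check the stdio question on a given platform is a small script like the one below. It is only a sketch: the helper name ``stdio_settings`` is made up for illustration, and the result depends on the Python version and on which locales are installed, so no expected output is given.

```python
import os
import subprocess
import sys

# Variables that would otherwise override the locale or stdio settings.
_OVERRIDES = {"LANG", "LANGUAGE", "PYTHONIOENCODING", "PYTHONUTF8",
              "PYTHONCOERCECLOCALE"}


def stdio_settings(extra_env):
    """Run a child interpreter and report its stdout "encoding:errors"."""
    env = {k: v for k, v in os.environ.items()
           if not k.startswith("LC_") and k not in _OVERRIDES}
    env.update(extra_env)
    code = "import sys; print(sys.stdout.encoding + ':' + sys.stdout.errors)"
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Plain C locale, then the same with coercion explicitly turned off.
    print(stdio_settings({"LC_ALL": "C"}))
    print(stdio_settings({"LC_ALL": "C", "PYTHONCOERCECLOCALE": "0"}))
```

Any warning the child interpreter prints goes to stderr, which the helper captures and ignores.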
Improving the handling of the C locale
locale settings for locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as their primary encoding for passing text between applications and the underlying platform.
The JVM and .NET examples are misleading again. They just use UTF-16-LE for system calls on Windows, like Python does.
I don't know much about them, but I believe they don't use UTF-16 as the system encoding on Linux.
Defaulting to "surrogateescape" error handling on the standard IO streams
By coercing the locale away from the legacy C default and its assumption of ASCII as the preferred text encoding, this PEP also disables the implicit use of the "surrogateescape" error handler on the standard IO streams that was introduced in Python 3.5 ([15_]), as well as the automatic use of ``surrogateescape`` when operating in PEP 540's UTF-8 mode.
I agree that this PEP shouldn't break the byte-transparent behavior in the C locale by coercing. But I feel the behavior difference between a coerced C.UTF-8 locale and a regular C.UTF-8 locale can be a pitfall.
I read the rest of the section and I agree that there is no way to solve every issue. But how about using the surrogateescape handler in C.* locales, as in the C locale? At least it would solve the case of a Python 3.7 subprocess running under Python 3.7 with a coerced C.UTF-8 locale.
Anyway, I think https://bugs.python.org/issue15216 should be fixed in Python 3.7 too. Python applications which require byte-transparent stdio can then use `set_encoding(errors="surrogateescape")` explicitly.
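Until something like that is available, an application can get a similar effect by re-wrapping the underlying binary stream. The following is a minimal sketch under that assumption; the helper name ``make_byte_transparent`` is hypothetical, and the demo uses an in-memory stream rather than the real ``sys.stdout``.

```python
import io


def make_byte_transparent(stream, encoding=None):
    """Re-wrap a text stream so undecodable bytes round-trip unchanged."""
    encoding = encoding or stream.encoding
    # Read attributes before detach(); the old wrapper is unusable afterwards.
    line_buffering = stream.line_buffering
    stream.flush()
    binary = stream.detach()
    return io.TextIOWrapper(binary, encoding=encoding,
                            errors="surrogateescape",
                            line_buffering=line_buffering)


if __name__ == "__main__":
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="ascii", errors="strict")
    text = make_byte_transparent(text, encoding="utf-8")
    # A name containing a byte that is not valid UTF-8, decoded the way
    # os.fsdecode() would decode it on a UTF-8 system.
    name = b"caf\xe9".decode("utf-8", "surrogateescape")
    text.write(name)
    text.flush()
    assert raw.getvalue() == b"caf\xe9"
```

For the real standard streams this would be used as ``sys.stdout = make_byte_transparent(sys.stdout)``. ``detach()`` is used so that the old wrapper does not close the shared buffer when it is garbage collected.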