Re: [Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale

4 May 2017

On 4 May 2017 at 12:24, INADA Naoki wrote:
...
[PEP 538]
...
* PEP 540 proposes to entirely decouple CPython's default text encoding from
  the C locale system in that case, allowing text handling inconsistencies to
  arise between CPython and other locale-aware components running in the same
  process and in subprocesses. This approach aims to make CPython behave less
  like a locale-aware application, and more like locale-independent language
  runtimes like the JVM, .NET CLR, Go, Node.js, and Rust.
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html says:
...
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
I don't know much about the .NET runtimes on Unix (Mono and .NET Core).
"Go, Node.js and Rust" seem like enough examples.
I'll push an update to drop the JVM and .NET from the list of examples.
...
...
New build-time configuration options
------------------------------------
[snip]
In case (b), the warning about the C locale is not shown, but the warning
about coercion is still shown.  So when people don't want to see warnings
under the C locale and none of the (C.UTF-8, C.utf8, UTF-8) locales exists,
there are three ways:
* Set PYTHONUTF8=1 (if PEP 540 is accepted)
* Set PYTHONCOERCECLOCALE=0.
* Use both of ``--without-c-locale-coercion`` and ``--without-c-locale-warning``
  configure options.
Is my understanding right?
Yes, that sounds right.
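
A minimal sketch of the two runtime switches from the list above, assuming an
interpreter that honours ``PYTHONCOERCECLOCALE`` and the UTF-8 mode variable
from PEP 540 (spelled ``PYTHONUTF8`` in the accepted PEP). The helper name
``child_env`` is hypothetical; it only builds the environment for a child
interpreter and does not start one:

```python
import os


def child_env(base=None, *, coerce_c_locale=True, utf8_mode=False):
    """Return a copy of ``base`` that runs a child Python in the C locale.

    ``child_env`` is a hypothetical helper used only for illustration.
    """
    env = dict(os.environ if base is None else base)
    # LC_ALL takes precedence over LANG and the individual LC_* variables.
    env["LC_ALL"] = "C"
    if not coerce_c_locale:
        # PYTHONCOERCECLOCALE=0 asks the interpreter to skip locale coercion.
        env["PYTHONCOERCECLOCALE"] = "0"
    if utf8_mode:
        # PYTHONUTF8=1 enables the UTF-8 mode proposed by PEP 540.
        env["PYTHONUTF8"] = "1"
    return env
```

The result can be passed as ``env=`` to ``subprocess.run()`` when launching
the child interpreter.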
...
BTW, I would prefer that PEP 540 provide a ``--with-utf8mode`` option which
enables UTF-8 mode by default.  If it is added, there are too few use cases
for ``--without-c-locale-warning``.
There are some use cases where people want to use UTF-8 by default
system-wide (e.g. containers, web servers on CentOS, etc.).
On the other hand, most C locale usage is on a "per application" basis
rather than "system wide."
A configure option is not suitable for such per-application settings, of course.
Yeah, in addition to Barry requesting such an option in one of the
earlier linux-sig reviews, my main rationale for including it is that
providing both config options offers a quick compatibility fix for any
distro where emitting the coercion and/or C locale warning on stderr
causes problems.

The only one of those that Fedora encountered in the F26 alpha was
deemed a bug in the affected application (something in autotools was
checking for "no output on stderr" instead of "subprocess exit code is
0", and the fix was to switch it to check the subprocess exit code),
but there are enough Linux distros and BSD variants out there that I'm
a lot more comfortable shipping the change with straightforward "off"
switches for the stderr output.
...
But I don't propose removing the option from PEP 538.
We can discuss reducing configure options later.
+1.
...
...
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
Windows) these preprocessor variables would always be undefined.
Why does ``--with[out]-c-locale-coercion`` have no effect on macOS, iOS and Android?
On these three, we know the system encoding is UTF-8, so we never
interpreted the C locale as meaning "ascii" in the first place.
...
On Android, locale coercion fixes readline.  Do you mean locale coercion
always happens, regardless of this configuration option?
Right, the change for Android is that we switch to calling
'setlocale(LC_ALL, "C.UTF-8")' during interpreter startup instead of
'setlocale(LC_ALL, "")'. That change is guarded by "#ifdef
__ANDROID__", rather than either of the new conditionals.
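
Probing for a usable target locale can be sketched from Python with the
``locale`` module. This is an illustration only, not the C code in the
reference implementation: the helper name ``first_available_locale`` is
hypothetical, the candidate list is the (C.UTF-8, C.utf8, UTF-8) trio
mentioned earlier in the thread, and the original ``LC_CTYPE`` setting is
restored before returning:

```python
import locale

# Candidate names taken from the thread; availability is platform dependent.
TARGET_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def first_available_locale(candidates=TARGET_LOCALES):
    """Return the first candidate accepted for LC_CTYPE, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this platform does not provide the locale
            return name
        return None
    finally:
        # Put the process back into the locale it started with.
        locale.setlocale(locale.LC_CTYPE, saved)
```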
...
On macOS, doesn't ``LC_ALL=C python`` set Python's stdio to
``ascii:surrogateescape``?
Similar to Android, CPython itself is hardcoded to assume UTF-8 on Mac
OS X, since that's a platform API guarantee that users can't change.
...
Even so, locale coercion may fix libraries like readline and curses.
While the C locale is less common on macOS, I don't see any
reason to disable coercion on macOS.
My understanding is that other libraries and applications also
automatically use UTF-8 for system interfaces on Mac OS X and iOS. It
could be that that understanding is wrong, and locale coercion would
provide a benefit there as well.

(Checking the draft implementation, it turns out I haven't actually
implemented the configure logic to make those config settings platform
dependent yet - they're currently only undefined on Windows by
default, since that doesn't use the autotools based build system)
...
I know almost nothing about iOS, but I expect it to be similar to Android
or macOS.
...
Improving the handling of the C locale
--------------------------------------
...
...
locale settings for locale-aware operations. Both the JVM and the .NET CLR
use UTF-16-LE as their primary encoding for passing text between applications
and the underlying platform.
The JVM and .NET examples are misleading again.
They just use UTF-16-LE for system calls on Windows, like Python.
I don't know much about them, but I believe they don't use UTF-16 as the
system encoding on Linux.
Sorry, this was ambiguous - it's meant to refer to applications
calling in to the JVM or CLR app runtime, not to the JVM or CLR
calling out to the host operating system. I'll try to make it clearer
in the next update.
...
...
Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------
By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit use
of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
I agree that this PEP shouldn't break byte-transparent behavior in the C
locale by coercing.
But I feel the behavior difference between a coerced C.UTF-8 locale and a
regular C.UTF-8 locale can be a pitfall.
I read the following part of the section and I agree that there is no way to
solve every issue.
But how about using the surrogateescape handler in C.* locales, as in the C locale?
That would be entirely possible, as the code responsible for that
adjustment is the lines:

            char *loc = setlocale(LC_CTYPE, NULL);
            if (loc != NULL && strcmp(loc, "C") == 0)
                errors = "surrogateescape";

Changing that to include "C.UTF-8" as a second locale that also
implies the use of `surrogateescape` would be low risk, and means we
wouldn't need to call Py_SetStandardStreamEncoding.
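
The proposed condition can be sketched in Python as follows. This is not the
CPython source (the real check is the C snippet quoted above);
``stdio_errors_for`` is a hypothetical helper, and ``"strict"`` as the
fallback handler is an assumption made for the illustration:

```python
# Locale names that would imply "surrogateescape" under the proposal.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8"})


def stdio_errors_for(lc_ctype):
    """Pick an error handler name from the current LC_CTYPE locale name.

    ``lc_ctype`` may be None, mirroring setlocale() returning NULL in C.
    """
    if lc_ctype in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"  # assumed fallback for this sketch
```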

As a result, non UTF-8 data (such as latin-1 or GB-18030) would
automatically round-trip, regardless of whether C.UTF-8 was explicitly
set as the locale, or reached as the result of locale coercion.
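
A small sketch of why ``surrogateescape`` gives this round-tripping, using
only the standard ``bytes.decode`` and ``str.encode`` methods. The helper
name ``roundtrip`` and the sample data are illustrative assumptions:

```python
def roundtrip(raw: bytes) -> bytes:
    """Decode as UTF-8 and re-encode, smuggling undecodable bytes through."""
    # Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Encoding with the same handler turns those surrogates back into bytes.
    return text.encode("utf-8", errors="surrogateescape")


latin1_bytes = "café".encode("latin-1")    # b'caf\xe9' is not valid UTF-8
gb18030_bytes = "中文".encode("gb18030")

assert roundtrip(latin1_bytes) == latin1_bytes
assert roundtrip(gb18030_bytes) == gb18030_bytes
```

With the default ``errors="strict"`` the first ``decode`` call would raise
``UnicodeDecodeError`` for the Latin-1 sample instead.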
...
At least it solves the case of a Python 3.7 subprocess running under
Python 3.7 with a coerced C.UTF-8 locale.
It will also extend host/container encoding mismatch compatibility to
containers that explicitly set the C.UTF-8 locale.

That makes me more confident in making that change, as it would be
rather counterproductive if our changes gave base image developers an
incentive *not* to set C.UTF-8 as their default locale.
...
Anyway, I think https://bugs.python.org/issue15216 should be fixed in
Python 3.7 too.
Python applications which require byte-transparent stdio can use
`set_encoding(errors="surrogateescape")` explicitly.
Agreed.
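
As a hedged sketch of the same idea using an API that exists in Python 3.7
and later, ``io.TextIOWrapper.reconfigure()`` can switch the error handler of
an existing text stream. The example uses an in-memory ``BytesIO`` in place
of the real standard streams, and the helper name ``make_byte_transparent``
is hypothetical:

```python
import io


def make_byte_transparent(stream):
    """Switch an existing text stream to the surrogateescape handler."""
    # Only the error handler changes; the stream keeps its encoding.
    stream.reconfigure(errors="surrogateescape")


raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", errors="strict", newline="\n")
make_byte_transparent(out)

# "\udce9" is how surrogateescape represents the undecodable byte 0xE9.
out.write("caf\udce9")
out.flush()
```

An application would pass ``sys.stdout`` (or ``sys.stdin``) to the helper
instead of the in-memory stream.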

Cheers,
Nick.

P.S. I've pushed the JVM/CLR related clarifications, but the standard
stream changes will require a bit more thought and corresponding
updates to the reference implementation - I'll aim to get to that this
weekend.

-- 
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia