PEP 538: Coercing the legacy C locale to a UTF-8 based locale

Hi folks, Late last year I started working on a change to the CPython CLI (*not* the shared library) to get it to coerce the legacy C locale to something based on UTF-8 when a suitable locale is available. After a couple of rounds of iteration on linux-sig and python-ideas, I'm now bringing it to python-dev as a concrete proposal for Python 3.7. For most folks, reading the Abstract plus the draft docs updates in the reference implementation will tell you everything you need to know (if the C.UTF-8, C.utf8 or UTF-8 locales are available, the CLI will automatically attempt to coerce the legacy C locale to one of those rather than persisting with the latter's default assumption of ASCII as the preferred text encoding). However, the full PEP goes into a lot more detail on: * exactly what's broken about CPython's behaviour in the legacy C locale * why I'm in favour of this particular approach to fixing it (i.e. it integrates better with other C/C++ components, as well as being amenable to redistributor backports for 3.6, and environment based configuration for 3.5 and earlier) * why I think implementing both this change *and* Victor's more comprehensive "PYTHONUTF8 mode" proposal in PEP 540 will be better than implementing just one or the other (in some situations, ignoring the platform locale subsystem entirely really is the right approach, and that's the aspect PEP 540 tackles, while this PEP tackles the situations where the C locale behaviour is broken, but you still need to be consistent with the platform settings). Cheers, Nick. 
================================== PEP: 538 Title: Coercing the legacy C locale to a UTF-8 based locale Version: $Revision$ Last-Modified: $Date$ Author: Nick Coghlan <ncoghlan@gmail.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 28-Dec-2016 Python-Version: 3.7 Post-History: 03-Jan-2017 (linux-sig), 07-Jan-2017 (python-ideas), 05-Mar-2017 (python-dev) Abstract ======== An ongoing challenge with Python 3 on \*nix systems is the conflict between needing to use the configured locale encoding by default for consistency with other C/C++ components in the same process and those invoked in subprocesses, and the fact that the standard C locale (as defined in POSIX:2001) typically implies a default text encoding of ASCII, which is entirely inadequate for the development of networked services and client applications in a multilingual world. PEP 540 proposes a change to CPython's handling of the legacy C locale such that CPython will assume the use of UTF-8 in such environments, rather than persisting with the demonstrably problematic assumption of ASCII as an appropriate encoding for communicating with operating system interfaces. This is a good approach for cases where network encoding interoperability is a more important concern than local encoding interoperability. However, it comes at the cost of making CPython's encoding assumptions diverge from those of other C and C++ components in the same process, as well as those of components running in subprocesses that share the same environment. It also requires changes to the internals of how CPython itself works, rather than using existing configuration settings that are supported by Python versions prior to Python 3.7. 
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in PEP 540, the way the CPython implementation handles the default C locale be
changed such that:

* unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``,
  the standalone CPython binary will automatically attempt to coerce the ``C``
  locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
  ``UTF-8``
* if the locale is successfully coerced, and PEP 540 is not accepted, then
  ``PYTHONIOENCODING`` (if not otherwise set) will be set to
  ``utf-8:surrogateescape``
* if the locale is successfully coerced, and PEP 540 *is* accepted, then
  ``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
* if the subsequent runtime initialization process detects that the legacy
  ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or
  ``UTF-8`` are available, locale coercion is disabled, or the runtime is
  embedded in an application other than the main CPython binary), and the
  ``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
  implemented), it will emit a warning on stderr that use of the legacy ``C``
  locale's default ASCII text encoding may cause various Unicode compatibility
  issues

With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is
used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``).

Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
locale coercion behaviour for the Python 3.6.x series by applying the necessary
changes as a downstream patch when first introducing Python 3.6.0.
Background ========== While the CPython interpreter is starting up, it may need to convert from the ``char *`` format to the ``wchar_t *`` format, or from one of those formats to ``PyUnicodeObject *``, in a way that's consistent with the locale settings of the overall system. It handles these cases by relying on the operating system to do the conversion and then ensuring that the text encoding name reported by ``sys.getfilesystemencoding()`` matches the encoding used during this early bootstrapping process. On Apple platforms (including both Mac OS X and iOS), this is straightforward, as Apple guarantees that these operations will always use UTF-8 to do the conversion. On Windows, the limitations of the ``mbcs`` format used by default in these conversions proved sufficiently problematic that PEP 528 and PEP 529 were implemented to bypass the operating system supplied interfaces for binary data handling and force the use of UTF-8 instead. On Android, many components, including CPython, already assume the use of UTF-8 as the system encoding, regardless of the locale setting. However, this isn't the case for all components, and the discrepancy can cause problems in some situations (for example, when using the GNU readline module [16_]). On non-Apple and non-Android \*nix systems, these operations are handled using the C locale system in glibc, which has the following characteristics [4_]: * by default, all processes start in the ``C`` locale, which uses ``ASCII`` for these conversions. This is almost never what anyone doing multilingual text processing actually wants (including CPython and C/C++ GUI frameworks). 
* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on the locale categories configured in the current process environment * if the locale requested by the current environment is unknown, or no specific locale is configured, then the default ``C`` locale will remain active The specific locale category that covers the APIs that CPython depends on is ``LC_CTYPE``, which applies to "classification and conversion of characters, and to multibyte and wide characters" [5_]. Accordingly, CPython includes the following key calls to ``setlocale``: * in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to configure the entire C locale subsystem according to the process environment. It does this prior to making any calls into the shared CPython library * in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that the configured locale settings for that category *always* match those set in the environment. It does this unconditionally, and it *doesn't* revert the process state change in ``Py_Finalize`` (This summary of the locale handling omits several technical details related to exactly where and when the text encoding declared as part of the locale settings is used - see PEP 540 for further discussion, as these particular details matter more when decoupling CPython from the declared C locale than they do when overriding the locale with one based on UTF-8) These calls are usually sufficient to provide sensible behaviour, but they can still fail in the following cases: * SSH environment forwarding means that SSH clients may sometimes forward client locale settings to servers that don't have that locale installed. This leads to CPython running in the default ASCII-based C locale * some process environments (such as Linux containers) may not have any explicit locale configured at all. 
As with unknown locales, this leads to CPython running in the default ASCII-based C locale The simplest way to deal with this problem for currently released versions of CPython is to explicitly set a more sensible locale when launching the application. For example:: LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ... The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the ``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other categories (including ``LC_COLLATE``). It is offered by a number of Linux distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an alternative to the ASCII-based C locale. Mac OS X and other \*BSD systems have taken a different approach, and instead of offering a ``C.UTF-8`` locale, instead offer a partial ``UTF-8`` locale that only defines the ``LC_CTYPE`` category. On such systems, the preferred environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set ``LC_ALL`` or ``LANG``. [17_] In the specific case of Docker containers and similar technologies, the appropriate locale setting can be specified directly in the container image definition. Another common failure case is developers specifying ``LANG=C`` in order to see otherwise translated user interface messages in English, rather than the more narrowly scoped ``LC_MESSAGES=C``. Relationship with other PEPs ============================ This PEP shares a common problem statement with PEP 540 (improving Python 3's behaviour in the default C locale), but diverges markedly in the proposed solution: * PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other C/C++ components running in the same process and in subprocesses. 
This approach aims to make CPython behave less like a locale-aware C/C++ application, and more like C/C++ independent language runtimes like the JVM, .NET CLR, Go, Node.js, and Rust * this PEP proposes to override the legacy C locale with a more recently defined locale that uses UTF-8 as its default text encoding. This means that the text encoding override will apply not only to CPython, but also to any locale aware extension modules loaded into the current process, as well as to locale aware C/C++ applications invoked in subprocesses that inherit their environment from the parent process. This approach aims to retain CPython's traditional strong support for integration with other components written in C and C++, while actively helping to push forward the adoption and standardisation of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale in the wider C/C++ ecosystem After reviewing both PEPs, it became clear that they didn't actually conflict at a technical level, and the proposal in PEP 540 offered a superior option in cases where no suitable locale was available, as well as offering a better reference behaviour for platforms where the notion of a "locale encoding" doesn't make sense (for example, embedded systems running MicroPython rather than the CPython reference interpreter). Meanwhile, this PEP offered improved compatibility with other C/C++ components, and an approach more amenable to being backported to Python 3.6 by downstream redistributors. As a result, this PEP was amended to refer to PEP 540 as a complementary solution that offered improved behaviour both when locale coercion triggered, as well as when none of the standard UTF-8 based locales were available. The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was removed from the list of UTF-8 locales tried as a coercion target, with CPython instead relying solely on the proposed PYTHONUTF8 mode in such cases. 
Motivation
==========

While Linux container technologies like Docker, Kubernetes, and OpenShift are
best known for their use in web service development, the related container
formats and execution models are also being adopted for Linux command line
application development. Technologies like Gnome Flatpak [7_] and
Ubuntu Snappy [8_] further aim to bring these same techniques to Linux GUI
application development.

When using Python 3 for application development in these contexts, it isn't
uncommon to see text encoding related errors akin to the following::

    $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
    $ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed

Even though the same command is likely to work fine when run locally::

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

The source of the problem can be seen by instead running the ``locale`` command
in the three environments::

    $ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=en_AU.UTF-8
    LC_CTYPE="en_AU.UTF-8"
    LC_ALL=
    $ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE="POSIX"
    LC_ALL=
    $ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_ALL=

In this particular example, we can see that the host system locale is set to
"en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By contrast,
the base Docker images for Fedora and Debian don't have any specific locale
set, so they use the POSIX locale by default, which is an alias for the
ASCII-based default C locale.
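The same diagnosis can be made from inside Python rather than via the
``locale`` command. The following is a minimal sketch, not part of the
reference implementation: the helper name ``describe_text_environment`` is
hypothetical, and accepting ``"POSIX"`` alongside ``"C"`` is a defensive
assumption (the query is expected to report ``"C"`` for such aliases).

```python
import locale
import sys


def describe_text_environment():
    """Return the locale and encoding details that drive CPython's text handling.

    Hypothetical helper for illustration only.
    """
    # Passing None queries the active LC_CTYPE setting without changing it.
    ctype_locale = locale.setlocale(locale.LC_CTYPE, None)
    return {
        "LC_CTYPE": ctype_locale,
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Assumption: treat "POSIX" like "C" in case a platform reports the alias.
        "legacy_c_locale": ctype_locale in ("C", "POSIX"),
    }


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print("{}: {}".format(key, value))
```

Running this inside and outside one of the containers above should show
whether the interpreter is operating in the legacy C locale; the exact values
reported depend on the platform and Python version.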
The simplest way to get Python 3 (regardless of the exact version) to behave sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8`` locale that both distros provide:: $ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")' ℙƴ☂ℌøἤ $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")' ℙƴ☂ℌøἤ $ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG' LANG=C.UTF-8 LC_CTYPE="C.UTF-8" LC_ALL= $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG' LANG=C.UTF-8 LANGUAGE= LC_CTYPE="C.UTF-8" LC_ALL= The Alpine Linux based Python images provided by Docker, Inc, already use the C.UTF-8 locale by default:: $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")' ℙƴ☂ℌøἤ $ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG' LANG=C.UTF-8 LANGUAGE= LC_CTYPE="C.UTF-8" LC_ALL= Similarly, for custom container images (i.e. those adding additional content on top of a base distro image), a more suitable locale can be set in the image definition so everything just works by default. However, it would provide a much nicer and more consistent user experience if CPython were able to just deal with this problem automatically rather than relying on redistributors or end users to handle it through system configuration changes. While the glibc developers are working towards making the C.UTF-8 locale universally available for use by glibc based applications like CPython [6_], this unfortunately doesn't help on platforms that ship older versions of glibc without that feature, and also don't provide C.UTF-8 as an on-disk locale the way Debian and Fedora do. For these platforms, the mechanism proposed in PEP 540 at least allows CPython itself to behave sensibly, albeit without any mechanism to get other C/C++ components that decode binary streams as text to do the same. 
Design Principles
=================

The above motivation leads to the following core design principles for the
proposed solution:

* if a locale other than the default C locale is explicitly configured, we'll
  continue to respect it
* if we're changing the locale setting without an explicit config option, we'll
  emit a warning on stderr that we're doing so rather than silently changing
  the process configuration. This will alert application and system integrators
  to the change, even if they don't closely follow the PEP process or Python
  release announcements. However, to minimize the chance of introducing new
  problems for end users, we'll do this *without* using the warnings system, so
  even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options

Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
an additional design principle:

* if a UTF-8 based Linux container is run on a host that is explicitly
  configured to use a non-UTF-8 encoding, and tries to exchange locally
  encoded data with that host rather than exchanging explicitly UTF-8 encoded
  data, CPython will endeavour to correctly round-trip host provided data that
  is concatenated or split solely at common ASCII compatible code points, but
  may otherwise emit nonsensical results.

Specification
=============

To better handle the cases where CPython would otherwise end up attempting
to operate in the ``C`` locale, this PEP proposes that CPython automatically
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale when it is
run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
system and application integrators that they're running CPython in an
unsupported configuration.

Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------

When run as a standalone application, CPython has the opportunity to
reconfigure the C locale before any locale dependent operations are executed
in the process.

This means that it can change the locale settings not only for the CPython
runtime, but also for any other C/C++ components running in the current
process (e.g. as part of extension modules), as well as in subprocesses that
inherit their environment from the current process.

After calling ``setlocale(LC_ALL, "")`` to initialize the locale settings in
the current process, the main interpreter binary will be updated to include
the following call::

    const char *ctype_loc = setlocale(LC_CTYPE, NULL);

This cryptic invocation is the API that C provides to query the current locale
setting without changing it. Given that query, it is possible to check for
exactly the ``C`` locale with ``strcmp``::

    ctype_loc != NULL && strcmp(ctype_loc, "C") == 0  /* true only in the C locale */

This call also returns ``"C"`` when either no particular locale is set, or the
nominal locale is set to an alias for the ``C`` locale (such as ``POSIX``).

Given this information, CPython can then attempt to coerce the locale to one
that uses UTF-8 rather than ASCII as the default encoding.
Three such locales will be tried: * ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and expected to be available by default in a future version of glibc) * ``C.utf8`` (available at least in HP-UX) * ``UTF-8`` (available in at least some \*BSD variants) For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate locale name, such that future calls to ``setlocale()`` will see them, as will other components looking for those settings (such as GUI development frameworks). For the platforms where it is defined, ``UTF-8`` is a partial locale that only defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE`` environment variable would be set when using this fallback option. To adjust automatically to future changes in locale availability, these checks will be implemented at runtime on all platforms other than Mac OS X and Windows, rather than attempting to determine which locales to try at compile time. If the locale settings are changed successfully, and the ``PYTHONIOENCODING`` environment variable is currently unset, then it will be forced to ``PYTHONIOENCODING=utf-8:surrogateescape``. When this locale coercion is activated, the following warning will be printed on stderr, with the warning containing whichever locale was successfully configured:: Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour). When falling back to the ``UTF-8`` locale, the message would be slightly different:: Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour). 
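The coercion steps described above can be sketched in Python. This is only an
illustration under stated assumptions: the actual change is proposed for the
C-level ``main()`` function, before the interpreter exists, and every name
below (``coerce_c_locale``, ``_CANDIDATES``, ``_locale_available``) is
hypothetical. The function works on a plain mapping so that it can be tried
out without touching the real process environment.

```python
import locale
import sys

# (candidate locale, environment variables to set), in the order the PEP lists them.
_CANDIDATES = (
    ("C.UTF-8", ("LC_ALL", "LANG")),
    ("C.utf8", ("LC_ALL", "LANG")),
    ("UTF-8", ("LC_CTYPE",)),  # partial locale: only LC_CTYPE is defined
)

_WARNING = (
    "Python detected LC_CTYPE=C, {variables} set to {name} (set another locale "
    "or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour)."
)


def _locale_available(name):
    """Return True if setlocale() accepts `name` for LC_CTYPE on this system."""
    previous = locale.setlocale(locale.LC_CTYPE, None)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        # Always restore the original setting; this is only a probe.
        locale.setlocale(locale.LC_CTYPE, previous)
    return True


def coerce_c_locale(current_ctype, environ, is_available=_locale_available, stream=None):
    """Apply the coercion rules to `environ`; return the chosen locale or None."""
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None  # coercion explicitly disabled
    if current_ctype != "C":
        return None  # an explicitly configured locale is respected
    for name, variables in _CANDIDATES:
        if not is_available(name):
            continue
        for variable in variables:
            environ[variable] = name
        # Only set PYTHONIOENCODING when the user has not already chosen a value.
        environ.setdefault("PYTHONIOENCODING", "utf-8:surrogateescape")
        message = _WARNING.format(variables=" & ".join(variables), name=name)
        print(message, file=stream if stream is not None else sys.stderr)
        return name
    return None  # no candidate available: the legacy C locale stays active
```

For example, ``coerce_c_locale("C", {}, is_available=lambda name: name == "UTF-8")``
returns ``"UTF-8"`` and sets only ``LC_CTYPE`` and ``PYTHONIOENCODING`` in the
mapping, mirroring the \*BSD fallback case.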
In combination with PEP 540, this locale coercion will mean that the standard
Python binary *and* locale aware C/C++ extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments.

If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
successfully configured, then initialization will continue as usual in the C
locale and the Unicode compatibility warning described in the next section will
be emitted just as it would for any other application.

The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
variable (even when running under the ``-E`` or ``-I`` switches), as the locale
coercion check necessarily takes place before any command line argument
processing.

Changes to the runtime initialization process
---------------------------------------------

By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is *too late* to switch to a different locale -
doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.

Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will be issued::

    Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
    encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
    C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
    locales is recommended.

In this case, no actual change will be made to the locale settings.
Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.

The second sentence providing recommendations would be conditionally compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems).

New build-time configuration options
------------------------------------

While both of the above behaviours would be enabled by default, they would
also have new associated configuration options and preprocessor definitions
for the benefit of redistributors that want to override those default settings.

The locale coercion behaviour would be controlled by the flag
``--with[out]-c-locale-coercion``, which would set the ``PY_COERCE_C_LOCALE``
preprocessor definition.

The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
preprocessor definition.

On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
Windows) these preprocessor variables would always be undefined.

Platform Support Changes
========================

A new "Legacy C Locale" section will be added to PEP 11 that states:

* as of CPython 3.7, the legacy C locale is only supported when operating in
  "UTF-8" mode. Any Unicode handling issues that occur only in that locale
  and cannot be reproduced in an appropriately configured non-ASCII locale will
  be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8``
  (``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
  Any Unicode related integration problems with C/C++ extensions that occur
  only in that locale and cannot be reproduced in an appropriately configured
  non-ASCII locale will be closed as "won't fix".
Rationale ========= Improving the handling of the C locale -------------------------------------- It has been clear for some time that the C locale's default encoding of ``ASCII`` is entirely the wrong choice for development of modern networked services. Newer languages like Rust and Go have eschewed that default entirely, and instead made it a deployment requirement that systems be configured to use UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine) and requires custom build settings to indicate it should use the system locale settings for locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as their primary encoding for passing text between applications and the underlying platform. The challenge for CPython has been the fact that in addition to being used for network service development, it is also extensively used as an embedded scripting language in larger applications, and as a desktop application development language, where it is more important to be consistent with other C/C++ components sharing the same process, as well as with the user's desktop locale settings, than it is with the emergent conventions of modern network service development. 
The core premise of this PEP is that for *all* of these use cases, the assumption of ASCII implied by the default "C" locale is the wrong choice, and furthermore that the following assumptions are valid: * in desktop application use cases, the process locale will *already* be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer * in network service development use cases (especially those based on Linux containers), the process locale may not be configured *at all*, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does Defaulting to "surrogateescape" error handling on the standard IO streams ------------------------------------------------------------------------- By coercing the locale away from the legacy C default and its assumption of ASCII as the preferred text encoding, this PEP also disables the implicit use of the "surrogateescape" error handler on the standard IO streams that was introduced in Python 3.5 ([15_]), as well as the automatic use of ``surrogateescape`` when operating in PEP 540's UTF-8 mode. Rather than introducing yet another configuration option to address that, this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure that the ``surrogateescape`` handler is enabled when the interpreter is required to make assumptions regarding the expected filesystem encoding. The aim of this behaviour is to attempt to ensure that operating system provided text values are typically able to be transparently passed through a Python 3 application even if it is incorrect in assuming that that text has been encoded as UTF-8. 
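The pass-through behaviour that ``surrogateescape`` provides can be seen with
a short, self-contained example. It is a sketch of the error handler's
semantics only, not of the proposed interpreter changes, and the variable
names are illustrative: bytes that are not valid UTF-8 are decoded to lone
surrogate code points and restored unchanged when the text is encoded again.

```python
text = "ℙƴ☂ℌøἤ\n"
gb_bytes = text.encode("gb18030")

# Strict UTF-8 decoding rejects the GB 18030 byte sequence outright.
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# With surrogateescape, each undecodable byte is mapped to a lone surrogate
# in the range U+DC80..U+DCFF instead of raising an exception.
escaped = gb_bytes.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into the
# original bytes, so the data passes through the application unchanged.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print("strict decoding failed:", strict_failed)
print("bytes preserved:", round_tripped == gb_bytes)
print("ASCII newline still visible to str operations:", escaped.endswith("\n"))
```

The escaped string is opaque to most text processing, but operations that only
touch genuinely ASCII code points (such as stripping the trailing newline)
still behave as expected.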
In particular, GB 18030 [12_] is a Chinese national text encoding standard that handles all Unicode code points, that is formally incompatible with both ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate escaped data - the points where GB 18030 reuses ASCII byte values in an incompatible way are likely to be invalid in UTF-8, and will therefore be escaped and opaque to string processing operations that split on or search for the relevant ASCII code points. Operations that don't involve splitting on or searching for particular ASCII or Unicode code point values are almost certain to work correctly. Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text processing operations that don't involve splitting on or searching for particular ASCII or Unicode code point values. As an example, consider two files, one encoded with UTF-8 (the default encoding for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for ``zh_CN.gb18030``):: $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))' $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))' On disk, we can see that these are two very different files:: $ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip()); \ print("GB18030:", open("gb18030.txt", "rb").read().strip())' UTF-8: b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4\n' GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6\n' That nevertheless can both be rendered correctly to the terminal as long as they're decoded prior to printing:: $ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \ print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' UTF-8: ℙƴ☂ℌøἤ GB18030: ℙƴ☂ℌøἤ By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++ utilities will tend to do:: 
$ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt ℙƴ☂ℌøἤ �6�6�0�0�7�9�6�4�0�3�6�6 Even setting a specifically Chinese locale won't help in getting the GB-18030 encoded file rendered correctly:: $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt ℙƴ☂ℌøἤ �6�6�0�0�7�9�6�4�0�3�6�6 The problem is that the *terminal* encoding setting remains UTF-8, regardless of the nominal locale. A GB18030 terminal can be emulated using the ``iconv`` utility:: $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8 鈩櫰粹槀鈩屆羔激 ℙƴ☂ℌøἤ This reverses the problem, such that the GB18030 file is rendered correctly, but the UTF-8 file has been converted to unrelated hanzi characters, rather than the expected rendering of "Python" as non-ASCII characters. With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results in *both* files being displayed incorrectly:: $ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \ print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \ | iconv -f GB18030 -t UTF-8 UTF-8: 鈩櫰粹槀鈩屆羔激 GB18030: 鈩櫰粹槀鈩屆羔激 However, setting the locale correctly means that the emulated GB18030 terminal now displays both files as originally intended:: $ LANG=zh_CN.gb18030 \ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \ print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \ | iconv -f GB18030 -t UTF-8 UTF-8: ℙƴ☂ℌøἤ GB18030: ℙƴ☂ℌøἤ The rationale for retaining ``surrogateescape`` as the default IO encoding is that it will preserve the following helpful behaviour in the C locale:: $ cat gb18030.txt \ | LANG=C python3 -c "import sys; print(sys.stdin.read())" \ | iconv -f GB18030 -t UTF-8 ℙƴ☂ℌøἤ Rather than reverting to the exception seen when a UTF-8 based locale is explicitly configured:: $ cat gb18030.txt \ | python3 -c "import sys; print(sys.stdin.read())" \ | iconv -f GB18030 -t UTF-8 Traceback (most recent call last): File "<string>", line 1, in <module> File 
"/usr/lib64/python3.5/codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently proposes would be to instead *always* default to ``surrogateescape`` on the standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request text encoding validation during stream processing. Adopting such an approach would bring Python 3 more into line with typical C/C++ tools that pass along the raw bytes without checking them for conformance to their nominal encoding, and would hence also make the last example display the desired output:: $ cat gb18030.txt \ | PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \ | iconv -f GB18030 -t UTF-8 ℙƴ☂ℌøἤ Dropping official support for ASCII based text handling in the legacy C locale ------------------------------------------------------------------------------ We've been trying to get strict bytes/text separation to work reliably in the legacy C locale for over a decade at this point. Not only haven't we been able to get it to work, neither has anyone else - the only viable alternatives identified have been to pass the bytes along verbatim without eagerly decoding them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540, Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR). While this PEP ensures that developers that need to do so can still opt-in to running their Python code in the legacy C locale, it also makes clear that we *don't* expect Python 3's Unicode handling to be reliable in that configuration, and the recommended alternative is to use a more appropriate locale setting. 
Providing implicit locale coercion only when running standalone --------------------------------------------------------------- Over the course of Python 3.x development, multiple attempts have been made to improve the handling of incorrect locale settings at the point where the Python interpreter is initialised. The problem that emerged is that this is ultimately *too late* in the interpreter startup process - data such as command line arguments and the contents of environment variables may have already been retrieved from the operating system and processed under the incorrect ASCII text encoding assumption well before ``Py_Initialize`` is called. The problems created by those inconsistencies were then even harder to diagnose and debug than those created by believing the operating system's claim that ASCII was a suitable encoding to use for operating system interfaces. This was the case even for the default CPython binary, let alone larger C/C++ applications that embed CPython as a scripting engine. The approach proposed in this PEP handles that problem by moving the locale coercion as early as possible in the interpreter startup sequence when running standalone: it takes place directly in the C-level ``main()`` function, even before calling in to the ``Py_Main()`` library function that implements the features of the CPython interpreter CLI. The ``Py_Initialize`` API then only gains an explicit warning (emitted on ``stderr``) when it detects use of the ``C`` locale, and relies on the embedding application to specify something more reasonable. Querying LC_CTYPE for C locale detection ---------------------------------------- ``LC_CTYPE`` is the actual locale category that CPython relies on to drive the implicit decoding of environment variables, command line arguments, and other text values received from the operating system. 
As such, it makes sense to check it specifically when attempting to determine whether or not the current locale configuration is likely to cause Unicode handling problems. Setting both LANG & LC_ALL for C.UTF-8 locale coercion ------------------------------------------------------ Python is often used as a glue language, integrating other C/C++ ABI compatible components in the current process, and components written in arbitrary languages in subprocesses. Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all C/C++ components in the current process and in any subprocesses that inherit the current environment. This is important to handle cases where the problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is configured to forward locale settings, and the user logs into a Linux server). Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``. Together, these should ensure that when the locale coercion is activated, the switch to the C.UTF-8 locale will be applied consistently across the current process and any subprocesses that inherit the current environment. Allowing restoration of the legacy behaviour -------------------------------------------- The CPython command line interpreter is often used to investigate faults that occur in other applications that embed CPython, and those applications may still be using the C locale even after this PEP is implemented. Providing a simple on/off switch for the locale coercion behaviour makes it much easier to reproduce the behaviour of such applications for debugging purposes, as well as making it easier to reproduce the behaviour of older 3.x runtimes even when running a version with this change applied. 
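The coercion rules described in the sections above can be modelled in a few lines of Python. This is only an illustrative sketch, not the reference implementation (which performs the step in C inside ``main()``). The function name ``coerce_c_locale`` and its parameters are hypothetical, and treating ``C.utf8`` the same way as ``C.UTF-8`` is an assumption; the split between "set ``LC_ALL`` and ``LANG``" for ``C.UTF-8`` and "set only ``LC_CTYPE``" for ``UTF-8`` follows the warning messages quoted later in this thread.

```python
# Illustrative model only: CPython performs this step in C, inside main().
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def coerce_c_locale(environ, lc_ctype, available_locales):
    """Return a copy of ``environ`` with the draft coercion rules applied.

    ``lc_ctype`` stands in for the LC_CTYPE value reported by the C runtime,
    and ``available_locales`` for the locale names the platform can load.
    Both are hypothetical inputs used to keep the sketch self-contained.
    """
    env = dict(environ)
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return env  # explicit opt-out: keep the legacy C locale
    if lc_ctype != "C":
        return env  # only the legacy C locale is coerced
    for target in COERCION_TARGETS:
        if target not in available_locales:
            continue
        if target == "UTF-8":
            # A bare "UTF-8" locale only defines LC_CTYPE
            env["LC_CTYPE"] = target
        else:
            # Assumption: C.utf8 is handled the same way as C.UTF-8
            env["LC_ALL"] = target
            env["LANG"] = target
        return env
    return env  # no suitable target: environment left unchanged


print(coerce_c_locale({}, "C", {"C.UTF-8"}))
print(coerce_c_locale({"PYTHONCOERCECLOCALE": "0"}, "C", {"C.UTF-8"}))
```

Returning a modified copy rather than mutating ``os.environ`` keeps the sketch side-effect free; the real change has to alter the process environment so that subprocesses inherit the coerced settings.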
Implementation ============== A draft implementation of the change (including test cases and documentation) is linked from issue 28180 [1_], which is an end user request that ``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``. This patch is now being maintained as the ``pep538-coerce-c-locale`` feature branch [18_] in Nick Coghlan's fork of the CPython repository on GitHub. NOTE: As discussed in [1_], the currently posted draft implementation has some known issues on Android. Backporting to earlier Python 3 releases ======================================== Backporting to Python 3.6.0 --------------------------- If this PEP is accepted for Python 3.7, redistributors backporting the change specifically to their initial Python 3.6.0 release will be both allowed and encouraged. However, such backports should only be undertaken either in conjunction with the changes needed to also provide a suitable locale by default, or else specifically for platforms where such a locale is already consistently available. Backporting to other 3.x releases --------------------------------- While the proposed behavioural change is seen primarily as a bug fix addressing Python 3's current misbehaviour in the default ASCII-based C locale, it still represents a reasonably significant change in the way CPython interacts with the C locale system. As such, while some redistributors may still choose to backport it to even earlier Python 3.x releases based on the needs and interests of their particular user base, this wouldn't be encouraged as a general practice. However, configuring Python 3 *environments* (such as base container images) to use these configuration settings by default is both allowed and recommended. 
Acknowledgements ================ The locale coercion approach proposed in this PEP is inspired directly by Armin Ronacher's handling of this problem in the ``click`` command line utility development framework [2_]:: $ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()' Traceback (most recent call last): ... RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. Either run this under Python 2 or consult http://click.pocoo.org/python3/ for mitigation steps. This system supports the C.UTF-8 locale which is recommended. You might be able to resolve your issue by exporting the following environment variables: export LC_ALL=C.UTF-8 export LANG=C.UTF-8 The change was originally proposed as a downstream patch for Fedora's system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7 with a section allowing for backports to earlier versions by redistributors. The initial draft was posted to the Python Linux SIG for discussion [10_] and then amended based on both that discussion and Victor Stinner's work in PEP 540 [11_]. The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_]. Stephen Turnbull has long provided valuable insight into the text encoding handling challenges he regularly encounters at the University of Tsukuba (筑波大学). References ========== .. [1] CPython: sys.getfilesystemencoding() should default to utf-8 (http://bugs.python.org/issue28180) .. [2] Locale configuration required for click applications under Python 3 (http://click.pocoo.org/5/python3/#python-3-surrogate-handling) .. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale (https://bugzilla.redhat.com/show_bug.cgi?id=1404918) .. [4] GNU C: How Programs Set the Locale ( https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html) .. 
[5] GNU C: Locale Categories ( https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html) .. [6] glibc C.UTF-8 locale proposal (https://sourceware.org/glibc/wiki/Proposals/C.UTF-8) .. [7] GNOME Flatpak (http://flatpak.org/) .. [8] Ubuntu Snappy (https://www.ubuntu.com/desktop/snappy) .. [9] Pragmatic Unicode (http://nedbatchelder.com/text/unipain.html) .. [10] linux-sig discussion of initial PEP draft (https://mail.python.org/pipermail/linux-sig/2017-January/000014.html) .. [11] Feedback notes from linux-sig discussion and PEP 540 (https://github.com/python/peps/issues/171) .. [12] GB 18030 (https://en.wikipedia.org/wiki/GB_18030) .. [13] Shift-JIS (https://en.wikipedia.org/wiki/Shift_JIS) .. [14] ISO-2022 (https://en.wikipedia.org/wiki/ISO/IEC_2022) .. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale (https://bugs.python.org/issue19977) .. [16] test_readline.test_nonascii fails on Android (http://bugs.python.org/issue28997) .. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English" (http://bugs.python.org/issue18378#msg215215) .. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale`` ( https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-... ) Copyright ========= This document has been placed in the public domain under the terms of the CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/ -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

LGTM and I love this PEP and PEP 540. Some comments: ...
I prefer just "locale-aware" / "locale-independent" (application | library | function) to "locale-aware C/C++ application" / "C/C++ independent" here. Both Rust and Node.JS are linked with libc, and Node.JS (V8) is written in C++. They just demonstrate that many people prefer "always UTF-8" to "LC_CTYPE-aware encoding" in real-world applications. C/C++ can also be used for both locale-aware and locale-independent applications. I can print "こんにちは、世界" in the C locale, because stdio is byte transparent. There are many locale-independent libraries written in C (zlib, libjpeg, etc.), and some functions in libc are locale-independent or LC_CTYPE-independent (printf is locale-aware, but it uses LC_NUMERIC, not LC_CTYPE). ...
If it's really encouraged, how about providing the patch officially, or backporting it in 3.6.2 but disabled by default? Some Python users (including my company) use pyenv or pythonz to build Python from source. This PEP and PEP 540 are important for them too.

On 6 March 2017 at 00:39, INADA Naoki <songofacandy@gmail.com> wrote:
Good point, I'll fix that in the next update.
For PEP 540, the changes are too intrusive to consider it a reasonable candidate for backporting to an earlier feature release, so for that aspect, we'll *all* be waiting for 3.7. For this PEP, while it's deliberately unobtrusive to make it more backporting friendly, 3.7 isn't *that* far away, and I didn't think to seriously pursue this approach until well after the 3.6 beta deadline for new features had passed. With it being clearly outside the normal bounds of what's appropriate for a cross-platform maintenance release, that means the only folks that can consider it for earlier releases are those building their own binaries for more constrained target environments. I can definitely make sure the patch is readily available for anyone that wants to apply it to their own builds, though (I'll upload it to both the Python tracker issue and the downstream Fedora Bugzilla entry). I also wouldn't completely close the door on the idea of classifying the change as a bug fix in CPython's handling of the C locale (and hence adding it to a later 3.6.x feature release), but I think the time to pursue that would be *after* we've had a chance to see how folks react to the redistributor customizations. I *think* it will be universally positive (because the status quo really is broken), but it also wouldn't be the first time I've learned something new and confusing about the locale subsystem only after releasing software that relied on an incorrect assumption about it :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 5 March 2017 at 17:50, Nick Coghlan <ncoghlan@gmail.com> wrote:
In terms of resolving this PEP, if Guido doesn't feel inclined to wade into the intricacies of legacy C locale handling, Barry has indicated he'd be happy to act as BDFL-Delegate :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 9 March 2017 at 07:58, Guido van Rossum <guido@python.org> wrote:
OK, I've added Barry to the PEP as BDFL-Delegate: https://github.com/python/peps/commit/4c46c5710031cac03a8d1ab7639272957998a1... Thanks for the quick response! Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

This is a very bad idea. It seems to be based on an assumption that the C locale is always some kind of pathology. Admittedly, it sometimes is a result of misconfiguration or a mistake. (But I don't see why it's the interpreter's job to correct such mistakes.) However, in some cases the C locale is a normal environment for system services, cron scripts, distro package builds and whatnot. It's possible to write Python programs that are locale-agnostic. It's also possible to write programs that are locale-dependent, but handle ASCII as locale encoding gracefully. Or you might want to write a program that intentionally aborts with an explanatory error message when the locale encoding doesn't have sufficient Unicode coverage. ("Errors should never pass silently" anyone?) With this proposal, none of the above seems possible to correctly implement in Python. * Nick Coghlan <ncoghlan@gmail.com>, 2017-03-05, 17:50:
Setting LANGUAGE=en might be better, because it doesn't affect locale encoding either, and it works even when LC_ALL is set.
Calling the C locale "legacy" is a bit unfair, when there isn't even agreement on what the name of the successor is supposed to be... NB, both "C.UTF-8" and "C.utf8" work on Fedora, thanks to glibc normalizing the encoding part. Only "C.UTF-8" works on Debian, though, for whatever reason.
Sounds wrong. This will override all LC_*, even if they were originally set to something different from C.
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
Comma splice. s/set/was set/ would probably make it clearer.
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
Ditto.
Note that at least OpenBSD supports both "C.UTF-8" and "UTF-8" locales.
While this PEP ensures that developers that need to do so can still opt-in to running their Python code in the legacy C locale,
Yeah, no, it doesn't. It's impossible to disable coercion from Python code, because it happens too early. The best you can do is to write a wrapper script in a different language that sets PYTHONCOERCECLOCALE=0; but then you still get a spurious warning. -- Jakub Wilk

On 12 March 2017 at 08:36, Jakub Wilk <jwilk@jwilk.net> wrote:
An environment in which Python 3's eager decoding of operating system provided values to Unicode fails.
It's possible to write Python programs that are locale-agnostic.
If a program is genuinely locale-agnostic, it will be unaffected by this PEP.
It's also possible to write programs that are locale-dependent, but handle ASCII as locale encoding gracefully.
No, it is not generally feasible to write such programs in Python 3. That's the essence of the problem, and why the PEP deprecates support for the legacy C locale in Python 3.
This is what click does, but it only does it because it isn't possible for click to do the right thing given Python 3's eager decoding of various values as ASCII.
With this proposal, none of the above seems possible to correctly implement in Python.
The first case remains unchanged, the other two will need to use Python 2.7 or Tauthon. I'm fine with that.
It's not a spurious warning, as Python 3's Unicode handling for environmental interactions genuinely doesn't work properly in the legacy C locale (unless you're genuinely promising to only ever feed it ASCII values, but that isn't a realistic guarantee to make). However, I'm also open to having that particular setting also disable the runtime warning from the shared library. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12 March 2017 at 22:57, Nick Coghlan <ncoghlan@gmail.com> wrote:
However, I'm also open to having [PYTHONCOERCECLOCALE=0] also disable the runtime warning from the shared library.
Considering this a little further, I think this is going to be necessary in order to sensibly handle the build time "--with[out]-c-locale-warning" flag in the test suite. Currently, there are a number of tests beyond the new ones in Lib/test/test_locale_coercion.py that would need to know whether or not to expect to see a warning in subprocesses in order to correctly handle the "--without-c-locale-warning" case: https://github.com/ncoghlan/cpython/commit/78c17a7cea04aed7cd1fce8ae5afb085a... If PYTHONCOERCECLOCALE=0 turned off the runtime warning as well, then the behaviour of those tests would remain independent of the build flag as long as they set the new environment variable in the child process - the warning would be disabled either at build time via "--without-c-locale-warning" or at runtime with "PYTHONCOERCECLOCALE=0". The check for the runtime C locale warning would then be added to _testembed rather than going through a normal Python subprocess, and that test would be the only one that needed to know whether or not the locale warning had been disabled at build time (which we could indicate simply by compiling the embedding part of the test differently in that case). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
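As a rough illustration of the kind of subprocess-based check discussed above, a test helper could force the C locale and set ``PYTHONCOERCECLOCALE=0`` in the child's environment. This is a sketch only: the helper name ``run_in_c_locale`` is hypothetical, and what the child reports for its stream encoding, and whether anything appears on stderr, depends on the interpreter version and build options, so nothing is assumed about either here.

```python
import os
import subprocess
import sys


def run_in_c_locale(code):
    """Run ``code`` in a child interpreter with LC_ALL=C and coercion disabled."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"               # overrides LANG and the other LC_* settings
    env["PYTHONCOERCECLOCALE"] = "0"  # the opt-out switch discussed in this thread
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


result = run_in_c_locale("import sys; print(sys.stdout.encoding)")
print("exit code:", result.returncode)
print("child stdout encoding:", result.stdout.strip())
print("child stderr:", result.stderr.strip() or "(empty)")
```

Copying ``os.environ`` and overriding only the two variables keeps the rest of the child environment (such as ``PATH``) intact, which matters on platforms where the interpreter needs it to start.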

I think "C locale + use UTF-8 for stdio + fs" is a common setup, especially for servers. It's not a mistake or a misconfiguration. Perl, Ruby, Rust, Node.JS and Go can use UTF-8 without any pain in the C locale, and current Python is painful for such cases. So I'm strongly +1 on PEP 540 (UTF-8 mode). On the other hand, PEP 538 is for locale-dependent libraries (like curses) and subprocesses. I agree the C locale is a misconfiguration if the user wants to use UTF-8 in locale-dependent libraries. And I agree the current PEP 538 seems to carry it a bit too far. But locale coercion works nicely on platforms like Android. So how about a simplified version of PEP 538? Just add a configure option for locale coercion which is disabled by default. No envvar options and no warnings. Regards,

On 13 March 2017 at 18:37, INADA Naoki <songofacandy@gmail.com> wrote:
That doesn't solve my original Linux distro problem, where locale misconfiguration problems show up as "Python 2 works, Python 3 doesn't work" behaviour and bug reports. The problem is that where Python 2 was largely locale-independent by default (just passing raw bytes through) such that you'd only get immediate encoding or decoding errors if you had a Unicode literal or a decode() call somewhere in your code and would otherwise pass data corruption problems further down the chain, Python 3 is locale-*aware* by default, and eagerly decodes:
- command line parameters
- environment variables
- responses from operating system API calls
- standard stream input
- file contents
You *can* still write locale-independent Python 3 applications, but they involve sprinkling liberal doses of "b" prefixes and suffixes and mode settings and "surrogateescape" error handler declarations in various places - you can't just run python-modernize over a pre-existing Python 2 application and expect it to behave the same way in the C locale as it did before. Once implemented, PEP 540 will partially solve the problem by introducing a locale independent UTF-8 mode, but that still leaves the inconsistency with other locale-aware components that are needing to deal with Python 3 API calls that accept or return Unicode objects where Python 2 allowed the use of 8-bit strings. Folks that really want the old behaviour back will be able to set PYTHONCOERCECLOCALE=0 (as that no longer emits any warnings), or else build their own CPython from source using ``--without-c-locale-coercion`` and ``--without-c-locale-warning``. However, they'll also get the explicit support notification from PEP 11 that any Unicode handling bugs they run into in those configurations are entirely their own problem - we won't fix them, because we consider those configurations unsupportable in the general case. That puts the additional self-support burden on folks doing something unusual (i.e. insisting on running an ASCII-only environment in 2017), rather than on those with a more conventional use case (i.e. running an up to date \*nix OS using UTF-8 or another universal encoding for both local and remote interfaces). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
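To make the "surrogateescape" point above concrete, here is a minimal sketch of passing non-UTF-8 bytes through Python 3 text handling without losing them. It reuses the GB 18030 example string from the PEP; the variable names are illustrative only.

```python
text = "ℙƴ☂ℌøἤ\n"
raw = text.encode("gb18030")  # bytes as they would arrive from a GB 18030 source

# Strict UTF-8 decoding rejects the data outright
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so the str is opaque but re-encodes to exactly the original bytes
opaque = raw.decode("utf-8", errors="surrogateescape")
round_tripped = opaque.encode("utf-8", errors="surrogateescape")

print("strict decode succeeded:", strict_ok)
print("bytes preserved:", round_tripped == raw)
print("decodes correctly afterwards:", round_tripped.decode("gb18030") == text)
```

The escaped string is only safe to pass along; as the PEP notes, operations that split on or search for particular ASCII code points may still misbehave on such data.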

On Mon, Mar 13, 2017 at 8:01 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Sorry, I meant "PEP 540 + Simplified PEP 538 (coercing by configure option)". Distros can enable the configure option, of course.
I feel the problems that PEP 538 solves but PEP 540 doesn't are relatively small compared with the complexity PEP 538 introduces. As I understand it, PEP 538 solves problems only when: * the python executable is used (GUI applications linking Python for plugins are not affected). * One of C.UTF-8, C.utf8 or UTF-8 is accepted for LC_CTYPE. * The "locale-aware components" use something other than ASCII or UTF-8 in the C locale, but use UTF-8 in a UTF-8 locale. Can't we reduce the number of options from 3 (2 configure, 1 envvar) if PEP 540 is accepted too?

On Mon, Mar 13, 2017 at 10:31 PM, Random832 <random832@fastmail.com> wrote:
Yes. People who build Python understand the platform better than users in most cases. For Android builds, they know coercion works well on Android. For Linux distros, they know whether the system supports locales like C.UTF-8, and whether there are any python-xxxx packages which may cause the problem that coercion solves. People who build Python themselves (in Docker, pyenv, etc.) know how they use Python.

On 13 March 2017 at 23:31, Random832 <random832@fastmail.com> wrote:
Distro packagers have narrower user bases and a better known set of compatibility constraints than upstream, so kicking platform integration related config decisions downstream to us(/them) is actually a pretty reasonable thing for upstream to do :) For example, while I've been iterating on the reference implementation for 3.7, Charalampos Stratakis has been iterating on the backport patch for Fedora 26, and he's found that we really need the PEP's "disable the C locale warning" config option to turn off the CLI's coercion warning in addition to the warning in the shared library, as leaving it visible breaks build processes for other packages that check that there aren't any messages being emitted to stderr (or otherwise care about the exact output from build tools that rely on the system Python 3 runtime). However, when it comes to choosing the upstream config defaults, it's important to keep in mind that one of the explicit goals of the PEP is to modify PEP 11 to *formally drop upstream support* for running Python 3 in the legacy C locale without using PEP 538, PEP 540 or a combination of the two to assume UTF-8 instead of ASCII for system interfaces. It's not that you *can't* run Python 3 in that kind of environment, and it's not that there are never any valid reasons to do so. It's that lots of things that you'd typically expect to work are going to misbehave (one I discovered myself yesterday is that the GNU readline problems reported in interactive mode on Android also show up when you do either "LANG=C python2" or "LANG=C python3" on traditional Linux and attempt to *edit* lines containing multi-byte characters), so you really need to know what you're doing in order to operate under those constraints. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Mar 14, 2017, at 10:17, Nick Coghlan wrote:
It occurs to me that (at least for readline... and maybe also as a general proxy for whether the rest should be done) detecting the IUTF8 terminal flag (which, properly, controls basic non-readline-based line editing such as backspace) may be worthwhile. (And maybe Readline itself should be doing this, more or less independent of Python. But that's a discussion for elsewhere)

On 15 March 2017 at 00:17, Nick Coghlan <ncoghlan@gmail.com> wrote:
The build processes that broke due to the warning were judged to be a bug in autoconf rather than a problem with the warning itself: http://git.savannah.gnu.org/gitweb/?p=autoconf-archive.git;a=commit;h=883a2a... So we're going to leave this as it is in the PEP for now (i.e. the locale coercion warning always happens unless you preconfigure a locale other than C), but keep an eye on it to see if it causes any other problems. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

There was a bunch of discussion about all this a while back, in which I think these points were addressed: However, in some cases the C locale is a normal environment for system
services, cron scripts, distro package builds and whatnot.
Indeed it is. But: if you run a Python (or any) program that is expecting an ASCII-only locale, then it will work just fine with any ASCII-compatible locale -- so no problem there. On the other hand, if you run a program that is expecting a Unicode-aware locale, then it might barf unexpectedly if run on an ASCII-only locale. A lot of people do in fact have these issues (which are due to misconfiguration of the host system, which is indeed not properly Python's problem). So if we do all this, then: A) misconfigured systems will magically work (sometimes). This is a Good Thing. and B) If someone runs a Python program that is expecting Unicode support on a properly configured ASCII-only system, then it will mostly "just work" -- after all, a lot of C APIs are simply char*, who cares what the encoding is? It would not, however, fail when a non-ASCII value is used somewhere it shouldn't be. So the question is -- is anyone counting on errors in this case? i.e., is a sysadmin thinking: "I want an ASCII-only system, so I'll set the locale, and now I can expect any program running on this system that is not ASCII compatible to fail." I honestly don't know if this is common -- but I would argue that trying to run a Unicode-aware program on an ASCII-only system could be considered a misconfiguration as well. Also -- many programs will just be writing bytes to the system without checking encoding anyway. So this would simply let Python 3 programs behave like most others... -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 15 March 2017 at 06:22, Chris Barker <chris.barker@noaa.gov> wrote:
the assumed default, rather than "C". Even glibc itself would quite like to get to a point where you only get the C locale if you explicitly ask for it: https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 The main practical objection that comes up in relation to "UTF-8 everywhere" isn't to do with UTF-8 per se, but rather with the size of the collation tables needed to do "proper" sorting of Unicode code points. However, there's a neat hack in the design of UTF-8 where sorting the encoded bytes by byte value is equivalent to sorting the decoded text by the Unicode code point values, which means that "LC_COLLATE=C" sorting by byte value, and "LC_COLLATE=C.UTF-8" sorting by "Unicode code point value" give the same results. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
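The sorting property mentioned above is easy to check from Python, where ``str`` comparison is already by code point. The sample words below are arbitrary and chosen only to cover one-, two-, three- and four-byte UTF-8 sequences.

```python
words = ["zebra", "Ångström", "ℙƴ☂ℌøἤ", "apple", "日本語", "𝒫ython", "éclair"]

# Python compares str objects code point by code point
by_code_point = sorted(words)

# Comparing the UTF-8 encoded forms compares raw byte values,
# which is what LC_COLLATE=C does for UTF-8 data
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print("same order:", by_code_point == by_utf8_bytes)
for word in by_utf8_bytes:
    print(word)
```

This only shows that the two orderings agree; neither is a linguistically "proper" collation, which is exactly why the large collation tables are not needed for it.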

On Mar 15, 2017, at 12:29 PM, Nick Coghlan wrote:
I think it's still the case that some isolation environments (e.g. Debian chroots) default to bare C locales. Often it doesn't matter, but sometimes tests or other applications run inside those environments will fail in ways they don't in a normal execution environment. The answer is almost always to explicitly coerce those environments to C.UTF-8 for Linuxes that support that. -Barry

On 16 March 2017 at 00:30, Barry Warsaw <barry@python.org> wrote:
Yeah, I think mock (the Fedora/RHEL/CentOS build environment for RPMs) still defaults to a bare C locale, and Docker environments usually aren't systemd-managed in the first place (since PID 1 inside a container typically isn't an init system at all). The general trend for all of those seems to be "they don't use C.UTF-8... yet", though (even though some of them may not shift until the default changes at the level of the given distro's libc implementation). The answer is almost always to
explicitly coerce those environments to C.UTF-8 for Linuxes that support that.
I also double checked that "LANG=C ./python -m test" still worked with the reference implementation. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Nick and all core devs who are interested in this PEP. I'm reviewing PEP 538 and I want to accept it this month. It will relieve much of the UnicodeError pain that server-side ops people face. Thank you Nick for working on this PEP. If anything about this PEP worries you, please post a comment soon. If you don't have enough time to read the entire PEP, feel free to ask a question about what worries you. Here are my comments:
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html says:
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
I don't know about .NET runtime on Unix much. (mono and .NET Core). "Go, Node.js and Rust" seems enough examples.
"locale warning" means the warning printed when the C locale is used, am I right? As I understand it, the "locale warning" is shown in these cases (all cases assume the C locale, with PYTHONUTF8 not enabled):
a. The C locale is used and locale coercion is disabled by the ``--without-c-locale-coercion`` configure option.
b. Locale coercion fails because none of the C.UTF-8, C.utf8 or UTF-8 locales is available.
c. Python is embedded; locale coercion can't be used in this case.
In case (b), while the warning about the C locale is not shown, the warning about coercion is still shown. So when people don't want to see a warning under the C locale and none of the (C.UTF-8, C.utf8, UTF-8) locales is available, there are three ways:
* Set PYTHONUTF8=1 (if PEP 540 is accepted).
* Set PYTHONCOERCECLOCALE=0.
* Use both the ``--without-c-locale-coercion`` and ``--without-c-locale-warning`` configure options.
Is my understanding right? BTW, I would prefer PEP 540 to provide a ``--with-utf8mode`` option which enables UTF-8 mode by default. If it is added, there are too few use cases left for ``--without-c-locale-warning``. There are some use cases where people want to use UTF-8 by default system wide (e.g. containers, web servers on CentOS, etc.). On the other hand, most C locale usage is on a "per application" basis rather than "system wide", and a configure option is not suitable for such a per-application setting, of course. But I'm not proposing to remove the option from PEP 538. We can discuss reducing the configure options later.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, Windows) these preprocessor variables would always be undefined.
Why do ``--with[out]-c-locale-coercion`` have no effect on macOS, iOS and Android? On Android, locale coercion fixes readline. Do you mean that locale coercion always happens regardless of this configuration option? On macOS, ``LC_ALL=C python`` doesn't make Python's stdio to ``ascii:surrogateescape``? Even so, locale coercion may fix libraries like readline and curses. While the C locale is less common on macOS, I don't see any reason to disable coercion on macOS. I know almost nothing about iOS, but I expect it to be similar to Android or macOS.
Improving the handling of the C locale --------------------------------------
...
The JVM and .NET examples are misleading again. They just use UTF-16-LE for syscalls on Windows, like Python does. I don't know much about them, but I believe they don't use UTF-16 as the system encoding on Linux.
I agree that this PEP shouldn't break byte-transparent behavior in the C locale by coercing. But I feel that the behavior difference between a coerced C.UTF-8 locale and an ordinary C.UTF-8 locale can be a pitfall. I read the following part of the section and I agree that there is no way to solve every issue. But how about using the surrogateescape handler in C.* locales, as in the C locale? It solves Python 3.7 subprocess under Python 3.7 with coerced C.UTF-8 locale at least. Anyway, I think https://bugs.python.org/issue15216 should be fixed in Python 3.7 too. Python applications which require byte-transparent stdio can use `set_encoding(errors="surrogateescape")` explicitly. Regards,

On 4 May 2017 at 12:24, INADA Naoki <songofacandy@gmail.com> wrote:
I'll push an update to drop the JVM and .NET from the list of examples.
Yes, that sounds right.
Yeah, in addition to Barry requesting such an option in one of the earlier linux-sig reviews, my main rationale for including it is that providing both config options offers a quick compatibility fix for any distro where emitting the coercion and/or C locale warning on stderr causes problems. The only one of those that Fedora encountered in the F26 alpha was deemed a bug in the affected application (something in autotools was checking for "no output on stderr" instead of "subprocess exit code is 0", and the fix was to switch it to check the subprocess exit code), but there are enough Linux distros and BSD variants out there that I'm a lot more comfortable shipping the change with straightforward "off" switches for the stderr output.
But I don't propose removing the option from PEP 538. We can discuss about reducing configure options later.
+1.
On these three, we know the system encoding is UTF-8, so we never interpreted the C locale as meaning "ascii" in the first place.
Right, the change for Android is that we switch to calling 'setlocale(LC_ALL, "C.UTF-8")' during interpreter startup instead of 'setlocale(LC_ALL, "")'. That change is guarded by "#ifdef __ANDROID__", rather than either of the new conditionals.
On macOS, ``LC_ALL=C python`` doesn't make Python's stdio to ``ascii:surrogateescape``?
Similar to Android, CPython itself is hardcoded to assume UTF-8 on Mac OS X, since that's a platform API guarantee that users can't change.
My understanding is that other libraries and applications also automatically use UTF-8 for system interfaces on Mac OS X and iOS. It could be that that understanding is wrong, and locale coercion would provide a benefit there as well. (Checking the draft implementation, it turns out I haven't actually implemented the configure logic to make those config settings platform dependent yet - they're currently only undefined on Windows by default, since that doesn't use the autotools based build system)
Sorry, this was ambiguous - it's meant to refer to applications calling in to the JVM or CLR app runtime, not to the JVM or CLR calling out to the host operating system. I'll try to make it clearer in the next update.
That would be entirely possible, as the code responsible for that adjustment is the lines:

    char *loc = setlocale(LC_CTYPE, NULL);
    if (loc != NULL && strcmp(loc, "C") == 0)
        errors = "surrogateescape";

Changing that to include "C.UTF-8" as a second locale that also implies the use of `surrogateescape` would be low risk, and means we wouldn't need to call Py_SetStandardStreamEncoding. As a result, non UTF-8 data (such as latin-1 or GB-18030) would automatically round-trip, regardless of whether C.UTF-8 was explicitly set as the locale, or reached as the result of locale coercion.
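As a rough Python-level illustration of the round-tripping described above (not the CPython implementation, which lives in C), the sketch below picks an error handler from the LC_CTYPE locale name and shows latin-1 bytes surviving a decode/encode cycle. The function name `default_stream_errors` and the `SURROGATEESCAPE_LOCALES` set are hypothetical names chosen for this example only.

```python
# Hypothetical helper mirroring the C check quoted above; the set of locale
# names is an assumption based on this message ("C" plus "C.UTF-8").
SURROGATEESCAPE_LOCALES = {"C", "C.UTF-8"}


def default_stream_errors(ctype_locale):
    """Return the error handler a standard stream would default to."""
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"


# latin-1 encoded text is not valid UTF-8 ...
latin1_bytes = "café".encode("latin-1")

# ... but with surrogateescape the undecodable byte is smuggled through as a
# lone surrogate and restored unchanged when the text is encoded again.
errors = default_stream_errors("C.UTF-8")
text = latin1_bytes.decode("utf-8", errors=errors)
round_tripped = text.encode("utf-8", errors=errors)
```

With the "strict" handler the same decode call would raise UnicodeDecodeError instead of round-tripping.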
It solves Python 3.7 subprocess under Python 3.7 with coerced C.UTF-8 locale at least.
It will also extend host/container encoding mismatch compatibility to containers that explicitly set the C.UTF-8 locale. That makes me more confident in making that change, as it would be rather counterproductive if our changes gave base image developers an incentive *not* to set C.UTF-8 as their default locale.
Agreed. Cheers, Nick. P.S. I've pushed the JVM/CLR related clarifications, but the standard stream changes will require a bit more thought and corresponding updates to the reference implementation - I'll aim to get to that this weekend. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

I tried Python 3.6 on macOS 10.11 El Capitan.

    $ LANG=C python3 -c 'import locale; print(locale.getpreferredencoding())'
    US-ASCII

And the interactive shell (which uses readline by default) doesn't accept non-ASCII input anymore. https://www.dropbox.com/s/otshuzhnw7a71n5/macos-c-locale-readline.gif?dl=0

I think many of the problems with the C locale are the same on macOS too. So I don't think any special casing is required on macOS. Regards,

On Thu, 4 May 2017 11:24:27 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
From my POV, it is problematic that the behaviour outlined in PEP 538 (see Abstract section) varies depending on the adoption of another PEP (PEP 540). If we want to adopt PEP 538 before pronouncing on PEP 540, then PEP 538 should remove all points conditional on PEP 540 adoption, and PEP 540 should later be changed to adopt those removed points as PEP 540-specific changes. Regards Antoine.

On 5 May 2017 at 02:25, Antoine Pitrou <solipsis@pitrou.net> wrote:
While I won't be certain until I update the PEP and reference implementation, I'm pretty sure Inada-san's suggestion to replace the call to Py_SetStandardStreamEncoding with defaulting to surrogateescape on the standard streams in the C.UTF-8 locale will remove this current dependency between the PEPs as well as making the "C.UTF-8 locale" and "C locale coerced to C.UTF-8" behaviour indistinguishable at runtime (aside from the stderr warning in the latter case). It will then be up to Victor to state in PEP 540 how locale coercion will interact with Python UTF-8 mode (with my recommendation being the one currently in PEP 538: it should implicitly set the environment variable, so the mode activation is inherited by subprocesses) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, May 4, 2017 at 6:25 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
This is kind of an aside, but regardless of the dependency relationship between PEP 538 and 540, given that they kind of go hand-in-hand would it make sense to rename them--e.g. have PEP 539 and PEP 540 trade places, since PEP 539 has nothing to do with this and is awkwardly nestled between them. Or would that only confuse matters at this point? Thanks, Erik

On 5 May 2017 at 19:45, Erik Bray <erik.m.bray@gmail.com> wrote:
While we have renumbered PEPs in the past, it was only in cases where the PEPs were relatively new, so there weren't many discussions referencing them under their existing numbers. In this case, both PEP 539 and 540 have already been discussed extensively, so renumbering them would cause problems without providing any corresponding benefit (Python's development is sufficiently high volume that it isn't unusual for related PEPs to have non-sequential PEP numbers) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 5 May 2017 at 23:21, INADA Naoki <songofacandy@gmail.com> wrote:
Don't forget that Victor's still working on the design of PEP 540, so it isn't ready for pronouncement yet. Antoine's request was for me to update PEP *538* to eliminate the "this will need to change if PEP 540 is accepted" aspects, and I think your suggestion to make the "C.UTF-8 -> surrogateescape on standard streams by default" behaviour independent of the locale coercion will achieve that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Nick. After thinking about the relationship between PEP 538 and 540 for two days, I came up with an idea which removes locale coercion by default from PEP 538: it just enables UTF-8 mode and shows a warning about the C locale. Of course, this idea is based on PEP 540. There is no "if PEP 540 is rejected" case. What do you think? If it makes sense, I want to postpone PEP 538 until PEP 540 is accepted or rejected, or merge PEP 538 into PEP 540.

## Background

Locale coercion in the current PEP 538 has some downsides:

* If the user sets `LANG=C LC_DATE=ja_JP.UTF-8`, locale coercion may override LC_DATE.
* It makes behavior divergence between standalone and embedded Python.
* The parent Python process may use utf-8:surrogateescape, but a child Python process may use utf-8:strict. (Python 3.6 uses ascii:surrogateescape in both parent and children.)

On the other hand, the benefits of locale coercion are restricted:

* When locale coercion succeeds, the warning is always shown. To hide the warning, the user must disable coercion in some way (e.g. use a UTF-8 locale explicitly, or set PYTHONCOERCECLOCALE=0).

So I feel benefit / complexity ratio of locale coercion is less than UTF-8 mode.

But locale coercion works nicely on Android. And there are some Android-like Unix systems (containers or small devices) where C.UTF-8 is always the proper locale.

## Rough spec

* Android-style locale coercion (forced, no warning) becomes a build option. Some users who build Python for containers or small devices may like it.
* A normal Python build doesn't change the locale. When the python executable is run in the C locale, it shows the locale warning. The locale warning can be disabled as in the current PEP 538.
* The user can disable automatic UTF-8 mode by setting the PYTHONUTF8=0 environment variable. The user can also hide the warning by setting PYTHONUTF8=1.

On Fri, May 5, 2017 at 10:21 PM, INADA Naoki <songofacandy@gmail.com> wrote:

On 7 May 2017 at 15:22, INADA Naoki <songofacandy@gmail.com> wrote:
The main problems I see with this approach are:

1. There's no way to configure earlier Python versions to emulate PEP 540. It's a completely new mode of operation.
2. PEP 540 isn't actually defined yet (Victor is still working on it)
3. Due to 1&2, PEP 540 isn't something 3.6 redistributors can experiment with backporting to a narrower target audience

By contrast, you can emulate PEP 538 all the way back to Python 3.1 by setting the following environment variables:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

(assuming your platform provides a C.UTF-8 locale and you don't need to run any Python 2.x components in that same environment)

I think the specific concerns you raise below are valid though, and I'd be happy to amend PEP 538 to address them all.
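For anyone wanting to try the environment-variable emulation from a parent process, a minimal sketch might look like the following. The helper names `pep538_like_env` and `run_child` are made up for this example, and it assumes the platform really does provide a C.UTF-8 locale.

```python
import os
import subprocess
import sys


def pep538_like_env(base=None):
    """Return a copy of the environment with the three emulation settings."""
    env = dict(os.environ if base is None else base)
    env.update({
        "LC_ALL": "C.UTF-8",
        "LANG": "C.UTF-8",
        "PYTHONIOENCODING": "utf-8:surrogateescape",
    })
    return env


def run_child(code, python=sys.executable):
    """Run a snippet in a child interpreter using the emulated settings."""
    return subprocess.run(
        [python, "-c", code],
        env=pep538_like_env(),
        capture_output=True,  # requires Python 3.7+ in the parent process
    )


if __name__ == "__main__":
    result = run_child("import sys; print(sys.stdout.encoding, sys.stdout.errors)")
    print(result.stdout.decode("utf-8", errors="replace"))
```

The `python` argument can point at an older 3.x interpreter, which is the case the emulation is aimed at.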
The fact it sets "LC_ALL" has previously been raised as a concern with PEP 538, so it probably makes sense to drop that aspect and just override "LANG". The scenarios where it makes a difference are incredibly obscure (involving non-default SSH locale forwarding settings for folks using SSH on Mac OS X to connect to remote Linux systems), while just setting "LANG" will be sufficient to address the "LANG=C" case that is the main driver for the PEP. That means in the case above, the specific LC_DATE setting would still take precedence.
* It makes behavior divergence between standalone and embedded Python.
Such divergence already exists, only in the other direction: embedding applications may override the runtime's default settings, either by setting a particular locale, or by using Py_SetStandardStreamEncoding (which was added specifically to make it easy for Blender to force the use of UTF-8 on the embedded Python's standard streams, regardless of the current locale). That said, this is also the rationale for my suggestion that we expose locale coercion as a public API:

    if (Py_LegacyLocaleDetected()) {
        Py_CoerceLegacyLocale();
    }

That would make it straightforward for any embedding application that wanted to do so to replicate the behaviour of the standard CLI. The level of divergence is also mitigated by the point in the next section.
This discrepancy is gone now thanks to your suggestion of making "surrogateescape" the default standard stream handler when one of the coercion target locales is explicitly configured - both parent processes and child processes end up with "utf-8:surrogateescape" configured on the standard streams.
The current warning is based on what we think is appropriate for Fedora downstream, but that doesn't necessarily mean it's the right approach for Python upstream, especially if the LC_ALL override is dropped. We could also opt for a model where Python 3.7 emits the coercion warning, but Python 3.8 just does the coercion silently (that rationale would then also apply to PEP 540 - we'd warn on stderr about the change in default behaviour in 3.7, but take the new behaviour for granted in 3.8). The change to make the standard stream error handler setting depend solely on the currently configured locale also helps here, since it means it doesn't matter how a process reached the state of having the locale set to "C.UTF-8". CPython will behave the same way regardless, so it makes it less important to provide an explicit notice that coercion took place.
So I feel benefit / complexity ratio of locale coercion is less than UTF-8 mode.
It isn't an either/or though - we're entirely free to do both, one based solely on the existing configuration options that have been around since 3.1, and the other going beyond those to also adjust the default behaviour of other interfaces (like "open()").
But do we *want* to support the legacy C locale in 3.7+? I don't think we do, because it will never work properly for our purposes as long as it assumes ASCII as the default text encoding. Part of the motivation for making locale coercion the default is so we can update PEP 11 to make it clear that running in the legacy C locale is no longer an officially supported configuration.
That still pushes the problem back on end users to fix, though, rather than just automatically making things like GNU readline integration work.
I think I need to better explain in the PEP why PEP 540's UTF-8 mode on its own won't be enough, as it doesn't necessarily handle locale-aware extension modules like GNU readline (this came up in the draft PR review, but I never added anything specifically to the PEP about it), and also doesn't help at all with invocation of older 3.x releases in a subprocess.

Here's an interactive session from a PEP 538 enabled CPython, where each line after the first is executed by doing "up-arrow, 4xleft-arrow, delete, enter":

    $ LANG=C ./python
    Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behavior).
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Not exactly exciting, but this is what currently happens on an older release if you only change the Python level stream encoding settings without updating the locale settings:

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
      File "<stdin>", line 0
        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21: invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not* CPython - because the editing wasn't UTF-8 aware, it corrupted the history buffer and fed such nonsense to stdin that even the surrogateescape error handler was bypassed. While PEP 540's UTF-8 mode could technically be updated to also reconfigure readline, that's *one* extension module, and only when it's running directly as part of Python 3.7.
By contrast, using a more appropriate locale setting already gets readline to play nice, even when it's running inside Python 3.5:

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Don't get me wrong, I'm definitely a fan of PEP 540, as it extends much of what PEP 538 covers beyond the standard streams and also applies it to other operating system interfaces without relying on the underlying operating system to provide a UTF-8 based locale. However, I also expect it to be plagued by extension module compatibility issues if folks attempt to use it standalone, without locale coercion to reconfigure the behaviour of extension modules appropriately.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
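The readline misbehaviour described in this message comes down to editing a UTF-8 buffer one byte at a time. The sketch below is only an illustration of that failure mode in pure Python, not a reproduction of what GNU readline does internally; the helper names are invented for the example.

```python
def delete_byte(data, index):
    """Byte-oriented delete, as an editor unaware of UTF-8 would do it."""
    return data[:index] + data[index + 1:]


def delete_char(text, index):
    """Character-oriented delete, as a UTF-8 aware editor would do it."""
    return text[:index] + text[index + 1:]


def is_valid_utf8(data):
    """Report whether the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


line = 'print("ℙƴ☂ℌøἤ")'
encoded = line.encode("utf-8")

# "ø" occupies two bytes in UTF-8. Removing only its first byte leaves a
# stray continuation byte behind, so the buffer is no longer valid UTF-8.
byte_pos = encoded.index("ø".encode("utf-8"))
broken = delete_byte(encoded, byte_pos)

# Removing the whole character keeps the buffer well formed.
fixed = delete_char(line, line.index("ø"))
```

Here `is_valid_utf8(broken)` is false while `fixed` is the shortened line seen in the working sessions above.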

On 8 May 2017 at 15:34, Nick Coghlan <ncoghlan@gmail.com> wrote:
It occurs to me we can even still handle the forwarded "LC_CTYPE=UTF-8" case by changing the locale coercion to set LC_CTYPE & LANG, rather than just setting LANG as I suggested above. That way `LANG=C LC_DATE=ja_JP.UTF-8` would still respect the explicit LC_DATE setting, `LC_CTYPE=C` would be handled the same way as `LANG=C`, and LC_ALL=C would continue to provide a way to force the C locale even for LC_CTYPE without needing to be aware of the Python specific PYTHONCOERCECLOCALE setting. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
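To make the precedence rules in this message concrete, here is a small model of the decision written against a plain dictionary of environment variables. It is a sketch of the behaviour as described in this email, not the reference implementation: the function name `coerce_legacy_locale` is hypothetical, treating "POSIX" as equivalent to "C" is an assumption, and it follows the message literally by setting both LC_CTYPE and LANG.

```python
def coerce_legacy_locale(environ, target="C.UTF-8"):
    """Return a copy of environ with the coercion from this message applied."""
    env = dict(environ)

    # The Python-specific opt-out described elsewhere in the thread.
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return env

    # LC_ALL overrides every category, so an explicit LC_ALL (including
    # LC_ALL=C) is left alone and no coercion takes place.
    if env.get("LC_ALL"):
        return env

    # Without LC_ALL, LC_CTYPE takes precedence over LANG; if neither is
    # set, the process runs in the C locale.
    effective_ctype = env.get("LC_CTYPE") or env.get("LANG") or "C"
    if effective_ctype in ("C", "POSIX"):  # "POSIX" handling is an assumption
        env["LC_CTYPE"] = target
        env["LANG"] = target
    return env


coerced = coerce_legacy_locale({"LANG": "C", "LC_DATE": "ja_JP.UTF-8"})
forced_c = coerce_legacy_locale({"LC_ALL": "C", "LANG": "C"})
```

In `coerced` the explicit LC_DATE entry is untouched, while `forced_c` comes back unchanged because LC_ALL was set.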

On 9 May 2017 at 13:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've posted an updated reference implementation that works this way, and it turned out to have some rather nice benefits: not only did it make the handling of full locales (C.UTF-8, C.utf8) and partial locales (UTF-8) more consistent (allowing for a net deletion of code), it also meant I no longer needed a custom test case in _testembed to check the locale warning. Instead, the affected test cases now just set "LC_ALL" as a locale override that switches off CPython's locale coercion without also switching off the locale warning. Code changes: https://github.com/ncoghlan/cpython/commit/476a78133c94d82e19b89f50036cecd9b... Rather than posting the PEP updates here though, I'll start a new thread that explains what has changed since my initial posting to python-dev back in March. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Mar 4, 2017 at 11:50 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It feels like having a short section on the caveats of this approach would help to introduce this section. Something that says that this PEP can cause a split in how Python behaves in non-standalone applications (mod_wsgi, IDEs where libpython is compiled in, etc) vs standalone (unless the embedders take similar steps as standalone python is doing). Then go on to state that this approach was still chosen as coercing in Py_Initialize is too late, causing the inconsistencies and problems listed here. -Toshio

On 5 March 2017 at 17:50, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've just pushed a significant update to the PEP based on the discussions in this thread: https://github.com/python/peps/commit/2fb53e7c1bbb04e1321bca11cc0112aec69f63... The main change at the technical level is to modify the handling of the coercion target locales such that they *always* lead to "surrogateescape" being used by default on the standard streams. That means we don't need to call "Py_SetStandardStreamEncoding" during startup, that subprocesses will behave the same way as their parent processes, and that Python in Linux containers will behave consistently regardless of whether the container locale is set to "C.UTF-8" explicitly, or is set to "C" and then coerced to "C.UTF-8" by CPython. That change also eliminated the behaviour that was contingent on whether or not PEP 540 was accepted - PEP 540 may still want to have the coercion target locales imply full UTF-8 mode rather than just setting the stream error handler differently, but that will be a question to be considered when reviewing PEP 540 rather than needing to worry about it now. The second technical change is that the locale coercion and warning are now enabled on Android and Mac OS X. For Android, that's a matter of getting GNU readline to behave sensibly, while for Mac OS X, it's a matter of simplifying the implementation and improving cross-platform behavioural consistency (even though we don't expect the coercion to actually have much impact there). Beyond that, the PEP update focuses on clarifying a few other points without actually changing the proposal. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 6 May 2017 at 18:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
Working on the revised implementation for this, I've ended up refactoring it so that all the heavy lifting is done by a single function exported from the shared library: "_Py_CoerceLegacyLocale()". The CLI code then just contains the check that says "Are we running in the legacy C locale? If so, call _Py_CoerceLegacyLocale()", with all the details of how the coercion actually works being hidden away inside pylifecycle.c. That seems like a potential opportunity to make the 3.7 version of this a public API, using the following pattern: if (Py_LegacyLocaleDetected()) { Py_CoerceLegacyLocale(); } That way applications embedding CPython that wanted to implement the same locale coercion logic would have an easy way to do so. Thoughts? Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 6 May 2017 at 18:33, Nick Coghlan <ncoghlan@gmail.com> wrote:
OK, the reference implementation has been updated to match the latest version of the PEP: https://github.com/ncoghlan/cpython/commit/188e7807b6d9e49377aacbb287c074e5c...

For now, the implementation in the standalone CLI looks like this:

    /* [snip] */
    extern int _Py_LegacyLocaleDetected(void);
    extern void _Py_CoerceLegacyLocale(void);
    /* [snip] */
    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

If we decide to make this a public API for 3.7, the necessary changes would be:

- remove the leading underscore from the function names
- add the function prototypes to the pylifecycle.h header
- add the APIs to the C API documentation in the configuration & initialization section
- define the APIs in the PEP
- adjust the backport note in the PEP to say that backports should NOT expose the public C API, but keep it private

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

LGTM and I love this PEP and PEP 540. Some comments: ...
I prefer just "locale-aware" / "locale-independent" (application | library | function) to "locale-aware C/C++ application" / "C/C++ independent" here. Both Rust and Node.JS are linked with libc, and Node.JS (v8) is written in C++. They just demonstrate that many people prefer "always UTF-8" to "LC_CTYPE aware encoding" in real world applications. And C/C++ can be used for both locale-aware and locale-independent applications. I can print "こんにちは、世界" in the C locale, because stdio is byte transparent. There are many locale-independent libraries written in C (zlib, libjpeg, etc.), and some functions in libc are locale-independent or LC_CTYPE independent (printf is locale-aware, but it uses LC_NUMERIC, not LC_CTYPE). ...
If it's really encouraged, how about providing the patch officially, or backporting it in 3.6.2 but disabled by default? Some Python users (including my company) use pyenv or pythonz to build Python from source. This PEP and PEP 540 are important for them too.

On 6 March 2017 at 00:39, INADA Naoki <songofacandy@gmail.com> wrote:
Good point, I'll fix that in the next update.
For PEP 540, the changes are too intrusive to consider it a reasonable candidate for backporting to an earlier feature release, so for that aspect, we'll *all* be waiting for 3.7. For this PEP, while it's deliberately unobtrusive to make it more backporting friendly, 3.7 isn't *that* far away, and I didn't think to seriously pursue this approach until well after the 3.6 beta deadline for new features had passed. With it being clearly outside the normal bounds of what's appropriate for a cross-platform maintenance release, that means the only folks that can consider it for earlier releases are those building their own binaries for more constrained target environments. I can definitely make sure the patch is readily available for anyone that wants to apply it to their own builds, though (I'll upload it to both the Python tracker issue and the downstream Fedora Bugzilla entry). I also wouldn't completely close the door on the idea of classifying the change as a bug fix in CPython's handling of the C locale (and hence adding it to a later 3.6.x maintenance release), but I think the time to pursue that would be *after* we've had a chance to see how folks react to the redistributor customizations. I *think* it will be universally positive (because the status quo really is broken), but it also wouldn't be the first time I've learned something new and confusing about the locale subsystem only after releasing software that relied on an incorrect assumption about it :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 5 March 2017 at 17:50, Nick Coghlan <ncoghlan@gmail.com> wrote:
In terms of resolving this PEP, if Guido doesn't feel inclined to wade into the intricacies of legacy C locale handling, Barry has indicated he'd be happy to act as BDFL-Delegate :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 9 March 2017 at 07:58, Guido van Rossum <guido@python.org> wrote:
OK, I've added Barry to the PEP as BDFL-Delegate: https://github.com/python/peps/commit/4c46c5710031cac03a8d1ab7639272957998a1... Thanks for the quick response! Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

This is a very bad idea. It seems to be based on an assumption that the C locale is always some kind of pathology. Admittedly, it sometimes is a result of misconfiguration or a mistake. (But I don't see why it's the interpreter's job to correct such mistakes.) However, in some cases the C locale is a normal environment for system services, cron scripts, distro package builds and whatnot. It's possible to write Python programs that are locale-agnostic. It's also possible to write programs that are locale-dependent, but handle ASCII as locale encoding gracefully. Or you might want to write a program that intentionally aborts with an explanatory error message when the locale encoding doesn't have sufficient Unicode coverage. ("Errors should never pass silently" anyone?) With this proposal, none of the above seems possible to correctly implement in Python. * Nick Coghlan <ncoghlan@gmail.com>, 2017-03-05, 17:50:
Setting LANGUAGE=en might be better, because it doesn't affect locale encoding either, and it works even when LC_ALL is set.
Calling the C locale "legacy" is a bit unfair, when there isn't even agreement on what the name of the successor is supposed to be... NB, both "C.UTF-8" and "C.utf8" work on Fedora, thanks to glibc normalizing the encoding part. Only "C.UTF-8" works on Debian, though, for whatever reason.
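Since the set of usable names differs between platforms, one way to check a given system is to try each candidate with `locale.setlocale` and see which ones are accepted. This is a quick probing sketch, not part of the PEP; the helper name `available_targets` is made up here, and the candidate list is the set of coercion targets named in this thread.

```python
import locale

# Coercion target locales named in the thread.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def available_targets(candidates=COERCION_TARGETS):
    """Return the candidates that setlocale(LC_CTYPE, ...) accepts here."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    found = []
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this name is not provided by the platform
            found.append(name)
    finally:
        # Put the process back the way it was before probing.
        locale.setlocale(locale.LC_CTYPE, saved)
    return found


if __name__ == "__main__":
    print(available_targets())
```

The result depends entirely on the C library and installed locale data, so no particular output should be expected.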
Sounds wrong. This will override all LC_*, even if they were originally set to something different than C.
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
Comma splice. s/set/was set/ would probably make it clearer.
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
Ditto.
Note that at least OpenBSD supports both "C.UTF-8" and "UTF-8" locales.
While this PEP ensures that developers that need to do so can still opt-in to running their Python code in the legacy C locale,
Yeah, no, it doesn't. It's impossible to disable coercion from Python code, because it happens too early. The best you can do is to write a wrapper script in a different language that sets PYTHONCOERCECLOCALE=0; but then you still get a spurious warning. -- Jakub Wilk

On 12 March 2017 at 08:36, Jakub Wilk <jwilk@jwilk.net> wrote:
An environment in which Python 3's eager decoding of operating system provided values to Unicode fails.
It's possible to write Python programs that are locale-agnostic.
If a program is genuinely locale-agnostic, it will be unaffected by this PEP.
It's also possible to write programs that are locale-dependent, but handle ASCII as locale encoding gracefully.
No, it is not generally feasible to write such programs in Python 3. That's the essence of the problem, and why the PEP deprecates support for the legacy C locale in Python 3.
This is what click does, but it only does it because it isn't possible for click to do the right thing given Python 3's eager decoding of various values as ASCII.
With this proposal, none of the above seems possible to correctly implement in Python.
The first case remains unchanged, the other two will need to use Python 2.7 or Tauthon. I'm fine with that.
It's not a spurious warning, as Python 3's Unicode handling for environmental interactions genuinely doesn't work properly in the legacy C locale (unless you're genuinely promising to only ever feed it ASCII values, but that isn't a realistic guarantee to make). However, I'm also open to having that particular setting also disable the runtime warning from the shared library. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12 March 2017 at 22:57, Nick Coghlan <ncoghlan@gmail.com> wrote:
However, I'm also open to having [PYTHONCOERCECLOCALE=0] also disable the runtime warning from the shared library.
Considering this a little further, I think this is going to be necessary in order to sensibly handle the build time "--with[out]-c-locale-warning" flag in the test suite. Currently, there are a number of tests beyond the new ones in Lib/test/test_locale_coercion.py that would need to know whether or not to expect to see a warning in subprocesses in order to correctly handle the "--without-c-locale-warning" case: https://github.com/ncoghlan/cpython/commit/78c17a7cea04aed7cd1fce8ae5afb085a... If PYTHONCOERCECLOCALE=0 turned off the runtime warning as well, then the behaviour of those tests would remain independent of the build flag as long as they set the new environment variable in the child process - the warning would be disabled either at build time via "--without-c-locale-warning" or at runtime with "PYTHONCOERCECLOCALE=0". The check for the runtime C locale warning would then be added to _testembed rather than going through a normal Python subprocess, and that test would be the only one that needed to know whether or not the locale warning had been disabled at build time (which we could indicate simply by compiling the embedding part of the test differently in that case). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

I think "C locale + use UTF-8 for stdio + fs" is a common setup, especially for servers. It's not a mistake or a misconfiguration. Perl, Ruby, Rust, Node.JS and Go can use UTF-8 without any pain in the C locale, and current Python is painful in such cases. So I'm a strong +1 for PEP 540 (UTF-8 mode). On the other hand, PEP 538 is for locale-dependent libraries (like curses) and subprocesses. I agree that the C locale is a misconfiguration if the user wants to use UTF-8 in locale-dependent libraries. And I agree that the current PEP 538 seems to carry it a bit too far. But locale coercion works nicely on platforms like Android. So how about a simplified version of PEP 538? Just add a configure option for locale coercion, disabled by default. No envvar options and no warnings. Regards,

On 13 March 2017 at 18:37, INADA Naoki <songofacandy@gmail.com> wrote:
That doesn't solve my original Linux distro problem, where locale misconfiguration problems show up as "Python 2 works, Python 3 doesn't work" behaviour and bug reports.

The problem is that where Python 2 was largely locale-independent by default (just passing raw bytes through) such that you'd only get immediate encoding or decoding errors if you had a Unicode literal or a decode() call somewhere in your code and would otherwise pass data corruption problems further down the chain, Python 3 is locale-*aware* by default, and eagerly decodes:

- command line parameters
- environment variables
- responses from operating system API calls
- standard stream input
- file contents

You *can* still write locale-independent Python 3 applications, but they involve sprinkling liberal doses of "b" prefixes and suffixes and mode settings and "surrogateescape" error handler declarations in various places - you can't just run python-modernize over a pre-existing Python 2 application and expect it to behave the same way in the C locale as it did before.

Once implemented, PEP 540 will partially solve the problem by introducing a locale independent UTF-8 mode, but that still leaves the inconsistency with other locale-aware components that are needing to deal with Python 3 API calls that accept or return Unicode objects where Python 2 allowed the use of 8-bit strings.

Folks that really want the old behaviour back will be able to set PYTHONCOERCECLOCALE=0 (as that no longer emits any warnings), or else build their own CPython from source using ``--without-c-locale-coercion`` and ``--without-c-locale-warning``. However, they'll also get the explicit support notification from PEP 11 that any Unicode handling bugs they run into in those configurations are entirely their own problem - we won't fix them, because we consider those configurations unsupportable in the general case. That puts the additional self-support burden on folks doing something unusual (i.e. insisting on running an ASCII-only environment in 2017), rather than on those with a more conventional use case (i.e. running an up to date \*nix OS using UTF-8 or another universal encoding for both local and remote interfaces).

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
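To make the "surrogateescape error handler declarations" point concrete, here is a minimal sketch of what a byte-transparent text pipeline looks like in Python 3. The `passthrough` helper is hypothetical and exists only for illustration; it assumes ASCII as the nominal encoding (as in the legacy C locale) and uses in-memory streams instead of the real standard streams.

```python
import io


def passthrough(raw_in: bytes) -> bytes:
    """Copy lines from a byte source to a byte sink through text wrappers.

    Hypothetical helper for illustration only. Both wrappers declare the
    "surrogateescape" error handler, so bytes that are not valid ASCII
    are smuggled through as lone surrogates and restored on output.
    """
    reader = io.TextIOWrapper(
        io.BytesIO(raw_in),
        encoding="ascii",
        errors="surrogateescape",
        newline="",  # do not translate line endings on input
    )
    sink = io.BytesIO()
    writer = io.TextIOWrapper(
        sink,
        encoding="ascii",
        errors="surrogateescape",
        newline="",  # do not translate line endings on output
    )
    for line in reader:
        writer.write(line)
    writer.flush()
    data = sink.getvalue()
    return data
```

Without the explicit `errors="surrogateescape"` arguments, the same input would raise `UnicodeDecodeError` on the first non-ASCII byte, which is the kind of failure that shows up as "Python 2 works, Python 3 doesn't work".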

On Mon, Mar 13, 2017 at 8:01 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Sorry, I meant "PEP 540 + Simplified PEP 538 (coercing by configure option)". Distros can enable the configure option, of course.
I feel the problems that PEP 538 solves but PEP 540 doesn't are relatively small compared with the complexity introduced by PEP 538. As I understand it, PEP 538 solves problems only when:

* The python executable is used. (GUI applications linking Python for plugins are not affected.)
* One of C.UTF-8, C.utf8 or UTF-8 is accepted for LC_CTYPE.
* The "locale aware components" use something other than ASCII or UTF-8 in the C locale, but use UTF-8 in a UTF-8 locale.

Can't we reduce the options from 3 (2 configure, 1 envvar) when PEP 540 is accepted too?

On Mon, Mar 13, 2017 at 10:31 PM, Random832 <random832@fastmail.com> wrote:
Yes. People who build Python understand the platform better than users do in most cases. For Android builds, they know coercion works well on Android. For Linux distros, they know whether or not the system supports locales like C.UTF-8, and whether there are any python-xxxx packages which may cause the problem that coercion solves. People who build Python themselves (in Docker, pyenv, etc.) know how they will use Python.

On 13 March 2017 at 23:31, Random832 <random832@fastmail.com> wrote:
Distro packagers have narrower user bases and a better known set of compatibility constraints than upstream, so kicking platform integration related config decisions downstream to us(/them) is actually a pretty reasonable thing for upstream to do :) For example, while I've been iterating on the reference implementation for 3.7, Charalampos Stratakis has been iterating on the backport patch for Fedora 26, and he's found that we really need the PEP's "disable the C locale warning" config option to turn off the CLI's coercion warning in addition to the warning in the shared library, as leaving it visible breaks build processes for other packages that check that there aren't any messages being emitted to stderr (or otherwise care about the exact output from build tools that rely on the system Python 3 runtime). However, when it comes to choosing the upstream config defaults, it's important to keep in mind that one of the explicit goals of the PEP is to modify PEP 11 to *formally drop upstream support* for running Python 3 in the legacy C locale without using PEP 538, PEP 540 or a combination of the two to assume UTF-8 instead of ASCII for system interfaces. It's not that you *can't* run Python 3 in that kind of environment, and it's not that there are never any valid reasons to do so. It's that lots of things that you'd typically expect to work are going to misbehave (one I discovered myself yesterday is that the GNU readline problems reported in interactive mode on Android also show up when you do either "LANG=C python2" or "LANG=C python3" on traditional Linux and attempt to *edit* lines containing multi-byte characters), so you really need to know what you're doing in order to operate under those constraints. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Mar 14, 2017, at 10:17, Nick Coghlan wrote:
It occurs to me that (at least for readline... and maybe also as a general proxy for whether the rest should be done) detecting the IUTF8 terminal flag (which, properly, controls basic non-readline-based line editing such as backspace) may be worthwhile. (And maybe Readline itself should be doing this, more or less independent of Python. But that's a discussion for elsewhere)

On 15 March 2017 at 00:17, Nick Coghlan <ncoghlan@gmail.com> wrote:
The build processes that broke due to the warning were judged to be a bug in autoconf rather than a problem with the warning itself: http://git.savannah.gnu.org/gitweb/?p=autoconf-archive.git;a=commit;h=883a2a... So we're going to leave this as it is in the PEP for now (i.e. the locale coercion warning always happens unless you preconfigure a locale other than C), but keep an eye on it to see if it causes any other problems. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

There was a bunch of discussion about all this a while back, in which I think these points were addressed: However, in some cases the C locale is a normal environment for system
services, cron scripts, distro package builds and whatnot.
Indeed it is. But: if you run a Python (or any) program that is expecting an ASCII-only locale, then it will work just fine with any ASCII-compatible locale -- so no problem there. On the other hand, if you run a program that is expecting a Unicode-aware locale, then it might barf unexpectedly if run on an ASCII-only locale. A lot of people do in fact have these issues (which are due to mis-configuration of the host system, which is indeed not properly Python's problem). So if we do all this, then:

A) Mis-configured systems will magically work (sometimes). This is a Good Thing.

and

B) If someone runs a Python program that is expecting Unicode support on a properly configured ASCII-only system, then it will mostly "just work" -- after all, a lot of C APIs are simply char*, who cares what the encoding is? It would not, however, fail when a non-ASCII value is used somewhere it shouldn't be.

So the question is -- is anyone counting on errors in this case? i.e., is a sysadmin thinking: "I want an ASCII-only system, so I'll set the locale, and now I can expect any program running on this system that is not ASCII-compatible to fail." I honestly don't know if this is common -- but I would argue that trying to run a Unicode-aware program on an ASCII-only system could be considered a mis-configuration as well. Also -- many programs will just be writing bytes to the system without checking encoding anyway. So this would simply let Python 3 programs behave like most others...

-CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 15 March 2017 at 06:22, Chris Barker <chris.barker@noaa.gov> wrote:
the assumed default, rather than "C". Even glibc itself would quite like to get to a point where you only get the C locale if you explicitly ask for it: https://sourceware.org/glibc/wiki/Proposals/C.UTF-8 The main practical objection that comes up in relation to "UTF-8 everywhere" isn't to do with UTF-8 per se, but rather with the size of the collation tables needed to do "proper" sorting of Unicode code points. However, there's a neat hack in the design of UTF-8 where sorting the encoded bytes by byte value is equivalent to sorting the decoded text by the Unicode code point values, which means that "LC_COLLATE=C" sorting by byte value, and "LC_COLLATE=C.UTF-8" sorting by "Unicode code point value" give the same results. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
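The collation point above is easy to check from Python, since `str` comparison is by code point and `bytes` comparison is by byte value. The following is a small sketch; the helper names and the sample words are made up for illustration.

```python
def sort_by_code_point(words):
    """Sort text the way Python compares str objects: by code point."""
    return sorted(words)


def sort_by_encoded_bytes(words, encoding="utf-8"):
    """Encode each word, sort the raw bytes by byte value, decode again."""
    encoded = sorted(word.encode(encoding) for word in words)
    return [item.decode(encoding) for item in encoded]


# Illustrative sample mixing ASCII, Latin, CJK, symbols and a non-BMP
# code point (U+1F40D) next to a high BMP code point (U+FFFF).
SAMPLE = [
    "zebra",
    "Zebra",
    "éclair",
    "日本語",
    "ℙƴ☂ℌøἤ",
    "\U0001F40D",
    "\uffff",
    "apple",
]

if __name__ == "__main__":
    print(sort_by_code_point(SAMPLE) == sort_by_encoded_bytes(SAMPLE))
```

The property is specific to UTF-8 (and UTF-32); passing `encoding="utf-16-be"` to the second helper gives a different order for the last two sample entries, because non-BMP code points are encoded as surrogate pairs that sort below U+FFFF.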

On Mar 15, 2017, at 12:29 PM, Nick Coghlan wrote:
I think it's still the case that some isolation environments (e.g. Debian chroots) default to bare C locales. Often it doesn't matter, but sometimes tests or other applications run inside those environments will fail in ways they don't in a normal execution environment. The answer is almost always to explicitly coerce those environments to C.UTF-8 for Linuxes that support that. -Barry

On 16 March 2017 at 00:30, Barry Warsaw <barry@python.org> wrote:
Yeah, I think mock (the Fedora/RHEL/CentOS build environment for RPMs) still defaults to a bare C locale, and Docker environments usually aren't systemd-managed in the first place (since PID 1 inside a container typically isn't an init system at all). The general trend for all of those seems to be "they don't use C.UTF-8... yet", though (even though some of them may not shift until the default changes at the level of the given distro's libc implementation). The answer is almost always to
explicitly coerce those environments to C.UTF-8 for Linuxes that support that.
I also double checked that "LANG=C ./python -m test" still worked with the reference implementation. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Nick and all core devs who are interested in this PEP. I'm reviewing PEP 538 and I want to accept it this month. It will reduce much of the UnicodeError pain that server-side ops people are facing. Thank you Nick for working on this PEP. If anything about this PEP worries you, please post a comment soon. If you don't have enough time to read the entire PEP, feel free to ask a question about whatever you're worried about. Here are my comments:
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html says:
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
I don't know much about the .NET runtime on Unix (Mono and .NET Core). "Go, Node.js and Rust" seem to be enough examples.
"locale warning" means the warning printed when the C locale is used, am I right? As I understand it, the "locale warning" is shown in these cases (all cases imply running under the C locale with PYTHONUTF8 not enabled):

a. The C locale is used and locale coercion is disabled by the ``--without-c-locale-coercion`` configure option.
b. Locale coercion failed since none of the C.UTF-8, C.utf8 or UTF-8 locales is available.
c. Python is embedded. Locale coercion can't be used in this case.

In case of (b), while the warning about the C locale is not shown, the warning about coercion is still shown. So when people don't want to see a warning under the C locale and there are no (C.UTF-8, C.utf8, UTF-8) locales, there are three ways:

* Set PYTHONUTF8=1 (if PEP 540 is accepted)
* Set PYTHONCOERCECLOCALE=0.
* Use both the ``--without-c-locale-coercion`` and ``--without-c-locale-warning`` configure options.

Is my understanding right?

BTW, I would prefer that PEP 540 provide a ``--with-utf8mode`` option which enables UTF-8 mode by default. And if it is added, there are too few use cases for ``--without-c-locale-warning``. There are some use cases where people want to use UTF-8 by default system wide (e.g. containers, web servers on CentOS, etc.). On the other hand, most C locale usage is on a "per application" basis, rather than "system wide". A configure option is not suitable for such per-application settings, of course. But I don't propose removing the option from PEP 538. We can discuss about reducing configure options later.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, Windows) these preprocessor variables would always be undefined.
Why does ``--with[out]-c-locale-coercion`` have no effect on macOS, iOS and Android? On Android, locale coercion fixes readline. Do you mean locale coercion always happens regardless of this configuration option? On macOS, ``LC_ALL=C python`` doesn't make Python's stdio to ``ascii:surrogateescape``? Even so, locale coercion may fix libraries like readline and curses. While the C locale is less common on macOS, I don't see any reason to disable coercion on macOS. I know almost nothing about iOS, but I expect it to be similar to Android or macOS.
Improving the handling of the C locale --------------------------------------
...
The JVM and .NET examples are misleading again. They just use UTF-16-LE for syscalls on Windows, like Python. I don't know much about them, but I believe they don't use UTF-16 as the system encoding on Linux.
I agree that this PEP shouldn't break byte-transparent behavior in the C locale by coercing. But I feel the behavior difference between a coerced C.UTF-8 locale and the usual C.UTF-8 locale can be a pitfall. I read the following part of the section and I agree that there is no way to solve every issue. But how about using the surrogateescape handler in C.* locales, as in the C locale? It solves Python 3.7 subprocess under Python 3.7 with coerced C.UTF-8 locale at least. Anyway, I think https://bugs.python.org/issue15216 should be fixed in Python 3.7 too. Python applications which require byte-transparent stdio can use `set_encoding(errors="surrogateescape")` explicitly. Regards,

On 4 May 2017 at 12:24, INADA Naoki <songofacandy@gmail.com> wrote:
I'll push an update to drop the JVM and .NET from the list of examples.
Yes, that sounds right.
Yeah, in addition to Barry requesting such an option in one of the earlier linux-sig reviews, my main rationale for including it is that providing both config options offers a quick compatibility fix for any distro where emitting the coercion and/or C locale warning on stderr causes problems. The only one of those that Fedora encountered in the F26 alpha was deemed a bug in the affected application (something in autotools was checking for "no output on stderr" instead of "subprocess exit code is 0", and the fix was to switch it to check the subprocess exit code), but there are enough Linux distros and BSD variants out there that I'm a lot more comfortable shipping the change with straightforward "off" switches for the stderr output.
But I don't propose removing the option from PEP 538. We can discuss about reducing configure options later.
+1.
On these three, we know the system encoding is UTF-8, so we never interpreted the C locale as meaning "ascii" in the first place.
Right, the change for Android is that we switch to calling 'setlocale(LC_ALL, "C.UTF-8")' during interpreter startup instead of 'setlocale(LC_ALL, "")'. That change is guarded by "#ifdef __ANDROID__", rather than either of the new conditionals.
On macOS, ``LC_ALL=C python`` doesn't make Python's stdio to ``ascii:surrogateescape``?
Similar to Android, CPython itself is hardcoded to assume UTF-8 on Mac OS X, since that's a platform API guarantee that users can't change.
My understanding is that other libraries and applications also automatically use UTF-8 for system interfaces on Mac OS X and iOS. It could be that that understanding is wrong, and locale coercion would provide a benefit there as well. (Checking the draft implementation, it turns out I haven't actually implemented the configure logic to make those config settings platform dependent yet - they're currently only undefined on Windows by default, since that doesn't use the autotools based build system)
Sorry, this was ambiguous - it's meant to refer to applications calling in to the JVM or CLR app runtime, not to the JVM or CLR calling out to the host operating system. I'll try to make it clearer in the next update.
That would be entirely possible, as the code responsible for that adjustment is the lines:

    char *loc = setlocale(LC_CTYPE, NULL);
    if (loc != NULL && strcmp(loc, "C") == 0)
        errors = "surrogateescape";

Changing that to include "C.UTF-8" as a second locale that also implies the use of `surrogateescape` would be low risk, and means we wouldn't need to call Py_SetStandardStreamEncoding. As a result, non UTF-8 data (such as latin-1 or GB-18030) would automatically round-trip, regardless of whether C.UTF-8 was explicitly set as the locale, or reached as the result of locale coercion.
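A rough Python rendering of that proposed check, together with the round-trip behaviour it is meant to enable, might look like the sketch below. The helper names and the `SURROGATEESCAPE_LOCALES` set are hypothetical (the real logic lives in CPython's C startup code), and "strict" is assumed as the fallback because that is the ordinary default for stdin and stdout.

```python
# Assumption for illustration: the two locale names discussed in this
# message. The real check is written in C inside CPython's startup code.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8"})


def default_stream_errors(ctype_locale):
    """Pick the default stdin/stdout error handler for an LC_CTYPE name."""
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"


def round_trip(raw, encoding, errors):
    """Decode bytes to text and encode them straight back again."""
    return raw.decode(encoding, errors).encode(encoding, errors)


if __name__ == "__main__":
    latin1_data = "café".encode("latin-1")  # not valid UTF-8
    errors = default_stream_errors("C.UTF-8")
    print(round_trip(latin1_data, "utf-8", errors) == latin1_data)
```

With "surrogateescape", the undecodable bytes are mapped to lone surrogates on input and mapped back to the same bytes on output, which is why latin-1 or GB-18030 data can pass through a nominally UTF-8 stream unchanged.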
It solves Python 3.7 subprocess under Python 3.7 with coerced C.UTF-8 locale at least.
It will also extend host/container encoding mismatch compatibility to containers that explicitly set the C.UTF-8 locale. That makes me more confident in making that change, as it would be rather counterproductive if our changes gave base image developers an incentive *not* to set C.UTF-8 as their default locale.
Agreed. Cheers, Nick. P.S. I've pushed the JVM/CLR related clarifications, but the standard stream changes will require a bit more thought and corresponding updates to the reference implementation - I'll aim to get to that this weekend. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

I tried Python 3.6 on macOS 10.11 El Capitan.

    $ LANG=C python3 -c 'import locale; print(locale.getpreferredencoding())'
    US-ASCII

And the interactive shell (which uses readline by default) doesn't accept non-ASCII input anymore. https://www.dropbox.com/s/otshuzhnw7a71n5/macos-c-locale-readline.gif?dl=0

I think many problems with the C locale are the same on macOS too. So I don't think any special casing is required on macOS. Regards,

On Thu, 4 May 2017 11:24:27 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
From my POV, it is problematic that the behaviour outlined in PEP 538 (see Abstract section) varies depending on the adoption of another PEP (PEP 540). If we want to adopt PEP 538 before pronouncing on PEP 540, then PEP 538 should remove all points conditional on PEP 540 adoption, and PEP 540 should later be changed to adopt those removed points as PEP 540-specific changes. Regards Antoine.

On 5 May 2017 at 02:25, Antoine Pitrou <solipsis@pitrou.net> wrote:
While I won't be certain until I update the PEP and reference implementation, I'm pretty sure Inada-san's suggestion to replace the call to Py_SetStandardStreamEncoding with defaulting to surrogateescape on the standard streams in the C.UTF-8 locale will remove this current dependency between the PEPs as well as making the "C.UTF-8 locale" and "C locale coerced to C.UTF-8" behaviour indistinguishable at runtime (aside from the stderr warning in the latter case). It will then be up to Victor to state in PEP 540 how locale coercion will interact with Python UTF-8 mode (with my recommendation being the one currently in PEP 538: it should implicitly set the environment variable, so the mode activation is inherited by subprocesses) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, May 4, 2017 at 6:25 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
This is kind of an aside, but regardless of the dependency relationship between PEP 538 and 540, given that they kind of go hand-in-hand would it make sense to rename them--e.g. have PEP 539 and PEP 540 trade places, since PEP 539 has nothing to do with this and is awkwardly nestled between them. Or would that only confuse matters at this point? Thanks, Erik

On 5 May 2017 at 19:45, Erik Bray <erik.m.bray@gmail.com> wrote:
While we have renumbered PEPs in the past, it was only in cases where the PEPs were relatively new, so there weren't many discussions referencing them under their existing numbers. In this case, both PEP 539 and 540 have already been discussed extensively, so renumbering them would cause problems without providing any corresponding benefit (Python's development is sufficiently high volume that it isn't unusual for related PEPs to have non-sequential PEP numbers) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 5 May 2017 at 23:21, INADA Naoki <songofacandy@gmail.com> wrote:
Don't forget that Victor's still working on the design of PEP 540, so it isn't ready for pronouncement yet. Antoine's request was for me to update PEP *538* to eliminate the "this will need to change if PEP 540 is accepted" aspects, and I think your suggestion to make the "C.UTF-8 -> surrogateescape on standard streams by default" behaviour independent of the locale coercion will achieve that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Nick. After thinking about the relationship between PEP 538 and 540 for two days, I came up with an idea which removes locale coercion by default from PEP 538, so that it just enables UTF-8 mode and shows a warning about the C locale. Of course, this idea is based on PEP 540. There is no "if PEP 540 is rejected" case. What do you think? If it makes sense, I want to postpone PEP 538 until PEP 540 is accepted or rejected, or merge PEP 538 into PEP 540.

## Background

Locale coercion in the current PEP 538 has some downsides:

* If the user sets `LANG=C LC_DATE=ja_JP.UTF-8`, locale coercion may override LC_DATE.
* It makes behavior divergence between standalone and embedded Python.
* The parent Python process may use utf-8:surrogateescape, but a child Python process may use utf-8:strict. (Python 3.6 uses ascii:surrogateescape in both parent and children.)

On the other hand, the benefits of locale coercion are restricted:

* When locale coercion succeeds, a warning is always shown. To hide the warning, the user must disable coercion in some way (e.g. use a UTF-8 locale explicitly, or set PYTHONCOERCECLOCALE=0).

So I feel benefit / complexity ratio of locale coercion is less than UTF-8 mode. But locale coercion works nicely on Android. And there are some Android-like Unix systems (containers or small devices) where C.UTF-8 is always the proper locale.

## Rough spec

* Android-style locale coercion (forced, no warning) becomes a build option. Some users who build Python for containers or small devices may like it.
* A normal Python build doesn't change the locale. When the python executable is run in the C locale, it shows the locale warning. The locale warning can be disabled as in the current PEP 538.
* Users can disable automatic UTF-8 mode by setting the PYTHONUTF8=0 environment variable. Users can also hide the warning by setting PYTHONUTF8=1.

On Fri, May 5, 2017 at 10:21 PM, INADA Naoki <songofacandy@gmail.com> wrote:

On 7 May 2017 at 15:22, INADA Naoki <songofacandy@gmail.com> wrote:
The main problems I see with this approach are:

1. There's no way to configure earlier Python versions to emulate PEP 540. It's a completely new mode of operation.
2. PEP 540 isn't actually defined yet (Victor is still working on it)
3. Due to 1 & 2, PEP 540 isn't something 3.6 redistributors can experiment with backporting to a narrower target audience

By contrast, you can emulate PEP 538 all the way back to Python 3.1 by setting the following environment variables:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

(assuming your platform provides a C.UTF-8 locale and you don't need to run any Python 2.x components in that same environment)

I think the specific concerns you raise below are valid though, and I'd be happy to amend PEP 538 to address them all.
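As a sketch of how that environment-based emulation could be applied when launching an older Python 3 from a parent script, the helpers below copy the three settings into a child environment and ask the child to report its stdout configuration. The function names are hypothetical; the sketch assumes `sys.executable` is an acceptable stand-in for the older interpreter, and the locale variables only take effect if the platform actually provides C.UTF-8.

```python
import codecs
import os
import subprocess
import sys

# The three settings listed above for emulating PEP 538.
PEP538_STYLE_SETTINGS = {
    "LC_ALL": "C.UTF-8",
    "LANG": "C.UTF-8",
    "PYTHONIOENCODING": "utf-8:surrogateescape",
}


def pep538_style_env(base_env=None):
    """Return a copy of the environment with the emulation settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(PEP538_STYLE_SETTINGS)
    return env


def child_stdout_settings(python=sys.executable):
    """Run a child interpreter and report (codec name, error handler).

    The stream settings come from PYTHONIOENCODING, so they apply even
    if the C.UTF-8 locale itself is missing on this platform.
    """
    code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    result = subprocess.run(
        [python, "-c", code],
        env=pep538_style_env(),
        stdout=subprocess.PIPE,
        check=True,
    )
    encoding, errors = result.stdout.decode("ascii").split()
    # Normalise spellings such as "UTF-8" or "utf8" to the codec name.
    return codecs.lookup(encoding).name, errors
```

Note that this only adjusts what the child process sees; the parent's own locale and stream settings are left untouched.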
The fact it sets "LC_ALL" has previously been raised as a concern with PEP 538, so it probably makes sense to drop that aspect and just override "LANG". The scenarios where it makes a difference are incredibly obscure (involving non-default SSH locale forwarding settings for folks using SSH on Mac OS X to connect to remote Linux systems), while just setting "LANG" will be sufficient to address the "LANG=C" case that is the main driver for the PEP. That means in the case above, the specific LC_DATE setting would still take precedence.
* It makes behavior divergence between standalone and embedded Python.
Such divergence already exists, only in the other direction: embedding applications may override the runtime's default settings, either by setting a particular locale, or by using Py_SetStandardStreamEncoding (which was added specifically to make it easy for Blender to force the use of UTF-8 on the embedded Python's standard streams, regardless of the current locale).

That said, this is also the rationale for my suggestion that we expose locale coercion as a public API:

    if (Py_LegacyLocaleDetected()) {
        Py_CoerceLegacyLocale();
    }

That would make it straightforward for any embedding application that wanted to do so to replicate the behaviour of the standard CLI. The level of divergence is also mitigated by the point in the next section.
This discrepancy is gone now thanks to your suggestion of making "surrogateescape" the default standard stream handler when one of the coercion target locales is explicitly configured - both parent processes and child processes end up with "utf-8:surrogateescape" configured on the standard streams.
The current warning is based on what we think is appropriate for Fedora downstream, but that doesn't necessarily mean it's the right approach for Python upstream, especially if the LC_ALL override is dropped. We could also opt for a model where Python 3.7 emits the coercion warning, but Python 3.8 just does the coercion silently (that rationale would then also apply to PEP 540 - we'd warn on stderr about the change in default behaviour in 3.7, but take the new behaviour for granted in 3.8). The change to make the standard stream error handler setting depend solely on the currently configured locale also helps here, since it means it doesn't matter how a process reached the state of having the locale set to "C.UTF-8". CPython will behave the same way regardless, so it makes it less important to provide an explicit notice that coercion took place.
So I feel benefit / complexity ratio of locale coercion is less than UTF-8 mode.
It isn't an either/or though - we're entirely free to do both, one based solely on the existing configuration options that have been around since 3.1, and the other going beyond those to also adjust the default behaviour of other interfaces (like "open()").
But do we *want* to support the legacy C locale in 3.7+? I don't think we do, because it will never work properly for our purposes as long as it assumes ASCII as the default text encoding. Part of the motivation for making locale coercion the default is so we can update PEP 11 to make it clear that running in the legacy C locale is no longer an officially supported configuration.
That still pushes the problem back on end users to fix, though, rather than just automatically making things like GNU readline integration work.
I think I need to better explain in the PEP why PEP 540's UTF-8 mode on its own won't be enough, as it doesn't necessarily handle locale-aware extension modules like GNU readline (this came up in the draft PR review, but I never added anything specifically to the PEP about it), and also doesn't help at all with invocation of older 3.x releases in a subprocess.

Here's an interactive session from a PEP 538 enabled CPython, where each line after the first is executed by doing "up-arrow, 4xleft-arrow, delete, enter":

    $ LANG=C ./python
    Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behavior).
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Not exactly exciting, but this is what currently happens on an older release if you only change the Python level stream encoding settings without updating the locale settings:

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
      File "<stdin>", line 0
        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21: invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not* CPython - because the editing wasn't UTF-8 aware, it corrupted the history buffer and fed such nonsense to stdin that even the surrogateescape error handler was bypassed. While PEP 540's UTF-8 mode could technically be updated to also reconfigure readline, that's *one* extension module, and only when it's running directly as part of Python 3.7.

By contrast, using a more appropriate locale setting already gets readline to play nice, even when it's running inside Python 3.5:

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Don't get me wrong, I'm definitely a fan of PEP 540, as it extends much of what PEP 538 covers beyond the standard streams and also applies it to other operating system interfaces without relying on the underlying operating system to provide a UTF-8 based locale. However, I also expect it to be plagued by extension module compatibility issues if folks attempt to use it standalone, without locale coercion to reconfigure the behaviour of extension modules appropriately.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
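The kind of corruption described above can be modelled without readline at all. The sketch below is not a reproduction of readline's internals; it only contrasts a "delete" that removes one whole character with one that removes a single byte from the UTF-8 encoded line, which is roughly what a line editor does when it believes every byte is one character. The helper names are made up for illustration.

```python
TEXT = "ℙƴ☂ℌøἤ"


def delete_char(text, index):
    """Character-aware delete: drop one whole code point."""
    return text[:index] + text[index + 1:]


def delete_byte(data, offset):
    """Byte-oriented delete: drop a single byte, ignoring the encoding."""
    return data[:offset] + data[offset + 1:]


def is_valid_utf8(data):
    """Report whether the bytes still decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    index = TEXT.index("ø")
    # Offset of the first byte of "ø" in the encoded line.
    offset = len(TEXT[:index].encode("utf-8"))
    print(delete_char(TEXT, index))
    print(is_valid_utf8(delete_byte(TEXT.encode("utf-8"), offset)))
```

Deleting the first byte of a two-byte character leaves an orphaned continuation byte behind, so the edited line is no longer valid UTF-8 by the time it reaches the interpreter.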

On 8 May 2017 at 15:34, Nick Coghlan <ncoghlan@gmail.com> wrote:
It occurs to me we can even still handle the forwarded "LC_CTYPE=UTF-8" case by changing the locale coercion to set LC_CTYPE & LANG, rather than just setting LANG as I suggested above. That way `LANG=C LC_DATE=ja_JP.UTF-8` would still respect the explicit LC_DATE setting, `LC_CTYPE=C` would be handled the same way as `LANG=C`, and LC_ALL=C would continue to provide a way to force the C locale even for LC_CTYPE without needing to be aware of the Python specific PYTHONCOERCECLOCALE setting. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
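The decision described in this message can be sketched as a pure function over an environment mapping. Everything here is hypothetical and simplified: the real implementation is C code that calls `setlocale`, the target locale is passed in rather than probed, treating "POSIX" and an unset locale as equivalent to "C" is an assumption, and the function names are made up.

```python
# Assumption: names treated as "the legacy C locale" for this sketch.
LEGACY_C_NAMES = frozenset({"C", "POSIX"})


def effective_lc_ctype(env):
    """Resolve LC_CTYPE using the POSIX precedence LC_ALL > LC_CTYPE > LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # empty values count as unset
            return value
    return "C"


def plan_coercion(env, target="C.UTF-8"):
    """Return the variables a coercion step would set ({} means none)."""
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return {}  # explicit Python-specific opt-out
    if env.get("LC_ALL"):
        return {}  # LC_ALL overrides LC_CTYPE, so it is left in charge
    if effective_lc_ctype(env) not in LEGACY_C_NAMES:
        return {}  # a non-C locale is already configured
    return {"LC_CTYPE": target, "LANG": target}


if __name__ == "__main__":
    example = {"LANG": "C", "LC_DATE": "ja_JP.UTF-8"}
    updated = dict(example, **plan_coercion(example))
    print(updated["LC_DATE"])  # the explicit category setting is untouched
```

Because only LC_CTYPE and LANG are ever written, any other explicitly set category survives, and `LC_ALL=C` remains a locale-level way to opt out without knowing about PYTHONCOERCECLOCALE.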

On 9 May 2017 at 13:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've posted an updated reference implementation that works this way, and it turned out to have some rather nice benefits: not only did it make the handling of full locales (C.UTF-8, C.utf8) and partial locales (UTF-8) more consistent (allowing for a net deletion of code), it also meant I no longer needed a custom test case in _testembed to check the locale warning. Instead, the affected test cases now just set "LC_ALL" as a locale override that switches off CPython's locale coercion without also switching off the locale warning. Code changes: https://github.com/ncoghlan/cpython/commit/476a78133c94d82e19b89f50036cecd9b... Rather than posting the PEP updates here though, I'll start a new thread that explains what has changed since my initial posting to python-dev back in March. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Mar 4, 2017 at 11:50 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
It feels like having a short section on the caveats of this approach would help to introduce this section. Something that says that this PEP can cause a split in how Python behaves in non-standalone applications (mod_wsgi, IDEs where libpython is compiled in, etc.) vs standalone (unless the embedders take similar steps to those standalone Python is taking). Then go on to state that this approach was still chosen because coercing in Py_Initialize is too late, causing the inconsistencies and problems listed here.

-Toshio

On 5 March 2017 at 17:50, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've just pushed a significant update to the PEP based on the discussions in this thread: https://github.com/python/peps/commit/2fb53e7c1bbb04e1321bca11cc0112aec69f63...

The main change at the technical level is to modify the handling of the coercion target locales such that they *always* lead to "surrogateescape" being used by default on the standard streams. That means we don't need to call "Py_SetStandardStreamEncoding" during startup, that subprocesses will behave the same way as their parent processes, and that Python in Linux containers will behave consistently regardless of whether the container locale is set to "C.UTF-8" explicitly, or is set to "C" and then coerced to "C.UTF-8" by CPython.

That change also eliminated the behaviour that was contingent on whether or not PEP 540 was accepted - PEP 540 may still want to have the coercion target locales imply full UTF-8 mode rather than just setting the stream error handler differently, but that will be a question to be considered when reviewing PEP 540 rather than needing to worry about it now.

The second technical change is that the locale coercion and warning are now enabled on Android and Mac OS X. For Android, that's a matter of getting GNU readline to behave sensibly, while for Mac OS X, it's a matter of simplifying the implementation and improving cross-platform behavioural consistency (even though we don't expect the coercion to actually have much impact there).

Beyond that, the PEP update focuses on clarifying a few other points without actually changing the proposal.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
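
For anyone unfamiliar with what "surrogateescape" buys us on the standard streams, here is a small illustration using only the standard codec machinery. It is a sketch of the error handler's behaviour, not of CPython's startup code: the `io.BytesIO` buffer merely stands in for a real stdout, and the sample bytes are made up.

```python
import io

raw = b"caf\xe9"  # Latin-1 encoded bytes, which are not valid UTF-8

# Strict decoding rejects the stray byte outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the undecodable byte to a lone surrogate instead.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte unchanged.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A text stream configured the same way passes the data through intact.
buffer = io.BytesIO()  # stand-in for a real standard stream
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="surrogateescape")
stream.write(text)
stream.flush()
written = buffer.getvalue()
```

The practical effect is that bytes in an unexpected encoding arriving via an operating system interface can be passed along to a stream without raising an exception or being silently altered.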

On 6 May 2017 at 18:00, Nick Coghlan <ncoghlan@gmail.com> wrote:
Working on the revised implementation for this, I've ended up refactoring it so that all the heavy lifting is done by a single function exported from the shared library: "_Py_CoerceLegacyLocale()".

The CLI code then just contains the check that says "Are we running in the legacy C locale? If so, call _Py_CoerceLegacyLocale()", with all the details of how the coercion actually works being hidden away inside pylifecycle.c.

That seems like a potential opportunity to make the 3.7 version of this a public API, using the following pattern:

    if (Py_LegacyLocaleDetected()) {
        Py_CoerceLegacyLocale();
    }

That way applications embedding CPython that wanted to implement the same locale coercion logic would have an easy way to do so.

Thoughts?

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 6 May 2017 at 18:33, Nick Coghlan <ncoghlan@gmail.com> wrote:
OK, the reference implementation has been updated to match the latest version of the PEP: https://github.com/ncoghlan/cpython/commit/188e7807b6d9e49377aacbb287c074e5c...

For now, the implementation in the standalone CLI looks like this:

    /* [snip] */
    extern int _Py_LegacyLocaleDetected(void);
    extern void _Py_CoerceLegacyLocale(void);
    /* [snip] */
    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

If we decide to make this a public API for 3.7, the necessary changes would be:

- remove the leading underscore from the function names
- add the function prototypes to the pylifecycle.h header
- add the APIs to the C API documentation in the configuration & initialization section
- define the APIs in the PEP
- adjust the backport note in the PEP to say that backports should NOT expose the public C API, but keep it private

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
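
For readers who want to experiment with the detect-then-coerce pattern without building CPython, it can be modelled roughly in Python. This is a sketch under stated assumptions rather than a port of the C code: `legacy_locale_detected` and `coerce_legacy_locale` are hypothetical Python names, the detection step is simplified to a string comparison, and the setter is passed in as a callable so the fallback order can be exercised on platforms that lack the target locales.

```python
import locale

# Coercion targets named in the PEP, in the order they are tried.
TARGET_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def legacy_locale_detected(current_ctype):
    """Simplified detection: is LC_CTYPE the legacy C locale?"""
    return current_ctype == "C"


def coerce_legacy_locale(set_ctype, targets=TARGET_LOCALES):
    """Try each target in turn; return the first one accepted, else None.

    set_ctype is any callable that raises locale.Error when the platform
    does not provide the requested locale.
    """
    for target in targets:
        try:
            set_ctype(target)
        except locale.Error:
            continue  # not available here, fall through to the next one
        return target
    return None


if __name__ == "__main__":
    # Passing None queries the current setting without changing it.
    current = locale.setlocale(locale.LC_CTYPE, None)
    if legacy_locale_detected(current):
        chosen = coerce_legacy_locale(
            lambda name: locale.setlocale(locale.LC_CTYPE, name)
        )
        print("coerced to:", chosen)
    else:
        print("no coercion needed for:", current)
```

Note that calling `locale.setlocale` from a running interpreter only changes the C level locale for that process. It does not reconfigure the already created standard streams, which is part of why the real coercion has to happen before interpreter initialization.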
participants (10)

- Antoine Pitrou
- Barry Warsaw
- Chris Barker
- Erik Bray
- Guido van Rossum
- INADA Naoki
- Jakub Wilk
- Nick Coghlan
- Random832
- Toshio Kuratomi