[Python-Dev] PEP 538 (review round 2): Coercing the legacy C locale to a UTF-8 based locale

Nick Coghlan ncoghlan at gmail.com
Tue May 9 07:57:04 EDT 2017


Hi folks,

Enough changes have accumulated in PEP 538 since the start of the
previous thread that it seems sensible to me to start a new thread
specifically covering the current design (which aims to address all
the concerns raised in the previous thread).

I haven't requoted the PEP in full since it's so long, but will
instead refer readers to the web version:
https://www.python.org/dev/peps/pep-0538/

I also generated a diff covering the full changes to the PEP text:

* https://gist.github.com/ncoghlan/1067805fe673b3735ac854e195747493/revisions
(this is the diff covering the last few days of changes)

Summarising the key technical changes:

* to make the runtime behaviour independent of whether or not locale
coercion took place, stdin and stdout now always have
"surrogateescape" as their error handler in the potential coercion
target locales. This means Python will behave the same way regardless
of whether the locale gets set externally (e.g. by a parent Python
process or a container image definition) or implicitly during CLI
startup
* for the full locales, the interpreter now sets LC_CTYPE and LANG,
*not* LC_ALL. This means LC_ALL is once again a full locale override,
and also means that CPython won't inadvertently interfere with other
locale categories like LC_MONETARY, LC_NUMERIC, etc
* the reference implementation has been refactored so the bulk of the
new code lives in the shared library and is exposed to the linker via
a couple of underscore prefixed API symbols
(_Py_LegacyLocaleDetected() and _Py_CoerceLegacyLocale()). While the
current PEP still keeps them private, it would be straightforward to
make them public for use in embedding applications if we decided we
wanted to do so.
* locale coercion and warnings are now enabled by default on all
platforms that use the autotools-based build chain - the assumption
that some platforms didn't need them turned out to be incorrect
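
As a rough illustration of the LC_CTYPE/LANG versus LC_ALL point above,
the sketch below models the usual POSIX lookup order for a locale
category (LC_ALL first, then the category's own variable, then LANG).
It is not the reference implementation: effective_locale() and the
example environments are hypothetical names chosen for illustration.

```python
def effective_locale(env, category):
    """Return the locale name POSIX rules select for `category`.

    Lookup order: LC_ALL, then the category's own variable, then LANG,
    falling back to the default "C" locale. Empty values count as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# Hypothetical starting environment: legacy C locale for text handling,
# but an explicitly configured numeric locale.
base = {"LANG": "C", "LC_NUMERIC": "de_DE.UTF-8"}

# Approach described above: set LC_CTYPE and LANG, leave LC_ALL alone.
coerced = dict(base, LC_CTYPE="C.UTF-8", LANG="C.UTF-8")

# For comparison only: setting LC_ALL instead.
coerced_via_lc_all = dict(base, LC_ALL="C.UTF-8")

# A user-supplied LC_ALL still overrides everything.
user_override = dict(coerced, LC_ALL="C")

print(effective_locale(coerced, "LC_CTYPE"))
print(effective_locale(coerced, "LC_NUMERIC"))
print(effective_locale(coerced_via_lc_all, "LC_NUMERIC"))
print(effective_locale(user_override, "LC_CTYPE"))
```

In this model, setting LC_CTYPE and LANG leaves the explicitly
configured LC_NUMERIC alone, while setting LC_ALL would replace it, and
an LC_ALL supplied by the user still wins over the coerced values.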

In addition to being updated to cover the above changes, the Rationale
section of the PEP has also been updated to explain why it doesn't
propose setting PYTHONIOENCODING, and to walk through some examples of
the problems with GNU readline compatibility when the current locale
isn't set correctly.

The essential related changes to the reference implementation can be seen here:

* Always set "surrogateescape" for coercion target locales,
independently of whether or not coercion occurred:
https://github.com/ncoghlan/cpython/commit/188e7807b6d9e49377aacbb287c074e5cabf70c5
* Stop setting LC_ALL:
https://github.com/python/peps/commit/2f530ce0d1fd24835ac0c6f984f40db70482a18f

(There are also some smaller cleanup commits that can be seen by
browsing that branch on GitHub)
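
For readers less familiar with the "surrogateescape" error handler
mentioned above, here is a minimal sketch of what it does at the codec
level. It only uses the standard bytes.decode()/str.encode() calls; the
sample bytes are an arbitrary example and nothing here is taken from the
reference implementation.

```python
raw = b"caf\xe9"  # Latin-1 encoded bytes, not valid UTF-8

# Strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# code point (here 0xE9 becomes U+DCE9) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(text), round_tripped == raw)
```

The round trip is the useful property here: data that is not valid
UTF-8 can pass through a text stream and come out byte-for-byte
unchanged. The handler actually in effect on a given stream can be
checked via its "errors" attribute (e.g. sys.stdout.errors).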

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

