PEP 538 (review round 2): Coercing the legacy C locale to a UTF-8 based locale

Hi folks,

Enough changes have accumulated in PEP 538 since the start of the previous thread that it seems sensible to me to start a new thread specifically covering the current design (which aims to address all the concerns raised in the previous thread).

I haven't requoted the PEP in full since it's so long, but will instead refer readers to the web version: https://www.python.org/dev/peps/pep-0538/

I also generated a diff covering the full changes to the PEP text:

* https://gist.github.com/ncoghlan/1067805fe673b3735ac854e195747493/revisions (this is the diff covering the last few days of changes)

Summarising the key technical changes:

* to make the runtime behaviour independent of whether or not locale coercion took place, stdin and stdout now always have "surrogateescape" as their error handler in the potential coercion target locales. This means Python will behave the same way regardless of whether the locale gets set externally (e.g. by a parent Python process or a container image definition) or implicitly during CLI startup
* for the full locales, the interpreter now sets LC_CTYPE and LANG, *not* LC_ALL. This means LC_ALL is once again a full locale override, and also means that CPython won't inadvertently interfere with other locale categories like LC_MONETARY, LC_NUMERIC, etc
* the reference implementation has been refactored so the bulk of the new code lives in the shared library and is exposed to the linker via a couple of underscore prefixed API symbols (_Py_LegacyLocaleDetected() and _Py_CoerceLegacyLocale()). While the current PEP still keeps them private, it would be straightforward to make them public for use in embedding applications if we decided we wanted to do so.
* locale coercion and warnings are now enabled by default on all platforms that use the autotools-based build chain - the assumption that some platforms didn't need them turned out to be incorrect

In addition to being updated to cover the above changes, the Rationale section of the PEP has also been updated to explain why it doesn't propose setting PYTHONIOENCODING, and to walk through some examples of the problems with GNU readline compatibility when the current locale isn't set correctly.

The essential related changes to the reference implementation can be seen here:

* Always set "surrogateescape" for coercion target locales, independently of whether or not coercion occurred: https://github.com/ncoghlan/cpython/commit/188e7807b6d9e49377aacbb287c074e5c...
* Stop setting LC_ALL: https://github.com/python/peps/commit/2f530ce0d1fd24835ac0c6f984f40db70482a1...

(There are also some smaller cleanup commits that can be seen by browsing that branch on GitHub)

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
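For readers less familiar with the "surrogateescape" error handler mentioned in the summary of changes, the following is a minimal sketch of what it does to bytes that are not valid UTF-8. It only uses the standard `bytes.decode` and `str.encode` methods; the helper name `round_trip` and the sample bytes are made up for illustration and do not come from the PEP or the reference implementation.

```python
def round_trip(raw: bytes) -> str:
    """Decode possibly invalid UTF-8 without raising, preserving the bad bytes."""
    # Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Encoding with the same handler turns those surrogates back into the
    # original bytes, so the data survives a decode/encode cycle unchanged.
    assert text.encode("utf-8", errors="surrogateescape") == raw
    return text


sample = b"caf\xe9\n"  # Latin-1 encoded text, which is not valid UTF-8

# With the default "strict" handler the same input is rejected outright.
try:
    sample.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

escaped = round_trip(sample)
print(strict_ok, ascii(escaped))
```

This is the trade-off discussed later in the thread: the streams pass unexpected bytes through rather than raising an exception.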

On 9 May 2017 at 21:57, Nick Coghlan <ncoghlan@gmail.com> wrote:
Sorry, I just noticed the copy & paste error when posting that second link. The correct link is: https://github.com/ncoghlan/cpython/commit/476a78133c94d82e19b89f50036cecd9b... Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Hi, Nick. I read it again and I think PEP 538 is mostly ready to be accepted, without waiting for PEP 540. My one remaining concern is setting LANG.
Setting LANG to C.UTF-8 ensures that even components that only check the LANG fallback for their locale settings will still use C.UTF-8 . https://www.python.org/dev/peps/pep-0538/#setting-both-lc-ctype-lang-for-utf...
I feel that setting only LC_CTYPE would make PEP 538 simpler. Is there any real component that uses LANG to decide its encoding? For example, the date command refers to LC_TIME.

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C date
    2017年 5月 23日 火曜日 17:31:03 JST

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing only LC_CTYPE
    2017年 5月 23日 火曜日 17:32:58 JST

    $ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing both LC_CTYPE and LANG
    Tue May 23 17:31:10 JST 2017

In this case, coercing only LC_CTYPE has fewer side effects. Would you add an example that demonstrates how coercing LANG helps people?
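The date example above follows from the POSIX lookup order for locale categories: LC_ALL wins, then the category's own variable, then LANG. A rough sketch of that lookup, applied to the same three environments, is below. The function `effective_locale` and the `scenarios` dict are hypothetical names for illustration; this is a simplified model and not how CPython or libc implement it.

```python
def effective_locale(env: dict, category: str) -> str:
    """Resolve one locale category from environment variables.

    Simplified model of the POSIX order: LC_ALL, then the category
    variable itself, then LANG. Empty values count as unset, and the
    fallback when nothing is set is the "C" locale.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# The three environments from the date example.
scenarios = {
    "no coercion": {"LANG": "ja_JP.UTF-8", "LC_CTYPE": "C"},
    "LC_CTYPE only": {"LANG": "ja_JP.UTF-8", "LC_CTYPE": "C.UTF-8"},
    "LC_CTYPE and LANG": {"LANG": "C.UTF-8", "LC_CTYPE": "C.UTF-8"},
}

for label, env in scenarios.items():
    print(label,
          "LC_TIME=" + effective_locale(env, "LC_TIME"),
          "LC_CTYPE=" + effective_locale(env, "LC_CTYPE"))
```

In this model only the third scenario changes LC_TIME, which matches the switch from Japanese to English output in the date example.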

On 23 May 2017 at 18:38, INADA Naoki <songofacandy@gmail.com> wrote:
Great to hear!
I'm honestly not sure it does - I think it's an assumption I added to the PEP early on, and never actually tested. Looking at it more closely now, all of the interpreter level checks are specifically for LC_CTYPE, and experimenting with "LANG=C LC_CTYPE=C.UTF-8" indicates that coercing only LC_CTYPE is still enough to fix the GNU readline encoding compatibility problem covered in https://www.python.org/dev/peps/pep-0538/#considering-locale-coercion-indepe... So I'll take another pass through the implementation this weekend, and simplify it to only set LC_CTYPE regardless of whether it's using C.UTF-8, C.utf8, or UTF-8 as the target locale. Assuming that doesn't uncover any hidden problems with the idea, I'll then update the PEP to match. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 24 May 2017 at 02:34, Nick Coghlan <ncoghlan@gmail.com> wrote:
I've now gone through this, and as far as I can tell, setting only LC_CTYPE is sufficient to handle all the scenarios that the PEP aims to address, and has fewer potential side effects than setting both LC_CTYPE and LANG.

Accordingly, I've updated both the PEP and the implementation to only set LC_CTYPE and leave LANG alone:

* PEP: https://github.com/python/peps/commit/12cecb05489e74a36a11c17e8d0b1e36e3768b...
* Implementation: https://github.com/python/cpython/pull/659/commits/939ba0a77d4b52a04315c129f...

Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Now I approve PEP 538. Its side effect (just setting the LC_CTYPE envvar) seems simple and moderate enough. Locale coercion will save people from unwanted mojibake (escaped strings), and the locale warning will guide people to configure their locale properly. There are also configure options and an envvar option to disable it for people who want to continue using the C locale explicitly. Congrats, Nick!

On 28 May 2017 at 16:46, INADA Naoki <songofacandy@gmail.com> wrote:
Thank you! And thank you for your work in reviewing the PEP - I think the accepted version is a significant improvement over the more intrusive design I originally proposed downstream in Fedora :) Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 09/05/2017, Nick Coghlan <ncoghlan@gmail.com> wrote:
I did try to follow along via the mailing list threads, and have now read over the PEP again. Responding now as I'm actually touching code relevant to this again.

Broadly the proposal looks good to me. It does help one of the two cases I care about, and does no serious harm.

For a command line Python script, making sure Python itself uses UTF-8 for the C locale is sufficient, and setting LC_CTYPE so spawned processes that aren't Python have a chance at doing the right thing too is a reasonable upgrade. This is probably good enough to drop one hack[1] rather than porting it to Python 3.

For hosted Python code this does nothing (apart from print to stderr), so mod_wsgi for instance is still going to need the same kind of dance to get users to set LANG as configuration themselves. Ideally this PEP would have a C API or something so I could file bugs to make it just do the right thing.

A few notes on specifics, I'll try not to stray too much into choices already made.

The PEP doesn't persuade me that Py_Initialize actually is too late to switch *specifically* from ascii to utf-8. Any preceding operations that operate on unicode would have been a safe subset. There might be issues with other internals, or surrogateescape, or it's just a pain?

I don't like the side effect of changing the standard stream error handler to surrogateescape if LANG=C.UTF-8 is actually set. Obviously bad data vs exception is a trade-off anyway, but it means that to get a Python script that will always output valid data or exit, you have to set an arbitrary language like en_US. Yes, that's true of the change as implemented in 3.5 anyway.

Not setting LANG and only setting LC_CTYPE seems fine. Obviously, things can go wrong based on odd behaviours of spawned processes, but it works for the normal idioms.

I'm not sold on adding the PYTHONCOERCECLOCALE runtime configuration. All it really does is turn off stderr kipple if you must use the C locale for other reasons? Anyone with the ability to set that variable could just set LANG instead. I was going to suggest just documenting LC_ALL=C as the override instead of adding a Python-specific variable, but I note, looking around, that Debian discourages that[3].

That's all, though I will also grumble a bit about how long the PEP is.

Martin

[1] Override Py_FileSystemDefaultEncoding to utf-8 from ascii for the bzr script <https://code.launchpad.net/~gz/bzr/filesystem_default_encoding_794353/+merge...>
[2] WSGIDaemonProcess lang and locale options <https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemo...>
[3] "Using LC_ALL is strongly discouraged as it overrides everything" <https://wiki.debian.org/Locale#Configuration>

On 12 June 2017 at 10:05, Martin (gzlist) via Python-Dev <python-dev@python.org> wrote:
`PYTHONIOENCODING=:strict` remains the preferred way of forcing strict encoding checks on the standard streams, regardless of locale.
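As a rough way to see the effect of that setting, the sketch below starts a child interpreter with `PYTHONIOENCODING=:strict` and asks it to report the error handler on its own stdout. The helper name `child_stdout_errors` is made up for this illustration; it relies only on `subprocess.run`, `sys.executable` and the documented `PYTHONIOENCODING` variable, and it assumes the child is not started with options that make it ignore environment variables.

```python
import os
import subprocess
import sys


def child_stdout_errors(extra_env: dict) -> str:
    """Run a child Python and return the error handler of its sys.stdout."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.errors)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# An empty encoding before the colon leaves the encoding at its default
# and only overrides the error handler.
handler = child_stdout_errors({"PYTHONIOENCODING": ":strict"})
print(handler)
```

Because the setting travels through the environment, it has to be applied wherever the script is launched from, which is the concern raised in the reply further down the thread.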
In addition to providing a reliable escape hatch with no other potentially unwanted side effects (for when folks actually want the current behaviour), the entry for the off switch in the CLI usage docs also provides us with a convenient place to document the *default* behaviour.
That's all, though I will also grumble a bit about how long the PEP is.
The ASCII-to-Unicode migration has been in progress for almost as long as Python has been around, and ASCII has been the default encoding in C for almost twice as long as that, so it takes a bit of text to explain why *now* is a good time to break with 50+ years of precedent :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Thanks for replying to my points! On 12/06/2017, Nick Coghlan <ncoghlan@gmail.com> wrote:
`PYTHONIOENCODING=:strict` remains the preferred way of forcing strict encoding checks on the standard streams, regardless of locale.
Then the user of my script has to care that it's written in Python and set that specifically in their crontab or so on...
The documentation aspect is an interesting consideration. Having thought about it a bit more, my preferred option is for the off switch to be the locale variables themselves: if either LC_ALL or LC_CTYPE is exactly 'C', then don't override. Otherwise (including for LANG=C), force C.UTF-8. The CLI usage docs could have an LC_CTYPE entry that goes into details of why. Martin
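One possible reading of the rule proposed above, written out as a small predicate. This is only a sketch of the suggestion in this message, not anything in the PEP or in CPython: the function name `should_coerce` is hypothetical, and it assumes that "otherwise" still means coercion applies only when the effective LC_CTYPE would be the legacy C locale.

```python
def should_coerce(env: dict) -> bool:
    """Sketch of the proposed rule for when to coerce LC_CTYPE to C.UTF-8."""
    # An explicit request for the C locale via LC_ALL or LC_CTYPE opts out.
    if env.get("LC_ALL") == "C" or env.get("LC_CTYPE") == "C":
        return False
    # Otherwise coerce only when the effective LC_CTYPE is still "C",
    # which by this point can only come from LANG=C or from nothing
    # being set at all (empty values count as unset).
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value == "C"
    return True


for env in ({}, {"LANG": "C"}, {"LC_CTYPE": "C"}, {"LC_ALL": "C"},
            {"LANG": "en_US.UTF-8"}):
    print(env, should_coerce(env))
```

Under this reading LANG=C alone still triggers coercion, while spelling out LC_CTYPE=C or LC_ALL=C acts as the off switch.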

That's why I think https://bugs.python.org/issue15216 should be fixed in Python 3.7 too. Python should have one preferred way to specify the encoding and error handler from inside the program, not from an envvar or command line argument. Regards,

On 12 June 2017 at 17:47, Martin (gzlist) <gzlist@googlemail.com> wrote:
As Inada-san wrote, we think the right way to fix that is to make it easier and safer for application developers to override the default settings on the standard streams. At the moment, doing so requires rebinding sys.stdin/out/err, which means you end up with multiple Python level streams sharing the one underlying C stream, which can cause problems. The basic API for that was recently merged (`TextIOWrapper.reconfigure()`), so it's now a matter of extending it to also allow updating `encoding` and `errors`.
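To make the idea concrete, here is a minimal sketch of changing `encoding` and `errors` on an existing text stream in place. It assumes a Python version where `TextIOWrapper.reconfigure()` accepts those two arguments (3.7 and later), which had not yet landed when this message was written. It uses an in-memory `io.BytesIO` as a stand-in for a standard stream; in a real program the call would be made on `sys.stdout` or similar.

```python
import io

# Stand-in for a standard stream that started out with settings the
# application does not want.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii",
                          errors="surrogateescape", newline="\n")

# Change the settings on the existing wrapper instead of rebinding a new
# wrapper around the same underlying binary stream.
stream.reconfigure(encoding="utf-8", errors="strict")

stream.write("caf\u00e9\n")
stream.flush()

written = raw.getvalue()
print(stream.encoding, stream.errors, written)
```

Since the wrapper object is reused, there is still only one Python level stream on top of the underlying buffer, which avoids the sharing problem described above.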
LC_ALL=C doesn't actually disable the locale coercion (i.e. we still set LC_CTYPE). The coercion just doesn't have any effect, since LC_ALL takes precedence. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12 June 2017 at 22:05, Nick Coghlan <ncoghlan@gmail.com> wrote:
After improving the test suite to better cover this case, it seems my assumptions regarding the behaviour of setlocale() when LC_ALL is set may have been incorrect - when LC_ALL=C is set, we *only* get the legacy locale warning, *not* the locale coercion warning (at least on Fedora - we'll know more about the behaviour on other platforms once I test my proposed resolution for https://bugs.python.org/issue30565 across the buildbot fleet). So if we chose to, we could treat an explicit "LC_CTYPE=C" the same way we treat "PYTHONCOERCECLOCALE=0" - it's definitely worth considering, so please file an RFE for that. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

participants (4)
- Ethan Furman
- INADA Naoki
- Martin (gzlist)
- Nick Coghlan