Late Python 3.7.1 changes to fix the C locale coercion (PEP 538) implementation

Hi Unicode and locales lovers,
tl;dr Nick, Ned, INADA-san: I modified 3.7.1 to add a new "-X coerce_c_locale=value" option and make sure that the C locale coercion cannot be enabled when Python is embedded: are you OK with these changes?
Before the 3.7.0 release, during the implementation of the UTF-8 Mode (PEP 540), I changed two things in Nick Coghlan's implementation of the C locale coercion (PEP 538):
(1) The PYTHONCOERCECLOCALE environment variable is now ignored when the -E or -I command line option is used.
(2) When Python is embedded, the C locale coercion is now enabled if the LC_CTYPE locale is "C".
Nick asked me to change the behavior: https://bugs.python.org/issue34589
I just pushed this change in the 3.7 branch which adds a new "-X coerce_c_locale=value" option: https://github.com/python/cpython/commit/144f1e2c6f4a24bd288c045986842c65cc2...
Examples using Python 3.7 (future 3.7.1) with UTF-8 Mode disabled, to only test the C locale coercion:
---
$ cat test.py
import codecs, locale
enc = locale.getpreferredencoding()
enc = codecs.lookup(enc).name
print(enc)
$ export LC_ALL= LC_CTYPE=C LANG=
# Disable C locale coercion: get ASCII as expected
$ PYTHONCOERCECLOCALE=0 ./python -X utf8=0 test.py
ascii
# -E ignores PYTHONCOERCECLOCALE=0:
# C locale is coerced, we get UTF-8
$ PYTHONCOERCECLOCALE=0 ./python -E -X utf8=0 test.py
utf-8
# -X coerce_c_locale=0 is not affected by -E:
# C locale coercion disabled as expected, get ASCII as expected
$ ./python -E -X utf8=0 -X coerce_c_locale=0 test.py
ascii
---
For (1), Nick's use case is to get Python 3.6 behavior (C locale not coerced) on Python 3.7 using PYTHONCOERCECLOCALE. Nick proposed to use PYTHONCOERCECLOCALE even with -E or -I, but I dislike introducing a special case for the -E option.
I chose to add a new "-X coerce_c_locale=0" to Python 3.7.1 to provide a solution for this use case. (Python 3.7.0 and older ignore this option.)
Note: Python 3.7.0 is fine with PYTHONCOERCECLOCALE=0, we are only talking about the special case of the -E and -I options.
For (2), I modified Python 3.7.1 to make sure the C locale is never coerced when the C API is used to embed Python inside an application: Py_Initialize() and Py_Main(). The C locale can only be coerced by the official Python program ("python3.7").
I don't know if it should be possible to enable C locale coercion when Python is embedded. So I just made the change requested by Nick :-)
I dislike doing such late changes in 3.7.1, especially since PEP 538 has been designed by Nick Coghlan, and we disagree on the fix. But Ned Deily, our Python 3.7 release manager, wants to see last 3.7 fixes merged before Tuesday, so here we are.
Nick, Ned, INADA-san: are you OK with these changes? The other choices for 3.7.1 are:
* Revert my change: C locale coercion can still be enabled when Python is embedded, -E option ignores PYTHONCOERCECLOCALE env var.
* Revert my change and apply Nick's PR 9257: C locale coercion cannot be enabled when Python is embedded and -E option doesn't ignore PYTHONCOERCECLOCALE env var.
I spent months fixing the master branch to support all possible locales and encodings, and get a consistent CLI: https://vstinner.github.io/python3-locales-encodings.html
So I'm not excited by Nick's PR which IMHO moves Python backward, especially since it breaks the -E option contract: it doesn't ignore the PYTHONCOERCECLOCALE env var.
Victor
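The PYTHONCOERCECLOCALE invocations in the message can also be driven from a small script. This is only a minimal sketch and not part of the original message: the helper names (make_env, make_argv, probe_encoding) are hypothetical, the probe is passed through -c instead of a test.py file, and the "-X coerce_c_locale" case is left out because that option only existed on the 3.7 branch under discussion. The reported encoding depends on the interpreter version and on which locales the platform provides, so no output is claimed here.

```python
import os
import subprocess
import sys

# Same probe as test.py in the message: print the normalized preferred encoding.
PROBE = (
    "import codecs, locale; "
    "print(codecs.lookup(locale.getpreferredencoding()).name)"
)


def make_env(coerce_setting=None):
    """Child environment with LC_CTYPE=C and LC_ALL/LANG emptied.

    coerce_setting, if given, becomes PYTHONCOERCECLOCALE (hypothetical helper).
    """
    env = dict(os.environ)
    # Drop inherited settings that would interfere with the experiment.
    env.pop("PYTHONCOERCECLOCALE", None)
    env.pop("PYTHONUTF8", None)
    env.update({"LC_ALL": "", "LC_CTYPE": "C", "LANG": ""})
    if coerce_setting is not None:
        env["PYTHONCOERCECLOCALE"] = coerce_setting
    return env


def make_argv(python, ignore_env=False):
    """Command line for the probe; ignore_env adds the -E option."""
    argv = [python]
    if ignore_env:
        argv.append("-E")
    # -X utf8=0 disables the UTF-8 Mode so only locale coercion is tested.
    argv += ["-X", "utf8=0", "-c", PROBE]
    return argv


def probe_encoding(python, coerce_setting=None, ignore_env=False):
    """Run the probe in a child interpreter and return the printed encoding."""
    result = subprocess.run(
        make_argv(python, ignore_env),
        env=make_env(coerce_setting),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for ignore_env in (False, True):
        encoding = probe_encoding(sys.executable, "0", ignore_env)
        print("PYTHONCOERCECLOCALE=0, -E used:", ignore_env, "->", encoding)
```

Replacing sys.executable with the path of a locally built ./python allows comparing interpreters side by side.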

On Sep 17, 2018, at 21:20, Victor Stinner <vstinner@redhat.com> wrote:
I dislike doing such late changes in 3.7.1, especially since PEP 538 has been designed by Nick Coghlan, and we disagree on the fix. But Ned Deily, our Python 3.7 release manager, wants to see last 3.7 fixes merged before Tuesday, so here we are.
Just because the 3.7.1rc is scheduled doesn't mean we should throw something in, particularly if it's not fully reviewed and fully agreed upon. If it's important enough, we could delay the rc a few days ... or decide to wait for 3.7.2.
I would like to see Nick review the merged 3.7 PR and have both him and you agree that this is the thing to do for 3.7.1. I also want to make sure we understand what effect this will have on 3.7.0 users. Let's not potentially make things worse.
I'm not planning to tag 3.7.1rc for at least another 18 hours. I'm marking bpo-34589 as "release blocker" and I will not proceed until this is resolved.
Thanks!
--Ned
-- Ned Deily nad@python.org -- []

I think the changes to both master and the 3.7 branch should be reverted.
For 3.7, I already said that I think we should just accept that that ship has sailed with 3.7.0 and leave the as-shipped implementation alone for the rest of the 3.7 series: https://bugs.python.org/issue34589#msg325242
It isn't the way I intended it to work, but the kinds of large scale architectural changes the intended implementation is designed to cope with aren't going to happen on a maintenance branch anyway.
For 3.8, after Victor's rushed changes have been reverted, my PR should be conflict free again, and we'll be able to get PEP 538 back to working the way it was always supposed to work (while keeping the genuine stdio handling fixes that Victor's refactoring provided): https://github.com/python/cpython/pull/9257
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le mer. 19 sept. 2018 à 09:50, Nick Coghlan <ncoghlan@gmail.com> a écrit :
I think the changes to both master and the 3.7 branch should be reverted.
Ok, I prepared a PR to revert the 3.7 change: https://github.com/python/cpython/pull/9416
For 3.7, I already said that I think we should just accept that that ship has sailed with 3.7.0 and leave the as-shipped implementation alone for the rest of the 3.7 series: (...) For 3.8, (...), my PR should be conflict free again, and we'll be able to get PEP 538 back to working the way it was always supposed to work (...)
I read all your comments, and honestly, I don't understand you. At one point you say: "we don't actually want anyone turning off locale coercion except for debugging purposes" https://bugs.python.org/issue34589#msg325554
But you also say that Python 3.7.0 is broken on CentOS 7 because it's not possible to disable C locale coercion using the -E flag: https://bugs.python.org/issue34589#msg325246
And here (your email), one more time, you insist on supporting "PYTHONCOERCECLOCALE=0 python3 -E". I don't understand whether you want PYTHONCOERCECLOCALE to be ignored when using -E or not.
Since PEP 538 is something new, we don't have much user feedback to know if it causes any trouble, so I agree that we should provide a way to disable the feature, as I provided a way to disable the UTF-8 Mode when the LC_CTYPE is C or POSIX. Just to give users full control over locales and encodings.
That's why I came up with a new -X coerce_c_locale option which can be used even with -E. I understood that you like the option, since you proposed to use it: https://bugs.python.org/issue34589#msg325493
--
Moreover, you asked me to make sure that Py_Initialize() and Py_Main() cannot enable C locale coercion. That's what I did.
--
IMHO the implementation is really a secondary concern here, the main question is: what is the correct behavior?
Nick:
* Do we agree that we need to provide a way to disable C locale coercion (PEP 538) even when -E is used?
* Do you agree that Py_Initialize() and Py_Main() must not enable the C locale coercion (PEP 538)?
I understood that your reply is yes for the second question, since you insist on pushing your change which also prevents Py_Initialize() and Py_Main() from enabling C locale coercion.
If we change 3.7.0 behavior in 3.8, I would prefer to change the behavior in 3.7.1. IMHO it's not too late to *fix* 3.7.
--
I decided to push a concrete implementation because I understood that you were OK with the -X coerce_c_locale option and you asked me to fix my mistakes.
I feel guilty that I broke the implementation of your PEP :-( Moreover, I'm also exhausted by fixing locales and encodings, I have been doing that for one year now, and I thought many times that I was done with all regressions and corner cases...
We have been discussing these issues for 3 weeks and we failed to fix them, whereas Ned asked to push last fixes before 3.7.1. I sent an email to make sure that we all agree on the solution. Well, it seems like again, we failed to agree on the expected *behavior*.
Victor

Hum, I'm not sure if I explained my opinion on these questions properly.
I consider that Python 3.7.0 introduced a regression compared to Python 3.6: it changes the LC_CTYPE locale for Python and all child processes, and it's not possible to opt out of that when using the -E command line option. I proposed (and implemented) -X coerce_c_locale=0 for that.
Unicode and locales are so hard to get right that I consider it important that we provide an option to opt out. Otherwise, someone will find a use case where Python 3.7 doesn't behave as expected and breaks one specific use case. I haven't noticed a complaint yet, but there are very few Python 3.7 users at this point. For example, very few Linux distributions use it yet.
I consider that PYTHONCOERCECLOCALE must not introduce an exception in -E: it must be ignored when -E or -I is used. For security reasons, it's important to really ignore all PYTHON* environment variables. "Unicode" (in general) has been abused in the past to exploit vulnerabilities in applications. Locales and encodings are so hard that it's easy to mess up and introduce a vulnerability just caused by encodings. It's also important to get deterministic and reproducible programs.
For Py_Initialize() and Py_Main(): I have no opinion, so I rely on Nick's request to make sure that the C locale is not coerced when Python is embedded :-)
Victor
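The -E/-I contract referred to here (all PYTHON* environment variables are ignored) can be observed with a variable unrelated to locales. The following is a minimal sketch, assuming PYTHONDONTWRITEBYTECODE as a stand-in for PYTHONCOERCECLOCALE; the function name is hypothetical and the script is not part of the original thread.

```python
import os
import subprocess
import sys


def child_honors_env_var(*flags):
    """Return True if a child interpreter picks up PYTHONDONTWRITEBYTECODE=1.

    flags are extra interpreter options such as "-E" or "-I".
    """
    env = dict(os.environ, PYTHONDONTWRITEBYTECODE="1")
    result = subprocess.run(
        [sys.executable, *flags, "-c",
         "import sys; print(sys.dont_write_bytecode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("no flag:", child_honors_env_var())
    print("-E:     ", child_honors_env_var("-E"))
    print("-I:     ", child_honors_env_var("-I"))
```

Without flags the variable takes effect; with -E or -I the child ignores it, which is the behavior Victor wants PYTHONCOERCECLOCALE to share.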

Ned, Nick, Victor,
There's an issue with the new PEP 567 (contextvars) C API.
Currently it's designed to expose "PyContext*" and "PyContextVar*" pointers. I want to change that to "PyObject*" as using non-PyObject pointers turned out to be a very bad idea (interfacing with Cython is particularly challenging).
Is it a good idea to change this in Python 3.7.1?
Yury

On Sep 19, 2018, at 13:30, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
Ned, Nick, Victor,
There's an issue with the new PEP 567 (contextvars) C API.
Currently it's designed to expose "PyContext*" and "PyContextVar*" pointers. I want to change that to "PyObject*" as using non-PyObject pointers turned out to be a very bad idea (interfacing with Cython is particularly challenging).
Is it a good idea to change this in Python 3.7.1?
It's hard to make an informed decision without a concrete PR to review. What would be the impact on any user code that has already adopted it in 3.7.0?
-- Ned Deily nad@python.org -- []

On Wed, Sep 19, 2018 at 4:26 PM Ned Deily <nad@python.org> wrote:
On Sep 19, 2018, at 13:30, Yury Selivanov <yselivanov.ml@gmail.com> wrote: [..]
Currently it's designed to expose "PyContext*" and "PyContextVar*" pointers. I want to change that to "PyObject*" as using non-PyObject pointers turned out to be a very bad idea (interfacing with Cython is particularly challenging).
Is it a good idea to change this in Python 3.7.1?
It's hard to make an informed decision without a concrete PR to review. What would be the impact on any user code that has already adopted it in 3.7.0?
Ned, I've created an issue to track this: https://bugs.python.org/issue34762
Yury

On Wed, 19 Sep 2018 at 22:07, Victor Stinner <vstinner@redhat.com> wrote:
I consider that Python 3.7.0 introduced a regression compared to Python 3.6: it changes the LC_CTYPE locale for Python and all child processes and it's not possible to opt-out for that when using -E command line option.
This *wasn't* broken in the original PEP 538 implementation - it was only broken when you ignored the PEP and tried to make everything work the same way PEP 540 did, including moving the coercion out of the Python CLI and into the runtime library APIs.
I still think the locale coercion handling in Python 3.7.x is broken, but adding MORE code is NOT the right answer: going back to the original (correct) implementation is.
So changing it back to the way the PEP is supposed to work is fine, making everything more complicated for no good reason whatsoever is not fine.
What changed is the fact I decided it wasn't worth holding up 3.7.1 over (and it certainly isn't worth adding a new -X option in a point release).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le mardi 18 septembre 2018, Victor Stinner <vstinner@redhat.com> a écrit :
Hi Unicode and locales lovers,
tl;dr Nick, Ned, INADA-san: I modified 3.7.1 to add a new "-X coerce_c_locale=value" option and make sure that the C locale coercion cannot be enabled when Python is embedded: are you OK with these changes?
Nick asked me to revert, which means that no, he is not OK with these changes.
I reverted my change in 3.7.
Victor

On Sep 19, 2018, at 15:08, Victor Stinner <vstinner@redhat.com> wrote:
Nick asked me to revert, which means that no, he is not ok with these changes.
I reverted my change in 3.7.
Thank you, Victor!
Nick, with regard to this does the current state of the 3.7 branch look acceptable now for a 3.7.1?
-- Ned Deily nad@python.org -- []

On Thu, 20 Sep 2018 at 06:48, Ned Deily <nad@python.org> wrote:
Thank you, Victor!
Nick, with regard to this does the current state of the 3.7 branch look acceptable now for a 3.7.1?
It's still broken relative to the PEP in the following respects:
- Py_Initialize() coerces the C locale to C.UTF-8, even though it's not supposed to
- Py_Main() coerces the C locale to C.UTF-8, even though it's not supposed to
- PYTHONCOERCECLOCALE=0 doesn't work if -E or -I are passed on the command line (but it's supposed to)
- PYTHONCOERCECLOCALE=warn doesn't work if -E or -I are passed on the command line (it's nominally supposed to do this too, but I'm less concerned about this one)
The problem with Victor's patch is that instead of reverting to the as-designed-and-accepted PEP the way my PR (mostly) does, it instead introduces a whole new command line option (which then needs to be documented and tested), and still coerces *far* too late (not until Py_Initialize() is already running, after who knows how much code in the embedding application has already executed).
I don't have the time required to push through Victor's insistence that -E and -I are sacrosanct and must always be respected (despite PEP 538 explicitly saying that they won't be where PYTHONCOERCECLOCALE is concerned), and so we can't *possibly* change back to having the locale coercion work the way I originally implemented it, so I wrote the 3.7.x series off as a lost cause, and decided to devote my energies to getting things back to the way they were supposed to be for 3.8+.
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Thu, 20 Sep 2018 at 20:20, Nick Coghlan <ncoghlan@gmail.com> wrote:
It's still broken relative to the PEP in the following respects:
- Py_Initialize() coerces the C locale to C.UTF-8, even though it's not supposed to
- Py_Main() coerces the C locale to C.UTF-8, even though it's not supposed to
- PYTHONCOERCECLOCALE=0 doesn't work if -E or -I are passed on the command line (but it's supposed to)
- PYTHONCOERCECLOCALE=warn doesn't work if -E or -I are passed on the command line (it's nominally supposed to do this too, but I'm less concerned about this one)
It's worth noting that even though the PYTHONCOERCECLOCALE=0 off switch doesn't currently work as described in PEP 538 when passing -E or -I, setting "LC_ALL=C" does (since that's handled by the C library, independently of any CPython command line flags).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
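The LC_ALL=C off switch can be tried with a child interpreter started with -E. This is a minimal sketch under stated assumptions, not something from the thread: the helper names are hypothetical, and it assumes that a coerced interpreter exports the new LC_CTYPE value into its own environment, so that seeing no LC_CTYPE in the child suggests coercion did not happen. Results depend on the platform and interpreter version, so no output is claimed.

```python
import os
import subprocess
import sys

# Report the LC_CTYPE variable and the locale encoding as seen by the child.
REPORT = (
    "import locale, os; "
    "print(os.environ.get('LC_CTYPE'), locale.getpreferredencoding(False))"
)


def lc_all_c_env(base=None):
    """Copy of base (default: os.environ) with LC_ALL=C as the only override.

    LC_CTYPE and PYTHONCOERCECLOCALE are removed so the child starts clean.
    """
    env = dict(os.environ if base is None else base)
    env.pop("LC_CTYPE", None)
    env.pop("PYTHONCOERCECLOCALE", None)
    env["LC_ALL"] = "C"
    return env


def report_locale(flags=("-E",)):
    """Run the report in a child interpreter and return its output line."""
    result = subprocess.run(
        [sys.executable, *flags, "-X", "utf8=0", "-c", REPORT],
        env=lc_all_c_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(report_locale())
```

Because LC_ALL is read by the C library, this switch does not depend on whether -E or -I is passed.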
participants (4)
- Ned Deily
- Nick Coghlan
- Victor Stinner
- Yury Selivanov