Python 3.5 now uses surrogateescape for the POSIX locale

Hi,

I modified Python 3.5 to use the "surrogateescape" error handler (PEP 383) for stdin and stdout when the LC_CTYPE locale is POSIX ("C" locale):
http://bugs.python.org/issue19977

New behaviour:
---
$ mkdir z
$ touch z/abcé
$ LC_CTYPE=C ./python -c 'import os; print(os.listdir("z")[0])'
abcé
---

Old behaviour, before the change (tested with Python 3.3):
---
$ LC_CTYPE=C python3 -c 'import os; print(os.listdir("z")[0])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
---

The POSIX locale is common because it is used by default when no other locale is set. Programs started from a crontab on UNIX, and daemons, commonly run in this locale.

Victor
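The mechanism behind the two sessions above can be reproduced without touching the locale. The following is a minimal sketch, not the code from the patch: it assumes the file name is stored on disk as the UTF-8 bytes of "abcé", and the variable names are made up for illustration. Undecodable bytes become lone surrogates on input, and the same handler turns them back into the original bytes on output, while the strict handler fails as in the old traceback.

```python
# UTF-8 bytes of the file name "abcé" (assumed on-disk encoding).
raw = b"abc\xc3\xa9"

# Decoding as ASCII with surrogateescape maps each undecodable byte
# 0xNN to the lone surrogate U+DCNN instead of raising an error.
name = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("ascii", errors="surrogateescape")

# With the default strict handler, the same string cannot be encoded,
# which is the UnicodeEncodeError seen in the old behaviour.
try:
    name.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ascii(name), roundtrip == raw, strict_failed)
```

The two escaped bytes end up at positions 3 and 4 of the string, which matches the "position 3-4" reported in the traceback.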

On 18 Mar 2014 11:56, "Victor Stinner" <victor.stinner@gmail.com> wrote:
Yay, thanks Victor. I'll let the Fedora folks know this has been merged, as we may seriously consider applying this as a vendor patch to our build of Python 3.4 (while I agree this isn't a bug fix, the current behaviour also poses a problem for Fedora as more core utilities start migrating to Python 3). Cheers, Nick.

2014-03-18 9:08 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:
Please don't cherry-pick this change in Fedora if it is not done in Python 3.4. It changes the behaviour of Python and I would prefer to have the same behaviour on the same Python version on all platforms. I'm not against backporting the change in Python 3.4.1. It can be seen as a bugfix. I don't think that anyone wants a Unicode error when reading or printing non-ASCII data from stdin/to stdout. But I would like the opinion of other developers before doing that. Victor

On 18 March 2014 19:13, Victor Stinner <victor.stinner@gmail.com> wrote:
Well, the concern has always been the risk of silently generating bad data if there is a mismatch between the OS encoding and the stream encodings. That's why it took so long to make this change at all - we had to figure out that the underlying problem was really the ease with which even a properly configured Linux system could end up running Python 3 code in the POSIX locale, and thus end up with improperly configured standard streams. Enabling "surrogateescape" by default only when the standard stream encoding is "ascii" helps to mitigate that risk, while still dealing with the main problem. I meant to try to get this into 3.4 (since a couple of the Fedora folks convinced me it was a problem), but there are only so many hours in the day, and it took me quite a while to fully grasp the actual problem. If folks are open to backporting this change to 3.4.1, then yes, I'd definitely prefer an upstream solution. Otherwise, it will be up to the Fedora Python maintainers to decide what they want to do.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
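The policy described above (use "surrogateescape" only when the stream encoding is ASCII, keep strict otherwise) can be sketched as a small function. This illustrates the rule as stated in the message, not how CPython implements it; `pick_error_handler` is a hypothetical name, and treating unknown encodings as strict is an assumption of this sketch.

```python
import codecs


def pick_error_handler(stream_encoding):
    """Hypothetical helper: choose an error handler for a standard stream.

    Returns "surrogateescape" only when the encoding is ASCII under any
    of its aliases, and "strict" for every other encoding.
    """
    try:
        # codecs.lookup() resolves aliases such as "ANSI_X3.4-1968"
        # (the name glibc reports for the C locale) to "ascii".
        canonical = codecs.lookup(stream_encoding).name
    except LookupError:
        # Assumption of this sketch: unknown encodings stay strict.
        return "strict"
    return "surrogateescape" if canonical == "ascii" else "strict"


print(pick_error_handler("ANSI_X3.4-1968"))
print(pick_error_handler("utf-8"))
```

Restricting the handler to ASCII streams limits the risk mentioned above: a stream with a real, configured encoding such as UTF-8 still reports mismatched data as an error.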

2014-03-18 10:48 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:
Data can be loaded from OS functions, from files and from stdin. These 3 sources may use different and incompatible encodings. surrogateescape is used by OS functions, and now also by stdin when the POSIX locale is used. When the POSIX locale is used, OS functions and stdin can use different encodings if the PYTHONIOENCODING environment variable is set. Since we are consenting adults, I guess that you understand what you are doing when you set PYTHONIOENCODING.

On Windows, the encoding of standard streams is the OEM code page, or the ANSI code page if a stream is redirected; it's unrelated to the LC_CTYPE locale. So surrogateescape can be used even if the encoding of standard streams is not ASCII. We may handle Windows differently, to use strict even if the LC_CTYPE locale is "C".

Note: On FreeBSD, Solaris and OpenIndiana, nl_langinfo(CODESET) announces an alias of the ASCII encoding when the LC_CTYPE locale is POSIX, whereas the mbstowcs() and wcstombs() functions use the ISO-8859-1 encoding. Python 3 now uses the ASCII encoding for its "filesystem" (OS) encoding.

Victor
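To see how PYTHONIOENCODING overrides the locale-derived stream settings, one can start a child interpreter and ask it what it chose. This is a rough sketch under stated assumptions: the `encoding:errors` value "utf-8:strict" and the variable names are only examples, and LC_ALL=C is set to imitate the POSIX locale case discussed above.

```python
import codecs
import os
import subprocess
import sys

# Child environment: POSIX locale, but with an explicit stream encoding
# and error handler in the "encoding:errors" form of PYTHONIOENCODING.
env = dict(os.environ)
env["LC_ALL"] = "C"
env["PYTHONIOENCODING"] = "utf-8:strict"

# The child reports the encoding and error handler of its stdout.
probe = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
output = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
).stdout

encoding, errors = output.split()
# Normalize the reported name so that spelling variants compare equal.
normalized = codecs.lookup(encoding).name
print(normalized, errors)
```

With the variable set, the stream settings no longer follow the locale, so the OS functions and the standard streams can end up with different encodings, as the message notes.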

Hello, 2014-03-18 18:13 GMT+09:00 Victor Stinner <victor.stinner@gmail.com>:
FYI: Guido was opposed to changing the error handler of stdin and stdout years ago. http://bugs.python.org/issue2630#msg65493
-- Atsuo Ishimoto Mail: ishimoto@gembook.org Twitter: atsuoishimoto

2014-03-18 11:02 GMT+01:00 Atsuo Ishimoto <ishimoto@gembook.org>:
FYI: Guido was opposed to changing the error handler of stdin and stdout years ago.
That issue proposes using the "backslashreplace" error handler for stdout. This error handler is very different from "surrogateescape", which is related to PEP 383 and used by all OS functions. Victor

Victor Stinner writes:
I would say, it's not different in the relevant aspect, which is spewing presumably unreadable bytes that may cause other code reading the output to choke. I would think backslashreplace would generally be preferred for these use cases.
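The practical difference between the two handlers on output can be seen in a short sketch. The sample strings are made up for illustration: one is a file name decoded with surrogateescape, the other is ordinary non-ASCII text.

```python
# A name that came from the OS with two undecodable bytes escaped
# as lone surrogates (U+DCC3, U+DCA9).
escaped_name = "abc\udcc3\udca9"
# Ordinary text containing a real non-ASCII character.
plain_text = "caf\u00e9"

# surrogateescape writes the original raw bytes back out.
raw_out = escaped_name.encode("ascii", errors="surrogateescape")

# backslashreplace writes a pure-ASCII escape sequence instead.
readable_out = escaped_name.encode("ascii", errors="backslashreplace")
text_out = plain_text.encode("ascii", errors="backslashreplace")

# surrogateescape only handles escaped bytes; real non-ASCII text
# still raises an error on an ASCII stream.
try:
    plain_text.encode("ascii", errors="surrogateescape")
    text_failed = False
except UnicodeEncodeError:
    text_failed = True

print(raw_out, readable_out, text_out, text_failed)
```

So surrogateescape passes undecodable input through unchanged as raw bytes, whereas backslashreplace never emits non-ASCII bytes but does not reproduce the original data.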

Participants (4):
- Atsuo Ishimoto
- Nick Coghlan
- Stephen J. Turnbull
- Victor Stinner