Python 3.5 now uses surrogateescape for the POSIX locale
Hi,

I modified Python 3.5 to use the "surrogateescape" error handler (PEP 383) for stdin and stdout when the LC_CTYPE locale is POSIX ("C" locale):
http://bugs.python.org/issue19977

New behaviour:
---
$ mkdir z
$ touch z/abcé
$ LC_CTYPE=C ./python -c 'import os; print(os.listdir("z")[0])'
abcé
---

Old behaviour, before the change (test with Python 3.3):
---
$ LC_CTYPE=C python3 -c 'import os; print(os.listdir("z")[0])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
---

The POSIX locale is common because it is used by default when no other locale is set. Programs started from a crontab on UNIX, as well as daemons, commonly run under this locale.

Victor
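The round trip behind the two transcripts can be reproduced without touching the locale. What follows is a minimal sketch, not the CPython patch itself: it assumes the file name is stored as UTF-8 bytes and that the OS encoding is ASCII, as under the POSIX locale. The variable names are illustrative only.

```python
# Bytes of the file name "abcé" as a UTF-8 file system would store them.
raw_name = "abcé".encode("utf-8")  # b'abc\xc3\xa9'

# With surrogateescape, each undecodable byte 0xNN is mapped to the lone
# surrogate U+DCNN instead of raising an error (PEP 383).
name = raw_name.decode("ascii", "surrogateescape")

# A strict ASCII encoder, as used by stdout before the change, rejects
# the two surrogates at positions 3 and 4.
try:
    name.encode("ascii", "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogateescape on the output side, the original bytes come back.
round_trip = name.encode("ascii", "surrogateescape")
```

The two surrogates sit at indexes 3 and 4 of the decoded string, which matches the "position 3-4" reported in the old traceback.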
On 18 Mar 2014 11:56, "Victor Stinner" <victor.stinner@gmail.com> wrote:
Hi,
I modified Python 3.5 to use the "surrogateescape" error handler (PEP 383) for stdin and stdout when the LC_CTYPE locale is POSIX ("C" locale): http://bugs.python.org/issue19977
Yay, thanks Victor. I'll let the Fedora folks know this has been merged, as we may seriously consider applying this as a vendor patch to our build of Python 3.4 (while I agree this isn't a bug fix, the current behaviour also poses a problem for Fedora as more core utilities start migrating to Python 3). Cheers, Nick.
2014-03-18 9:08 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:
Yay, thanks Victor. I'll let the Fedora folks know this has been merged, as we may seriously consider applying this as a vendor patch to our build of Python 3.4 (while I agree this isn't a bug fix, the current behaviour also poses a problem for Fedora as more core utilities start migrating to Python 3).
Please don't cherry-pick this change in Fedora if it is not done in Python 3.4. It changes the behaviour of Python and I would prefer to have the same behaviour on the same Python version on all platforms. I'm not against backporting the change in Python 3.4.1. It can be seen as a bugfix. I don't think that anyone wants a Unicode error when reading or printing non-ASCII data from stdin/to stdout. But I would like the opinion of other developers before doing that. Victor
On 18 March 2014 19:13, Victor Stinner <victor.stinner@gmail.com> wrote:
Please don't cherry-pick this change in Fedora if it is not done in Python 3.4. It changes the behaviour of Python and I would prefer to have the same behaviour on the same Python version on all platforms.
I'm not against backporting the change in Python 3.4.1. It can be seen as a bugfix. I don't think that anyone wants a Unicode error when reading or printing non-ASCII data from stdin/to stdout. But I would like the opinion of other developers before doing that.
Well, the concern has always been the risk of silently generating bad data if there is a mismatch between the OS encoding and the stream encodings. That's why it took so long to make this change at all - we had to figure out that the underlying problem was really the ease with which even a properly configured Linux system could end up running Python 3 code in the POSIX locale, and thus end up with improperly configured standard streams.

Enabling "surrogateescape" by default only when the standard stream encoding is "ascii" helps to mitigate that risk, while still dealing with the main problem.

I meant to try to get this into 3.4 (since a couple of the Fedora folks convinced me it was a problem), but there are only so many hours in the day, and it took me quite a while to fully grasp the actual problem.

If folks are open to backporting this change to 3.4.1, then yes, I'd definitely prefer an upstream solution. Otherwise, it will be up to the Fedora Python maintainers to decide what they want to do.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
2014-03-18 10:48 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:
Well, the concern has always been the risk of silently generating bad data if there is a mismatch between the OS encoding and the stream encodings.
Data can be loaded from OS functions, from files and from stdin. These three sources may use different, incompatible encodings. surrogateescape is used by OS functions, and now also by stdin when the POSIX locale is used.

When the POSIX locale is used, OS functions and stdin can use different encodings if the PYTHONIOENCODING environment variable is set. Since we are consenting adults, I guess that you understand what you are doing when you set PYTHONIOENCODING.

On Windows, the encoding of standard streams is the OEM code page, or the ANSI code page if a stream is redirected; it is unrelated to the LC_CTYPE locale. So surrogateescape can end up being used even when the encoding of standard streams is not ASCII. We may handle Windows differently and use strict even if the LC_CTYPE locale is "C".

Note: On FreeBSD, Solaris and OpenIndiana, nl_langinfo(CODESET) announces an alias of the ASCII encoding when the LC_CTYPE locale is POSIX, whereas the mbstowcs() and wcstombs() functions use the ISO-8859-1 encoding. Python 3 now uses the ASCII encoding for its "filesystem" (OS) encoding.

Victor
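The effect of a stream's error handler can be tried out on an in-memory stream. This is a hedged sketch: write_through is a hypothetical helper, and io.TextIOWrapper over io.BytesIO only stands in for a real standard stream. On a live interpreter the values in effect can be read from sys.stdout.encoding and sys.stdout.errors, and PYTHONIOENCODING accepts the form "encodingname:errorhandler".

```python
import io


def write_through(text, encoding, errors):
    """Write text to an in-memory text stream and return the bytes produced.

    Hypothetical helper: the TextIOWrapper mimics a standard stream that
    was configured with the given encoding and error handler.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# A file name as returned by an OS function under the POSIX locale.
name = b"abc\xc3\xa9".decode("ascii", "surrogateescape")

# Stream configured like the new Python 3.5 stdout in the POSIX locale.
escaped = write_through(name, "ascii", "surrogateescape")

# Stream configured like the old strict stdout.
try:
    write_through(name, "ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

If the stream encoding differs from the encoding used by the OS functions, the escaped bytes are still written unchanged, which is the mismatch risk discussed in this thread.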
Hello,

2014-03-18 18:13 GMT+09:00 Victor Stinner <victor.stinner@gmail.com>:
I'm not against backporting the change in Python 3.4.1. It can be seen as a bugfix. I don't think that anyone wants a Unicode error when reading or printing non-ASCII data from stdin/to stdout. But I would like the opinion of other developers before doing that.
FYI: Guido was opposed to changing the error handler of stdin and stdout years ago.
http://bugs.python.org/issue2630#msg65493
Amaury: I think it would be okay to use backslashreplace as the default error handler for sys.stderr. Probably not for sys.stdout or other files, since I'm sure many users prefer the errors when their data cannot be printed rather than silently writing \u escapes that might cause other code reading their output to choke. For sys.stderr though I think not having exceptions raised when attempting to print errors is very valuable.
-- Atsuo Ishimoto Mail: ishimoto@gembook.org Twitter: atsuoishimoto
2014-03-18 11:02 GMT+01:00 Atsuo Ishimoto <ishimoto@gembook.org>:
FYI: Guido was opposed to changing the error handler of stdin and stdout years ago.
This issue proposes to use the "backslashreplace" error handler for stdout. That error handler is very different from "surrogateescape", which is related to PEP 383 and used by all OS functions.

Victor
Victor Stinner writes:
2014-03-18 11:02 GMT+01:00 Atsuo Ishimoto <ishimoto@gembook.org>:
FYI: Guido was opposed to changing the error handler of stdin and stdout years ago.
This issue proposes to use the "backslashreplace" error handler for stdout. That error handler is very different from "surrogateescape", which is related to PEP 383 and used by all OS functions.
I would say, it's not different in the relevant aspect, which is spewing presumably unreadable bytes that may cause other code reading the output to choke. I would think backslashreplace would generally be preferred for these use cases.
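The difference between the two handlers on output can be compared directly. A minimal sketch, assuming an ASCII output encoding and the same UTF-8 file name as in the first message. The variable names are illustrative only.

```python
# File name decoded by an OS function under the POSIX locale.
name = b"abc\xc3\xa9".decode("ascii", "surrogateescape")

# surrogateescape restores the original, non-ASCII bytes on output.
as_raw_bytes = name.encode("ascii", "surrogateescape")

# backslashreplace keeps the output pure ASCII by writing escapes instead.
as_escapes = name.encode("ascii", "backslashreplace")

# For ordinary non-ASCII text (no surrogates), backslashreplace still
# succeeds, while surrogateescape raises like the strict handler does.
text_escaped = "abcé".encode("ascii", "backslashreplace")
try:
    "abcé".encode("ascii", "surrogateescape")
    plain_text_ok = True
except UnicodeEncodeError:
    plain_text_ok = False
```

With surrogateescape the consumer receives bytes outside the ASCII range. With backslashreplace it receives ASCII escape sequences that no longer match the original file name.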
participants (4)
- Atsuo Ishimoto
- Nick Coghlan
- Stephen J. Turnbull
- Victor Stinner