Mailman 3 Python 3.5 now uses surrogateescape for the POSIX locale - Python-Dev

Python 3.5 now uses surrogateescape for the POSIX locale

Victor Stinner

March 18, 2014

1:54 a.m.

Hi,

I modified Python 3.5 to use the "surrogateescape" error handler (PEP 383) for stdin and stdout when the LC_CTYPE locale is POSIX ("C" locale):
http://bugs.python.org/issue19977

New behaviour:
---
$ mkdir z
$ touch z/abcé
$ LC_CTYPE=C ./python -c 'import os; print(os.listdir("z")[0])'
abcé
---

Old behaviour, before the change (test with Python 3.3):
---
$ LC_CTYPE=C python3 -c 'import os; print(os.listdir("z")[0])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)
---

The POSIX locale is common because it is used by default when no other locale is set. It's common that programs started by a crontab on UNIX and daemons are using this locale.

Victor
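The round trip behind this behaviour can be reproduced without changing the locale. The following is a minimal sketch, assuming the file name is stored as UTF-8 bytes and the stream encoding is ASCII; the variable names are illustrative and not taken from the patch.

```python
# File name bytes as the OS might return them: "abcé" in UTF-8 (assumption).
raw_name = b"abc\xc3\xa9"

# PEP 383: bytes that ASCII cannot decode become lone surrogates
# in the range U+DC80..U+DCFF instead of raising an error.
text = raw_name.decode("ascii", errors="surrogateescape")
print(ascii(text))

# Strict encoding to ASCII fails, which is the old stdout behaviour.
try:
    text.encode("ascii", errors="strict")
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)

# With surrogateescape the original bytes are restored unchanged.
round_tripped = text.encode("ascii", errors="surrogateescape")
print(round_tripped == raw_name)
```

The two escaped bytes correspond to "position 3-4" in the old traceback: one surrogate per undecodable byte.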

Nick Coghlan

March 2014

8:08 a.m.

On 18 Mar 2014 11:56, "Victor Stinner" <victor.stinner@gmail.com> wrote:

...

Yay, thanks Victor. I'll let the Fedora folks know this has been merged, as we may seriously consider applying this as a vendor patch to our build of Python 3.4 (while I agree this isn't a bug fix, the current behaviour also poses a problem for Fedora as more core utilities start migrating to Python 3).

Cheers,
Nick.

Victor Stinner

9:13 a.m.

2014-03-18 9:08 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:

...

Please don't cherry-pick this change in Fedora if it is not done in Python 3.4. It changes the behaviour of Python and I would prefer to have the same behaviour on the same Python version on all platforms.

I'm not against backporting the change in Python 3.4.1. It can be seen as a bugfix. I don't think that anyone wants a Unicode error when reading or printing non-ASCII data from stdin/to stdout. But I would like the opinion of other developers before doing that.

Victor

Nick Coghlan

9:48 a.m.

On 18 March 2014 19:13, Victor Stinner <victor.stinner@gmail.com> wrote:

...

Well, the concern has always been the risk of silently generating bad data if there is a mismatch between the OS encoding and the stream encodings. That's why it took so long to make this change at all - we had to figure out that the underlying problem was really the ease with which even a properly configured Linux system could end up running Python 3 code in the POSIX locale, and thus end up with improperly configured standard streams.

Enabling "surrogateescape" by default only when the standard stream encoding is "ascii" helps to mitigate that risk, while still dealing with the main problem.

I meant to try to get this into 3.4 (since a couple of the Fedora folks convinced me it was a problem), but there are only so many hours in the day, and it took me quite a while to fully grasp the actual problem.

If folks are open to backporting this change to 3.4.1, then yes, I'd definitely prefer an upstream solution. Otherwise, it will be up to the Fedora Python maintainers to decide what they want to do.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
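The mitigation described here (relax the error handler only when the stream encoding is ASCII) can be sketched as below. This is an illustration of the rule as stated in this message, not the code from the actual patch; `pick_stream_errors` is a hypothetical helper name.

```python
import codecs


def pick_stream_errors(encoding: str) -> str:
    """Hypothetical helper: choose an error handler for a standard stream."""
    # codecs.lookup() resolves aliases such as "US-ASCII" or
    # "ANSI_X3.4-1968" to the canonical codec name "ascii".
    if codecs.lookup(encoding).name == "ascii":
        # ASCII streams usually mean the POSIX locale, so pass
        # undecodable bytes through instead of failing.
        return "surrogateescape"
    # Any other encoding was presumably configured on purpose: stay strict
    # so that encoding mismatches are reported rather than hidden.
    return "strict"


for name in ("ascii", "US-ASCII", "ANSI_X3.4-1968", "utf-8", "latin-1"):
    print(name, "->", pick_stream_errors(name))
```

Keeping "strict" for every non-ASCII encoding is what limits the risk of silently writing bad data to correctly configured streams.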

Victor Stinner

10:13 a.m.

2014-03-18 10:48 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:

...

Data can be loaded from OS functions, from files and from stdin. These 3 sources may use various different and incompatible encodings. surrogateescape is used by OS functions, and now also by stdin when the POSIX locale is used.

When the POSIX locale is used, OS functions and stdin can use different encodings if the PYTHONIOENCODING environment variable is used. Since we are consenting adults, I guess that you understand what you are doing when you set PYTHONIOENCODING.

On Windows, the encoding of standard streams is the OEM code page, or the ANSI code page if a stream is redirected; it's unrelated to the LC_CTYPE locale. So surrogateescape can be used even if the encoding of standard streams is not ASCII. We may handle Windows differently to use strict even if the LC_CTYPE locale is "C".

Note: On FreeBSD, Solaris and OpenIndiana, nl_langinfo(CODESET) announces an alias of the ASCII encoding when the LC_CTYPE locale is POSIX, whereas mbstowcs() and wcstombs() functions use the ISO-8859-1 encoding. Python 3 now uses the ASCII encoding for its "filesystem" (OS) encoding.

Victor
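The effect of PYTHONIOENCODING on the standard streams can be checked from a child process. This is a small sketch, assuming `sys.executable` points at a Python 3.7 or newer interpreter; the value `utf-8:surrogateescape` is just an example of the documented `encoding:errors` form.

```python
import os
import subprocess
import sys

# Override the stream encoding and error handler for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf-8:surrogateescape")

# The child reports how its stdout was configured.
child_code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)

# Two whitespace-separated fields: encoding name, then error handler.
reported = result.stdout.split()
print(reported)
```

With the variable set, the stream settings no longer follow the locale, which is how OS functions and stdin can end up using different encodings.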

Atsuo Ishimoto

10:02 a.m.

Hello, 2014-03-18 18:13 GMT+09:00 Victor Stinner <victor.stinner@gmail.com>:

...

FYI: Guido was opposed to changing the error handler of stdin and stdout years ago. http://bugs.python.org/issue2630#msg65493

...

-- Atsuo Ishimoto Mail: ishimoto@gembook.org Twitter: atsuoishimoto

Victor Stinner

10:15 a.m.

2014-03-18 11:02 GMT+01:00 Atsuo Ishimoto <ishimoto@gembook.org>:

...

FYI: Guido was opposed to changing the error handler of stdin and stdout years ago.

http://bugs.python.org/issue2630#msg65493

That issue proposes using the "backslashreplace" error handler for stdout. That error handler is very different from "surrogateescape", which is related to PEP 383 and used by all OS functions.

Victor

Stephen J. Turnbull

4:35 a.m.

New subject: Python 3.5 now uses surrogateescape for the POSIX locale

Victor Stinner writes:

...

I would say, it's not different in the relevant aspect, which is spewing presumably unreadable bytes that may cause other code reading the output to choke. I would think backslashreplace would generally be preferred for these use cases.
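The two handlers under discussion can be compared side by side. This is a minimal sketch, assuming an ASCII output stream and a name containing one escaped byte; the variable names are illustrative.

```python
# A str carrying the undecodable byte 0xE9 as a lone surrogate (PEP 383).
name = b"abc\xe9".decode("ascii", errors="surrogateescape")

# surrogateescape writes the original raw byte back out.
as_bytes = name.encode("ascii", errors="surrogateescape")
print(as_bytes)   # b'abc\xe9'

# backslashreplace writes a readable, pure-ASCII escape sequence instead.
as_escape = name.encode("ascii", errors="backslashreplace")
print(as_escape)  # b'abc\\udce9'

# For a genuine non-ASCII character (U+00E9) the handlers differ again:
# surrogateescape only handles U+DC80..U+DCFF, so encoding still fails.
try:
    "abc\u00e9".encode("ascii", errors="surrogateescape")
    real_char_failed = False
except UnicodeEncodeError:
    real_char_failed = True
print(real_char_failed)

real_char_escaped = "abc\u00e9".encode("ascii", errors="backslashreplace")
print(real_char_escaped)  # b'abc\\xe9'
```

So surrogateescape emits raw non-ASCII bytes only for data that arrived as undecodable bytes, while backslashreplace always keeps the output within ASCII.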
