Adding an 'errors' argument to print

Recently, I was working on a Windows GUI application that ends up running ffmpeg, and I wanted to see the command that was being run. However, the file name had a Unicode character in it (it's a Sawano song), and when I tried to print it to the console, it crashed during the encode/decode. (The encoding used in cmd can't represent every Unicode character.)

The workaround was to do:

    print(mystring.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))

Not fun, especially since this was *just* a debug print.

The proposal: why not add an 'errors' argument to print? That way, I could've just done:

    print(mystring, errors='replace')

without having to worry about it crashing.

--
Ryan (ライアン)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://refi64.com

On 24 March 2017 at 15:41, Ryan Gonzalez <rymg19@gmail.com> wrote:
When I've hit issues like this before, I've written a helper function:

    def sanitise(str, enc):
        """Ensure that str can be encoded in encoding enc"""
        return str.encode(enc, errors='replace').decode(enc)

An errors argument to print would be very similar, but would only apply to the print function, whereas I've used my sanitise function in other situations as well.

I understand the attraction of a dedicated "just print the best representation you can" argument to print, but I'm not sure it's a common enough need to be worth adding like this.

Paul
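For illustration, here is the helper above in runnable form with a small usage check. The parameter is renamed to text so it does not shadow the built-in str; the sample file name and the "ascii" target encoding are assumptions picked for the demo, not details from the thread.

```python
def sanitise(text, enc):
    """Ensure that text can be encoded in encoding enc."""
    # Characters that enc cannot represent become '?' while encoding,
    # so the decoded result is always safe to write to a stream using enc.
    return text.encode(enc, errors="replace").decode(enc)


title = "Sawano \u6fa4\u91ce.mp3"  # hypothetical file name with two CJK characters
safe = sanitise(title, "ascii")
print(safe)
```

The round trip is lossy by design: each unencodable character is replaced, so the result is suitable for debug output but not for passing on as a real file name.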

*If* we change something, I would prefer to modify sys.stdout. The following issue proposes to add sys.stdout.set_encoding(errors='replace'):

http://bugs.python.org/issue15216

You can already set the PYTHONIOENCODING environment variable to ":replace" to use "replace" on sys.stdout (and sys.stderr).

Victor

2017-03-24 16:41 GMT+01:00 Ryan Gonzalez <rymg19@gmail.com>:

On Fri, Mar 24, 2017 at 9:37 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
I like that.
> You can already set the PYTHONIOENCODING environment variable to ":replace" to use "replace" on sys.stdout (and sys.stderr).

Great tip, I've needed this!

--
--Guido van Rossum (python.org/~guido)

On 24 March 2017 at 16:37, Victor Stinner <victor.stinner@gmail.com> wrote:
I thought I recalled seeing something like that discussed somewhere. I agree that this is a better approach (even though it's not as granular as being able to specify on an individual print statement).
> You can already set the PYTHONIOENCODING environment variable to ":replace" to use "replace" on sys.stdout (and sys.stderr).

That's something I didn't know. Thanks for the pointer!

Paul
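As a rough sketch of what the PYTHONIOENCODING tip changes: the value has the form encodingname:errorhandler, and ":replace" keeps the default encoding while switching the stream's error handler. The snippet below imitates that with an in-memory stream, so it does not depend on any particular console. The "ascii" encoding and the raw and console names are assumptions for the demo.

```python
import io

# Stand-in for a console whose encoding cannot represent every character.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="ascii", errors="replace", newline="\n")

# A plain print() call: the error handler lives on the stream, not on print.
print("caf\u00e9", file=console)
console.flush()

written = raw.getvalue()  # bytes that actually reached the "console"
```

With the handler set on the stream, every print call to it is covered, which is also why this approach is less granular than a per-call argument.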

On 24/03/2017 at 17:37, Victor Stinner wrote:
This is not the same. You may want to apply errors="replace" locally rather than to the whole program. Indeed, applying it everywhere can silence encoding problems. So I would probably never set it for the whole program at dev time; I would silence errors only in the few places where I know I can do so explicitly.

I quite like this print(errors="replace") or print(errors="ignore"). This is not going to cause any trouble, and can only help.
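To make the per-call idea concrete, here is a minimal sketch of a wrapper that behaves roughly like the proposed argument. print_with_errors is a hypothetical name, not an existing function, and falling back to UTF-8 when the stream reports no encoding is an assumption.

```python
import io
import sys


def print_with_errors(*args, errors="replace", file=None, flush=False, **kwargs):
    """Hypothetical stand-in for the proposed print(..., errors=...)."""
    stream = sys.stdout if file is None else file
    encoding = getattr(stream, "encoding", None) or "utf-8"  # assumed fallback

    # Let print() apply sep/end formatting into an in-memory string first.
    buffer = io.StringIO()
    print(*args, file=buffer, **kwargs)

    # Round-trip through the target encoding so that characters it cannot
    # represent are replaced or dropped before reaching the real stream.
    safe = buffer.getvalue().encode(encoding, errors=errors).decode(encoding)
    stream.write(safe)
    if flush:
        stream.flush()
```

A call such as print_with_errors(mystring, errors="replace") then affects only that one call, while every other print in the program still raises on encoding problems.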

On Mon, Mar 27, 2017 at 8:52 PM, Barry <barry@barrys-emacs.org> wrote:
conhost.exe hosts the console, and chcp.com is a console app that calls GetConsoleCP, SetConsoleCP and SetConsoleOutputCP to show or modify the console's input and output codepages. It doesn't support changing them separately. cmd.exe is just another console client, no different from python.exe or powershell.exe in this regard.

Also, it's unrelated to how Python uses the console, but for the record, cmd has used the console's wide-character API since it was ported from OS/2 in the early 90s. Back then the console was hosted using threads in the csrss.exe system process, which made sense because the windowing system was hosted there. When they moved most of the window manager to kernel mode in NT 4 (1996), the console was mostly left behind in csrss.exe. It wasn't until Windows 7 that it found a new home in conhost.exe. In Windows 8 it got a real device driver instead of using fake file handles. In Windows 10 it was updated to be less of a franken-window -- e.g. now it has line-wrapped selection and text reflowing.

Using codepage 65001 (UTF-8) in a console app has a couple of annoying bugs in the console itself, and another due to flushing of C FILE streams. For example, reading text that has even a single non-ASCII character will fail because conhost's encoding buffer is too small. It handles the error by returning a read of 0 bytes. That's EOF, so Python's REPL quits; input() raises EOFError; and stdin.read() returns an empty string. Microsoft should fix this in Windows 10, and probably will eventually. The Linux subsystem needs UTF-8, and it's silly that the console doesn't allow entering non-ASCII text in Linux programs.

As was already recommended, I suggest using the wide-character API via win_unicode_console in 2.7 and 3.5. In 3.6 we get the wide-character API automatically thanks to Steve Dower's io._WindowsConsoleIO class.

Participants (8):
- Barry
- eryk sun
- Guido van Rossum
- Michel Desmoulin
- Paul Moore
- Ryan Gonzalez
- Steven D'Aprano
- Victor Stinner