
On Fri, Aug 12, 2016 at 2:20 PM, Random832 <random832@fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> - force the console encoding to UTF-8 on initialize and revert on
>> finalize
>>
>> So what are your concerns? Suggestions?
>
> As far as I know, the single biggest problem caused by the status quo
> for console encoding is "some string containing characters not in the
> console codepage is printed out; unhandled UnicodeEncodeError". Is
> there any particular reason not to use errors='replace'?
If that's all you want, you can set PYTHONIOENCODING=:replace. Prepare to be inundated with question marks.
Python's 'cp*' encodings are cross-platform, so they don't call Windows NLS APIs. If you want a best-fit encoding, then 'mbcs' is the only choice. Use chcp.com to switch to your system's ANSI codepage and set PYTHONIOENCODING=mbcs:replace.
An 'oem' encoding could be added, but I'm no fan of these best-fit encodings. Writing question marks at least hints that the output is wrong.
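To make the question-mark flood concrete, here's what errors='replace' does with one of the cross-platform 'cp*' codecs (no Windows NLS calls, so no best-fit fallback; characters outside the codepage simply become '?'):

```python
# cp437 has no Latin letters with macrons, so every character
# in this string is replaced by a question mark.
text = 'ĀĒĪŌŪ'
encoded = text.encode('cp437', errors='replace')
print(encoded)  # b'?????'
```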
> Is there any particular reason for the REPL, when printing the repr of
> a returned object, not to replace characters not in the stdout
> encoding with backslash sequences?
sys.displayhook already does this. It falls back on sys_displayhook_unencodable if printing the repr raises a UnicodeEncodeError.
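A sketch of the effect of that fallback (not CPython's actual C implementation): instead of the strict encode raising UnicodeEncodeError, unencodable characters come out as backslash escapes.

```python
text = 'Ā is U+0100'

# A strict encode to a codepage that lacks the character fails:
try:
    text.encode('cp437')
except UnicodeEncodeError as e:
    print('strict encode failed:', e.reason)

# The displayhook-style fallback escapes what it can't encode:
print(text.encode('cp437', errors='backslashreplace'))  # b'\\u0100 is U+0100'
```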
> Does Python provide any mechanism to access the built-in "best fit"
> mappings for windows codepages (which mostly consist of removing
> accents from latin letters)?
As mentioned above, for output this is only available with 'mbcs'. For reading input via ReadFile or ReadConsoleA (and thus also C _read, fread, and fgets), the console already encodes its UTF-16 input buffer using a best-fit encoding to the input codepage. So there's no error in the following example, even though the result is wrong:
    >>> sys.stdin.encoding
    'cp437'
    >>> s = 'Ā'
    >>> s, ord(s)
    ('A', 65)
Jumping back to the codepage 65001 discussion, here's a function to simulate the bad output that Windows Vista and 7 users see:
    def write(text):
        writes = []
        buffer = text.replace('\n', '\r\n').encode('utf-8')
        while buffer:
            decoded = buffer.decode('utf-8', 'replace')
            buffer = buffer[len(decoded):]
            writes.append(decoded.replace('\r', '\n'))
        return ''.join(writes)
For example:
    >>> greek = 'αβγδεζηθι\n'
    >>> write(greek)
    'αβγδεζηθι\n\n�ηθι\n\n�\n\n'
It gets worse with characters that require 3 bytes in UTF-8:
    >>> devanagari = 'ऄअआइईउऊऋऌ\n'
    >>> write(devanagari)
    'ऄअआइईउऊऋऌ\n\n�ईउऊऋऌ\n\n��ऋऌ\n\n��\n\n'
This problem doesn't exist in Windows 8+ because the old LPC-based communication with the console (LPC is an undocumented protocol that's used extensively for IPC between Windows subsystems) was rewritten to use a kernel driver (condrv.sys). Now the console works like any other device, by calling NtReadFile, NtWriteFile, and NtDeviceIoControlFile. Apparently in the rewrite someone fixed the conhost code that handles WriteFile and WriteConsoleA, which was incorrectly returning the number of UTF-16 codes written instead of the number of bytes.
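The arithmetic behind that bug: for non-ASCII text the UTF-8 byte count exceeds the UTF-16 code count, so a caller that asked to write N bytes but is told "M codes were written" (M < N) concludes the write was short and resends the tail, starting mid-character. A minimal illustration, just counting (this is what the write() simulation above relies on):

```python
line = 'αβγδεζηθι\r\n'
utf16_codes = len(line)                  # what the buggy WriteFile reported
utf8_bytes = len(line.encode('utf-8'))   # what the CRT actually passed in
print(utf16_codes, utf8_bytes)           # 11 20
# The caller believes only 11 of its 20 bytes went out and resends
# buffer[11:], which begins in the middle of a UTF-8 sequence, so the
# console decodes garbage on the retry.
```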
Unfortunately the rewrite also broke Ctrl+C handling because ReadFile no longer sets the last error to ERROR_OPERATION_ABORTED when a console read is interrupted by Ctrl+C. I'm surprised so few Windows users have noticed or cared that Ctrl+C kills the REPL and misbehaves with input() in the Windows 8/10 console. The source of the Ctrl+C bug is an incorrect NTSTATUS code STATUS_ALERTED, which should be STATUS_CANCELLED. The console has always done this wrong, but before the rewrite there was common code for ReadFile and ReadConsole that handled STATUS_ALERTED specially. It's still there in ReadConsole, so Ctrl+C handling works fine in Unicode programs that use ReadConsoleW (e.g. cmd.exe, powershell.exe). It also works fine if win_unicode_console is enabled.
Finally, here's a ctypes example in Windows 10.0.10586 that shows the unsolvable problem with non-ASCII input when using codepage 65001:
    import ctypes, msvcrt
    conin = open(r'\\.\CONIN$', 'r+')
    hConin = msvcrt.get_osfhandle(conin.fileno())
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    nread = (ctypes.c_uint * 1)()
ASCII-only input works:
    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spam
    1
    >>> nread[0], buf.value
    (6, b'spam\r\n')
But it returns EOF if "a" is replaced by Greek "α":
    >>> buf = (ctypes.c_char * 100)()
    >>> kernel32.ReadFile(hConin, buf, 100, nread, None)
    spαm
    1
    >>> nread[0], buf.value
    (0, b'')
Notice that the read is successful (ReadFile returns 1) but nread is 0, which signifies EOF. So the REPL will just silently quit as if you had entered Ctrl+Z, and input() will raise EOFError. This can't be worked around. The problem is in conhost.exe, which assumes that a request for N bytes wants N UTF-16 codes from the input buffer. That assumption can only hold for ASCII in UTF-8.