[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

eryk sun eryksun at gmail.com
Mon Sep 5 15:34:33 EDT 2016


I have some suggestions. With ReadConsoleW, CPython can use the
pInputControl parameter to set a CtrlWakeup mask. This enables a
Unix-style Ctrl+D for ending a read without having to press Enter.
For example:

    >>> import ctypes, msvcrt
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> hStdIn = msvcrt.get_osfhandle(0)  # assumes stdin is a console
    >>> buf = ctypes.create_unicode_buffer(100)
    >>> pn = (ctypes.c_ulong * 1)()
    >>> CTRL_MASK = 1 << 4  # bit 4 wakes the read on Ctrl+D (0x04)
    >>> # CONSOLE_READCONSOLE_CONTROL as 4 ULONGs; nLength is 16
    >>> inctrl = (ctypes.c_ulong * 4)(16, 0, CTRL_MASK, 0)
    >>> _ = kernel32.ReadConsoleW(hStdIn, buf, 100, pn, inctrl); print()
    spam
    >>> buf.value
    'spam\x04'
    >>> pn[0]
    5

read() would have to manually replace '\x04' with NUL. Ctrl+Z can also
be added to the mask:

    >>> CTRL_MASK = 1 << 4 | 1 << 26  # Ctrl+D (0x04) and Ctrl+Z (0x1a)
    >>> inctrl = (ctypes.c_ulong * 4)(16, 0, CTRL_MASK, 0)
    >>> _ = kernel32.ReadConsoleW(hStdIn, buf, 100, pn, inctrl); print()
    spam
    >>> buf.value
    'spam\x1a'

I'd like a method to query, set and unset
ENABLE_VIRTUAL_TERMINAL_PROCESSING mode for the screen buffer
(sys.stdout and sys.stderr) without having to use ctypes. The console
in Windows 10 has built-in VT100 emulation, but it's initially
disabled. The cmd shell enables it, but Python scripts aren't always
run from cmd.exe. Sometimes they're run in a new console from Explorer
or via "start", etc. For example, IPython could check for this to
provide more bells and whistles when PyReadline isn't installed.
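
Until then it takes ctypes. A minimal sketch of querying and setting
the flag (the enable_vt_mode name is just for illustration):

    import ctypes
    import msvcrt

    ENABLE_VIRTUAL_TERMINAL_PROCESSING = 0x0004

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetConsoleMode.argtypes = (ctypes.c_void_p,
                                        ctypes.POINTER(ctypes.c_ulong))
    kernel32.SetConsoleMode.argtypes = (ctypes.c_void_p, ctypes.c_ulong)

    def enable_vt_mode(fd=1):
        """Enable VT100 emulation on the screen buffer of fd."""
        handle = msvcrt.get_osfhandle(fd)  # assumes fd is a console
        mode = ctypes.c_ulong()
        if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
            raise ctypes.WinError(ctypes.get_last_error())
        if not kernel32.SetConsoleMode(
                handle, mode.value | ENABLE_VIRTUAL_TERMINAL_PROCESSING):
            raise ctypes.WinError(ctypes.get_last_error())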

Finally, functions such as WriteConsoleInputW and
ReadConsoleOutputCharacter require opening CONIN$ or CONOUT$ with
GENERIC_READ | GENERIC_WRITE access. The initial handles given to a
console process have read-write access. For opening a new handle by
device name, WindowsConsoleIO should first try GENERIC_READ |
GENERIC_WRITE -- with a fallback to either GENERIC_READ or
GENERIC_WRITE. The fallback is necessary for CON, which uses the
desired access to determine whether to open the input buffer or screen
buffer.
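
A sketch of that open-with-fallback logic in ctypes (the open_console
helper is only for illustration):

    import ctypes

    GENERIC_READ = 0x80000000
    GENERIC_WRITE = 0x40000000
    FILE_SHARE_READ = 0x00000001
    FILE_SHARE_WRITE = 0x00000002
    OPEN_EXISTING = 3
    INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.CreateFileW.restype = ctypes.c_void_p
    kernel32.CreateFileW.argtypes = (
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_ulong, ctypes.c_void_p,
        ctypes.c_ulong, ctypes.c_ulong, ctypes.c_void_p)

    def open_console(name, fallback_access):
        # Try read-write first; fall back to the single access right,
        # which for CON also selects the input or screen buffer.
        for access in (GENERIC_READ | GENERIC_WRITE, fallback_access):
            handle = kernel32.CreateFileW(
                name, access, FILE_SHARE_READ | FILE_SHARE_WRITE,
                None, OPEN_EXISTING, 0, None)
            if handle != INVALID_HANDLE_VALUE:
                return handle
        raise ctypes.WinError(ctypes.get_last_error())

    hConIn = open_console('CONIN$', GENERIC_READ)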

---

Paul, do you have example code that uses the 'raw' stream? Using the
buffer should behave as it always has -- at least in this regard.
sys.stdin.buffer requests a large block, such as 8 KB. But since the
console defaults to a cooked mode (i.e. processed input and line input
-- control keys, command-line editing, input history, and aliases),
ReadConsole returns when Enter is pressed or when interrupted. It
returns at least '\r\n', unless interrupted by Ctrl+C, Ctrl+Break, or
a custom CtrlWakeup key. However, if line-input mode is disabled,
ReadConsole returns as soon as one or more characters are available in
the input buffer.
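
For reference, here's roughly how line-input mode gets disabled with
ctypes (a sketch; echo mode is only valid together with line-input
mode, so both are cleared):

    import ctypes
    import msvcrt

    ENABLE_LINE_INPUT = 0x0002
    ENABLE_ECHO_INPUT = 0x0004

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    hStdIn = msvcrt.get_osfhandle(0)  # assumes stdin is a console

    mode = ctypes.c_ulong()
    kernel32.GetConsoleMode(hStdIn, ctypes.byref(mode))
    # ReadConsole now returns as soon as characters are available.
    kernel32.SetConsoleMode(hStdIn, mode.value &
                            ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT))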

As for kbhit() returning true, this does not mean that a read(1) from
console input won't block, unless line-input mode is disabled. It does
mean that getwch() won't block (note the "w" in there; this one reads
Unicode characters). The CRT's conio functions (e.g. kbhit, getwch)
put the console input buffer in a raw mode (e.g. ^C is read as '\x03'
instead of generating a CTRL_C_EVENT) and call the lower-level
functions PeekConsoleInputW (kbhit) and ReadConsoleInputW (getwch) to
peek at and read input event records.
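
For example, with the stdlib wrappers:

    import time
    import msvcrt

    # kbhit() peeks at pending key events (PeekConsoleInputW); once it
    # reports one, getwch() (ReadConsoleInputW) won't block.
    while not msvcrt.kbhit():
        time.sleep(0.01)
    ch = msvcrt.getwch()  # one Unicode character; no echo, no Enter needed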

---

Splitting surrogate pairs across reads is a problem. Granted, this
should rarely be an issue given the size of the reads that the buffer
requests and the typical line length. In most cases the buffer
consumes the entire line in one read. But in principle the
raw stream shouldn't replace split surrogates with the U+FFFD
replacement character. For example, with Steve's patch from issue
1602:

    >>> _ = write_console_input('\U00010000\r\n');\
    ... b1 = raw_read(4); b2 = raw_read(4); b3 = raw_read(8)
    𐀀
    >>> b1, b2
    (b'\xef\xbf\xbd', b'\xef\xbf\xbd')
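
In principle, if the raw layer encoded each half with "surrogatepass"
instead of replacing it, the halves could be reassembled downstream. A
sketch of that round trip:

    # Halves of U+10000 as they might arrive in two separate raw reads.
    b1 = '\ud800'.encode('utf-8', 'surrogatepass')   # b'\xed\xa0\x80'
    b2 = '\udc00'.encode('utf-8', 'surrogatepass')   # b'\xed\xb0\x80'
    text = (b1 + b2).decode('utf-8', 'surrogatepass')
    # Rejoining the halves through UTF-16 restores the original character.
    assert (text.encode('utf-16-le', 'surrogatepass')
                .decode('utf-16-le')) == '\U00010000'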

Splitting UTF-8 sequences across writes is more common. Currently a
raw write doesn't handle this correctly:

    >>> b = 'eggs \U00010000 spam\n'.encode('utf-8')
    >>> _ = raw_write(b[:6]); _ = raw_write(b[6:])
    eggs ���� spam
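
One way to handle this is to hold back an incomplete trailing sequence
until the next write, which Python's incremental UTF-8 decoder already
does. A minimal sketch, where write_utf16 stands in for whatever hands
decoded text to WriteConsoleW (a hypothetical callable, not part of
the patch):

    import codecs

    class BufferedRawWriter:
        """Carry incomplete UTF-8 sequences over to the next write()."""

        def __init__(self, write_utf16):
            # 'replace' only kicks in for bytes that can never become a
            # valid sequence; a trailing partial sequence is held back.
            self._decoder = codecs.getincrementaldecoder('utf-8')('replace')
            self._write_utf16 = write_utf16

        def write(self, b):
            text = self._decoder.decode(b)
            if text:
                self._write_utf16(text)
            return len(b)

With something like this in the raw layer, the split write above comes
out as "eggs 𐀀 spam".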

Also, the console is UCS-2, so its text may contain lone surrogates,
which can't be transcoded losslessly between UTF-16 and UTF-8.
Supporting UCS-2 in the console would integrate nicely with the
filesystem PEP. It would make it always possible to print
os.listdir('.'), copy and paste, and read it back without data loss.

It would probably be simpler to use UTF-16 in the main pipeline and
implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16
buffer could be renamed "wbuffer", for expert use. However, if
you're fully committed to transcoding in the raw layer, I'm certain
that these problems can be addressed with small buffers and Python's
codec machinery, using a flexible mix of "surrogatepass" and
"replace" error handling.

