[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

eryk sun eryksun at gmail.com
Mon Sep 5 17:40:32 EDT 2016


On Mon, Sep 5, 2016 at 7:54 PM, Steve Dower <steve.dower at python.org> wrote:
> On 05Sep2016 1234, eryk sun wrote:
>>
>> Also, the console is UCS-2, which can't be transcoded between UTF-16
>> and UTF-8. Supporting UCS-2 in the console would integrate nicely with
>> the filesystem PEP. It makes it always possible to print
>> os.listdir('.'), copy and paste, and read it back without data loss.
>
> Supporting UTF-8 actually works better for this. We already use
> surrogatepass explicitly (on the filesystem side, with PEP 529) and
> implicitly (on the console side, using the Windows conversion API).

CP_UTF8 requires valid UTF-16 text. MultiByteToWideChar and
WideCharToMultiByte are of no practical use here. For example:

    >>> raw_read = sys.stdin.buffer.raw.read
    >>> _ = write_console_input('\ud800\ud800\r\n'); raw_read(16)
    ��
    b'\xef\xbf\xbd\xef\xbf\xbd\r\n'

Handling this requires Python's "surrogatepass" error handler. It's
also required to decode UTF-8 that's potentially WTF-8, e.g. from
os.listdir(b'.'). Coming from the wild, there's a chance that
arbitrary bytes contain invalid sequences other than lone surrogates,
so the decoder needs to fall back on "replace" for errors that
"surrogatepass" doesn't handle.

> Writing a partial character is easily avoidable by the user. We can either
> fail with an error or print garbage, and currently printing garbage is the
> most compatible behaviour. (Also occurs on Linux - I have a VM running this
> week for testing this stuff.)

Are you sure about that? The internal screen buffer of a Linux
terminal is bytes; it doesn't transcode to a wide-character format. In
the Unix world, almost everything is "get a byte, get a byte, get a
byte, byte, byte". Here's what I see in Ubuntu using GNOME Terminal,
for example:

    >>> raw_write = sys.stdout.buffer.raw.write
    >>> b = 'αβψδε\n'.encode()
    >>> b
    b'\xce\xb1\xce\xb2\xcf\x88\xce\xb4\xce\xb5\n'
    >>> for c in b: _ = raw_write(bytes([c]))
    ...
    αβψδε

Here it is on Windows with your patch:

    >>> raw_write = sys.stdout.buffer.raw.write
    >>> b = 'αβψδε\n'.encode()
    >>> b
    b'\xce\xb1\xce\xb2\xcf\x88\xce\xb4\xce\xb5\n'
    >>> for c in b: _ = raw_write(bytes([c]))
    ...
    ����������

For the write case this can be addressed by identifying an incomplete
sequence at the tail end and either buffering it as 'written' or
rejecting it so the user/buffer can try again with the complete
sequence. Rejection isn't a good option when the incomplete sequence
starts at index 0, though; that case has to be buffered. I'd prefer
buffering in all cases.
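
For example (a rough sketch; this helper isn't in the patch), the raw
writer could work out how many bytes to hold back like this:

    def incomplete_utf8_tail(data):
        # Count trailing bytes that start an incomplete UTF-8
        # sequence; 0 if the data ends on a character boundary. An
        # incomplete tail is at most 3 bytes (a 4-byte sequence
        # missing its last byte).
        for i in range(1, min(len(data), 3) + 1):
            byte = data[-i]
            if byte < 0x80:                   # ASCII byte: complete
                return 0
            if byte >= 0xc0:                  # lead byte
                expected = (2 if byte < 0xe0 else
                            3 if byte < 0xf0 else 4)
                return i if expected > i else 0
            # 0x80-0xbf is a continuation byte; keep scanning back
        return 0

The writer would then transcode data[:len(data) - n], report the
held-back n bytes as 'written', and prepend them to the next write.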

>> It would probably be simpler to use UTF-16 in the main pipeline and
>> implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16
>> buffer could be renamed as "wbuffer", for expert use. However, if
>> you're fully committed to transcoding in the raw layer, I'm certain
>> that these problems can be addressed with small buffers and using
>> Python's codec machinery for a flexible mix of "surrogatepass" and
>> "replace" error handling.
>
> I don't think it actually makes things simpler. Having two buffers is
> generally a bad idea unless they are perfectly synced, which would be
> impossible here without data corruption (if you read half a utf-8 character
> sequence and then read the wide buffer, do you get that character or not?).

Martin's idea, as I understand it, is a UTF-8 buffer that reads from
and writes to the text wrapper. For reading, it necessarily consumes
at least one character and buffers it in order to allow reading per
byte. Likewise for writing, it buffers bytes until it can write a
complete character to the text wrapper. ISTM it has to look for an
incomplete lead/continuation byte sequence at the tail end and hold it
until the sequence is complete, at which point it either decodes to a
valid character or falls back to the U+FFFD replacement character.
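
Here's a minimal sketch of the write side of that (the class name is
made up, and it glosses over the "replace" fallback and over a tail
that splits an encoded surrogate):

    import codecs

    class UTF8ConsoleBuffer:
        # Buffers WTF-8-ish bytes and forwards only complete
        # characters to the underlying text stream (e.g. the
        # console's TextIOWrapper).
        def __init__(self, text_stream):
            self.text = text_stream
            self._decoder = codecs.getincrementaldecoder('utf-8')(
                                'surrogatepass')

        def write(self, data):
            data = bytes(data)
            # The incremental decoder keeps an incomplete tail
            # sequence buffered until a later write completes it.
            chars = self._decoder.decode(data)
            if chars:
                self.text.write(chars)
            return len(data)

With that, the byte-at-a-time write from the earlier example comes
out intact:

    >>> w = UTF8ConsoleBuffer(sys.stdout)
    >>> for c in 'αβψδε\n'.encode(): _ = w.write(bytes([c]))
    ...
    αβψδε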

Also, I found that read(n) has to read a character at a time. That's
the only way to emulate line-input mode, i.e. detect "\n" and stop
reading. Technically this is implemented in a RawIOBase, which
dictates that operations should use a single system call, but since
it's interfacing with a text wrapper around a buffer around the actual
UCS-2 raw console stream, any notion of a 'system call' would be a
sham.
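
Continuing the UTF8ConsoleBuffer sketch, the read side might look
roughly like this (the size accounting is simplified to characters,
and encoding back to bytes assumes "surrogatepass"):

        def read(self, size=-1):
            # Read one character at a time from the text wrapper so
            # the read can stop at '\n', emulating the console's
            # line-input mode; reading a larger chunk would keep
            # waiting for input past the end of the line.
            chars = []
            while size < 0 or len(chars) < size:
                c = self.text.read(1)
                if not c:                     # EOF
                    break
                chars.append(c)
                if c == '\n':
                    break
            return ''.join(chars).encode('utf-8', 'surrogatepass')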

Because of the UTF-8 buffering there is a synchronization issue, but
it has character granularity. For example, when decoding UTF-8, you
don't get half of a surrogate pair: you decode the full character and
write it as a discrete unit to the text wrapper. I'd have to
experiment to see how bad this can get. If it's too confusing, the
idea isn't practical.

On the plus side, when working with text it's all native UCS-2 up to
the TextIOWrapper, so it's as efficient and as simple as possible. You
don't have to worry about transcoding or about dealing with partial
surrogate pairs and partial UTF-8 sequences. All of that complexity is
exported to the pure-Python UTF-8 buffer mixin, and it's not as bad
there either, because the interface is Text <=> WTF-8 instead of
UCS-2 <=> WTF-8 and you don't have to limit yourself to a single read
or write. But that's detrimental for anyone using the buffer's raw
stream on the presumption that it makes only a single, thread-safe
system call.

