[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Mon Sep 5 17:45:13 EDT 2016

On 05Sep2016 1308, Paul Moore wrote:
> On 5 September 2016 at 20:30, Steve Dower <steve.dower at python.org> wrote:
>> The only case we can reasonably handle at the raw layer is "n / 4" is zero
>> but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t)
>> and then return those in future calls. If we try to cache any more than that
>> we're substituting for buffered reader, which I don't want to do.
>>
>> Does caching up to one (Unicode) character at a time sound reasonable? I
>> think that won't be much trouble, since there's no interference between
>> system calls in that case and it will be consistent with POSIX behaviour.
>
> Caching a single character sounds perfectly OK. As I noted previously,
> my use case probably won't need to work at the raw level anyway, so I
> no longer expect to have code that will break, but I think that a
> 1-character buffer ensuring that we avoid surprises for code that was
> written for POSIX is a good trade-off.

So it works, though the behaviour is a little strange when you do it 
from the interactive prompt:

 >>> sys.stdin.buffer.raw.read(1)
ɒprint('hi')
b'\xc9'
 >>> hi
 >>> sys.stdin.buffer.raw.read(1)
b'\x92'
 >>>

What happens here is that raw.read(1) rounds the one-byte request up to
one character, reads the turned alpha (U+0252, which UTF-8 encodes as
the two bytes C9 92), returns the first byte of that encoded form and
caches the second. Interactive mode then reads from stdin, gets the rest
of the characters starting from the print(), and executes that. Finally,
the next call to raw.read(1) returns the cached second byte of the
turned alpha.
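
To make the sequence concrete, here is a rough pure-Python model of the
caching behaviour. It is only a sketch, not the actual implementation:
the class name OneCharCacheReader and the method readline_chars() are
made up for illustration, a StringIO stands in for the console, and
readline_chars() plays the role of the separate readline path that
never looks at the cache.

```python
import io


class OneCharCacheReader:
    """Hypothetical model of the raw reader; not CPython's real code."""

    def __init__(self, text):
        # Stand-in for the console's stream of characters (an assumption).
        self._chars = io.StringIO(text)
        # Leftover encoded bytes of at most one character.
        self._cache = b""

    def read(self, n):
        if n <= 0:
            return b""
        # Hand out cached bytes before touching the "console" again.
        if self._cache:
            out, self._cache = self._cache[:n], self._cache[n:]
            return out
        # A UTF-8 character needs at most 4 bytes, so n // 4 characters
        # always fit in n bytes. If that is zero, round up to one character.
        nchars = n // 4 or 1
        data = self._chars.read(nchars).encode("utf-8")
        # Return what fits and cache the remainder (only possible if n < 4).
        out, self._cache = data[:n], data[n:]
        return out

    def readline_chars(self):
        """Model of the separate readline path, which bypasses the cache."""
        return self._chars.readline()


reader = OneCharCacheReader("ɒprint('hi')\n")
print(reader.read(1))            # first byte of the turned alpha
print(reader.readline_chars())   # the rest of the line, cache untouched
print(reader.read(1))            # cached second byte of the turned alpha
```

In this model the second read(1) returns b'\x92' only after the line
has already been consumed, which is the same ordering as in the session
above.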

This is basically only a problem because the readline implementation is 
totally separate from the stdin object and doesn't know about the small 
cache (and for now, I think it's going to stay that way - merging 
readline and stdin would be great, but is a fairly significant task that 
won't make 3.6 at this stage).

I feel like this is an acceptable edge case, as it will only show up 
when interleaving calls to raw.read(n < 4) with multibyte characters and 
input()/interactive prompts. We've gone from 99% compatible to 99.99%
compatible, and I feel like going any further is practically certain to
introduce bugs (I'm being very careful with the single character 
buffering, but even that feels risky). Hopefully others agree with my 
risk assessment here, but speak up if you think it's worthwhile trying 
to deal with this final case.

Cheers,
Steve