[Python-Dev] PEP 528: Change Windows console encoding to UTF-8
Steve Dower
steve.dower at python.org
Mon Sep 5 17:45:13 EDT 2016
On 05Sep2016 1308, Paul Moore wrote:
> On 5 September 2016 at 20:30, Steve Dower <steve.dower at python.org> wrote:
>> The only case we can reasonably handle at the raw layer is "n / 4" is zero
>> but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t)
>> and then return those in future calls. If we try to cache any more than that
>> we're substituting for buffered reader, which I don't want to do.
>>
>> Does caching up to one (Unicode) character at a time sound reasonable? I
>> think that won't be much trouble, since there's no interference between
>> system calls in that case and it will be consistent with POSIX behaviour.
>
> Caching a single character sounds perfectly OK. As I noted previously,
> my use case probably won't need to work at the raw level anyway, so I
> no longer expect to have code that will break, but I think that a
> 1-character buffer ensuring that we avoid surprises for code that was
> written for POSIX is a good trade-off.

So the one-character cache works, though the behaviour is a little strange
when you call raw.read() from the interactive prompt:
>>> sys.stdin.buffer.raw.read(1)
ɒprint('hi')
b'\xc9'
>>> hi
>>> sys.stdin.buffer.raw.read(1)
b'\x92'
>>>

What happens here is that raw.read(1) rounds the one-byte request up to one
character, reads the turned alpha (U+0252), returns the first byte of its
two-byte UTF-8 encoding and caches the second byte. Interactive mode then
reads from stdin, gets the rest of the characters starting from the
print(), and executes them. Finally, the next call to raw.read(1) returns
the cached second byte of the turned alpha.
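
For anyone who wants to poke at the sequence above without a Windows console, here is a rough pure-Python sketch. It is only a model under stated assumptions, not the actual implementation: FakeConsole and CachingRawReader are made-up names, the "console" is just a list of characters, and returning only the leftover bytes on a later read is my simplification.

```python
class FakeConsole:
    """Hypothetical stand-in for the console: a queue of typed characters."""

    def __init__(self, typed):
        self._chars = list(typed)

    def read_chars(self, count):
        # Hand out up to `count` whole characters.
        taken, self._chars = self._chars[:count], self._chars[count:]
        return "".join(taken)

    def read_line(self):
        # Models the separate readline path, which reads the console
        # directly and knows nothing about the raw reader's cache.
        text = "".join(self._chars)
        head, sep, tail = text.partition("\n")
        self._chars = list(tail)
        return head + sep


class CachingRawReader:
    """Hypothetical raw reader that caches at most one encoded character."""

    def __init__(self, console):
        self._console = console
        self._leftover = b""

    def read(self, n):
        if n <= 0:
            return b""
        if self._leftover:
            # Assumption: serve cached bytes first, as a short read.
            out, self._leftover = self._leftover[:n], self._leftover[n:]
            return out
        # A character may need up to 4 UTF-8 bytes, so n bytes are only
        # guaranteed to hold n // 4 characters. A request smaller than 4
        # is rounded up to one character.
        chars = max(n // 4, 1)
        encoded = self._console.read_chars(chars).encode("utf-8")
        out, self._leftover = encoded[:n], encoded[n:]
        return out


console = FakeConsole("\u0252print('hi')\n")  # turned alpha, then a statement
raw = CachingRawReader(console)

first = raw.read(1)         # b'\xc9', with b'\x92' left in the cache
line = console.read_line()  # "print('hi')\n", bypassing the cache
second = raw.read(1)        # b'\x92', the cached second byte

print(first, repr(line), second)
```

In this model, a read(n) with n >= 4 never leaves anything in the cache, since n // 4 characters always fit in n bytes.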

This is only a problem because the readline implementation is totally
separate from the stdin object and doesn't know about the small cache. For
now, I think it's going to stay that way - merging readline and stdin would
be great, but it is a fairly significant task that won't make 3.6 at this
stage.

I feel like this is an acceptable edge case, as it will only show up when
interleaving calls to raw.read(n) for n < 4 on multibyte characters with
input() or the interactive prompt. We've gone from 99% compatible to 99.99%
compatible, and I feel like going any further is practically certain to
introduce bugs (I'm being very careful with the single-character
buffering, but even that feels risky). Hopefully others agree with my
risk assessment here, but speak up if you think it's worthwhile trying
to deal with this final case.
Cheers,
Steve