[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Mon Sep 5 14:10:10 EDT 2016

On 5 September 2016 at 18:38, Steve Dower <steve.dower at python.org> wrote:
>> Can you provide an example of how I'd rewrite the code that I quoted
>> previously to follow this advice? Note - this is not theoretical, I
>> expect to have to provide a PR to fix exactly this code should this
>> change go in. At the moment I can't find a way that doesn't impact the
>> (currently working and not expected to need any change) Unix version
>> of the code, most likely I'll have to add buffering of 4-byte reads
>> (which as you say is complex).
>
> The easiest way to follow it is to use "sys.stdin.buffer.read(1)" rather
> than "sys.stdin.buffer.raw.read(1)".

I may have got confused here. If I say sys.stdin.buffer.read(1),
having first checked via kbhit() that there's a character
available[1], then I will always get 1 byte returned, never the
"buffer too small to return a full character" error that you talk
about in the PEP? If so, then I don't understand when the error you
propose will be raised (unless your comment here is based on what you
say below that we'll now buffer and therefore the error is no longer
needed).

>
>> PS I'm not 100% sure that under POSIX read() will return partial UTF-8
>> byte sequences. I think it must, because otherwise a lot of code I've
>> seen would be broken, but if a POSIX expert can confirm or deny my
>> assumption, that would be great.
>
> I just tested, and yes it returns partial characters. That's a good reason
> to do the single character buffering ourselves. Shouldn't be too hard to
> deal with.

OK, cool. Again I'm slightly confused because isn't this what you said
before "severely complicates things" - or was that only for the raw
layer?

I was over-simplifying the issue for pyinvoke, which in practice is
complicated by not yet having completed the process disentangling
bytes/unicode handling. As a result, I was handwaving somewhat about
whether read() is called on a raw stream or a buffered stream - in
practice I'm sure I can manage as long as the buffered level still
handles single-byte reads.

One thing I did think of, though - if someone *is* working at the raw
IO level, they have to be prepared for the new "buffer too small to
return a full character" error. That's OK. But what if they request
reading 7 bytes, but the input consists of 6 character s that encode
to 1 byte in UTF-8, followed by a character that encodes to 2 bytes?
You can return 6 bytes, that's fine - but you'll presumably still need
to read the extra character before you can determine that it won't fit
- so you're still going to have to buffer to some degree, surely? I
guess this is implementation details, though - I'll try to find some
time to read the patch in order to understand this. It's not something
that matters in terms of the PEP anyway, it's an implementation
detail.

Cheers,
Paul

[1] Yes, I know that's not the best approach, but it's as good as we
get without adding rather too much scary Windows specific code. (The
irony of trying to get this right given how much low-level Unix code I
don't follow is already in there doesn't escape me :-()