Re: [Python-Dev] PEP 528: Change Windows console encoding to UTF-8
On 5 September 2016 at 14:36, Steve Dower
The best fix is to use a buffered reader, which will read all the available bytes and then let you .read(1), even if it happens to be an incomplete character.
But this is sys.stdin.buffer.raw we're talking about. People can't really layer anything on top of that; the problem here is precisely that they are trying to *bypass* the existing layering, because it doesn't work the way they need it to (it blocks).
We could theoretically add buffering to the raw reader to handle one character, which would allow very small reads from raw, but that severely complicates things and the advice to use a buffered reader is good advice anyway.
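To illustrate the advice above, here is a minimal sketch (not the console classes themselves) using io.BufferedReader over an in-memory raw stream; the buffered layer happily serves .read(1) even when that byte is only half of a multi-byte UTF-8 character:

```python
import io

# A BufferedReader over a raw stream lets you .read(1) byte-at-a-time even
# when the underlying raw read produces a whole multi-byte UTF-8 character.
raw = io.BytesIO("\u0252".encode("utf-8"))  # U+0252 (turned alpha) is b'\xc9\x92'
buffered = io.BufferedReader(raw)

first = buffered.read(1)   # b'\xc9' - half of the character, no error
second = buffered.read(1)  # b'\x92' - the rest comes from the buffer
print(first, second)
```

The buffering, not the raw stream, is what makes sub-character reads safe.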
Can you provide an example of how I'd rewrite the code that I quoted previously to follow this advice? Note - this is not theoretical; I expect to have to provide a PR to fix exactly this code should this change go in. At the moment I can't find a way that doesn't impact the (currently working, and not expected to need any change) Unix version of the code; most likely I'll have to add buffering of 4-byte reads (which, as you say, is complex).

The problem I have is that we're forcing application code to do the buffering to cater for Windows (where you're proposing that the raw IO layer doesn't handle it, and will potentially fail reads of <4 bytes). Code written for POSIX doesn't need to do that, and the additional maintenance overhead is potentially large enough to put POSIX developers off adding the necessary code - this is in direct contrast to the proposal to make the filesystem encoding UTF-8 so that POSIX-compatible code "just works" on Windows.

If the goals are to handle Unicode correctly for stdin, and to work in a way that POSIX-compatible code works without special effort on Windows, then as far as I can see we have to handle the buffering of partial reads of UTF-8 code sequences (because POSIX does so). If, on the other hand, we just want Unicode to work on Windows, and we're not looking for POSIX code to work without change, then the proposed behaviour is OK (although I still maintain it needs to be flagged, as it's very close to being a compatibility break in practice, even if it's technically within the rules).

Paul

PS I'm not 100% sure that under POSIX read() will return partial UTF-8 byte sequences. I think it must, because otherwise a lot of code I've seen would be broken, but if a POSIX expert can confirm or deny my assumption, that would be great.
On 05Sep2016 0941, Paul Moore wrote:
On 5 September 2016 at 14:36, Steve Dower wrote:
The best fix is to use a buffered reader, which will read all the available bytes and then let you .read(1), even if it happens to be an incomplete character.
But this is sys.stdin.buffer.raw we're talking about. People can't really layer anything on top of that; the problem here is precisely that they are trying to *bypass* the existing layering, because it doesn't work the way they need it to (it blocks).
This layer also blocks, and always has. You need to go to platform specific functions anyway to get non-blocking functionality (which is also wrapped up in getc I believe, but that isn't used by FileIO or the new WinConsoleIO classes).
We could theoretically add buffering to the raw reader to handle one character, which would allow very small reads from raw, but that severely complicates things and the advice to use a buffered reader is good advice anyway.
Can you provide an example of how I'd rewrite the code that I quoted previously to follow this advice? Note - this is not theoretical; I expect to have to provide a PR to fix exactly this code should this change go in. At the moment I can't find a way that doesn't impact the (currently working, and not expected to need any change) Unix version of the code; most likely I'll have to add buffering of 4-byte reads (which, as you say, is complex).
The easiest way to follow it is to use "sys.stdin.buffer.read(1)" rather than "sys.stdin.buffer.raw.read(1)".
PS I'm not 100% sure that under POSIX read() will return partial UTF-8 byte sequences. I think it must, because otherwise a lot of code I've seen would be broken, but if a POSIX expert can confirm or deny my assumption, that would be great.
I just tested, and yes, it returns partial characters. That's a good reason to do the single-character buffering ourselves. Shouldn't be too hard to deal with.

Cheers,
Steve
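The POSIX behaviour being confirmed here is easy to reproduce. A minimal sketch, using a pipe rather than a real terminal (an assumption, but the kernel's read() treats both as plain byte streams):

```python
import os

# On POSIX, read() happily returns partial UTF-8 sequences: it deals in
# bytes, not characters.
r, w = os.pipe()
os.write(w, "\u0252".encode("utf-8"))  # two bytes: b'\xc9\x92'

part = os.read(r, 1)  # just b'\xc9' - half a character
rest = os.read(r, 1)  # b'\x92' - the other half
os.close(r)
os.close(w)
print(part, rest)
```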
On 5 September 2016 at 18:38, Steve Dower
Can you provide an example of how I'd rewrite the code that I quoted previously to follow this advice? Note - this is not theoretical; I expect to have to provide a PR to fix exactly this code should this change go in. At the moment I can't find a way that doesn't impact the (currently working, and not expected to need any change) Unix version of the code; most likely I'll have to add buffering of 4-byte reads (which, as you say, is complex).
The easiest way to follow it is to use "sys.stdin.buffer.read(1)" rather than "sys.stdin.buffer.raw.read(1)".
I may have got confused here. If I say sys.stdin.buffer.read(1), having first checked via kbhit() that there's a character available[1], then I will always get 1 byte returned, never the "buffer too small to return a full character" error that you talk about in the PEP? If so, then I don't understand when the error you propose will be raised (unless your comment here is based on what you say below that we'll now buffer and therefore the error is no longer needed).
PS I'm not 100% sure that under POSIX read() will return partial UTF-8 byte sequences. I think it must, because otherwise a lot of code I've seen would be broken, but if a POSIX expert can confirm or deny my assumption, that would be great.
I just tested, and yes it returns partial characters. That's a good reason to do the single character buffering ourselves. Shouldn't be too hard to deal with.
OK, cool. Again I'm slightly confused, because isn't this what you said before "severely complicates things" - or was that only for the raw layer?

I was over-simplifying the issue for pyinvoke, which in practice is complicated by not yet having completed the process of disentangling bytes/unicode handling. As a result, I was handwaving somewhat about whether read() is called on a raw stream or a buffered stream - in practice I'm sure I can manage as long as the buffered level still handles single-byte reads.

One thing I did think of, though - if someone *is* working at the raw IO level, they have to be prepared for the new "buffer too small to return a full character" error. That's OK. But what if they request reading 7 bytes, but the input consists of 6 characters that encode to 1 byte each in UTF-8, followed by a character that encodes to 2 bytes? You can return 6 bytes, that's fine - but you'll presumably still need to read the extra character before you can determine that it won't fit - so you're still going to have to buffer to some degree, surely? I guess this is implementation detail, though - I'll try to find some time to read the patch in order to understand this. It's not something that matters in terms of the PEP anyway; it's an implementation detail.

Cheers,
Paul

[1] Yes, I know that's not the best approach, but it's as good as we get without adding rather too much scary Windows-specific code. (The irony of trying to get this right, given how much low-level Unix code I don't follow is already in there, doesn't escape me :-()
On 05Sep2016 1110, Paul Moore wrote:
On 5 September 2016 at 18:38, Steve Dower wrote:
Can you provide an example of how I'd rewrite the code that I quoted previously to follow this advice? Note - this is not theoretical; I expect to have to provide a PR to fix exactly this code should this change go in. At the moment I can't find a way that doesn't impact the (currently working, and not expected to need any change) Unix version of the code; most likely I'll have to add buffering of 4-byte reads (which, as you say, is complex).
The easiest way to follow it is to use "sys.stdin.buffer.read(1)" rather than "sys.stdin.buffer.raw.read(1)".
I may have got confused here. If I say sys.stdin.buffer.read(1), having first checked via kbhit() that there's a character available[1], then I will always get 1 byte returned, never the "buffer too small to return a full character" error that you talk about in the PEP? If so, then I don't understand when the error you propose will be raised (unless your comment here is based on what you say below that we'll now buffer and therefore the error is no longer needed).
I don't think using buffer.read and kbhit together is going to be reliable anyway, as you may not have read everything that's already buffered yet. It's likely feasible if you flush everything, but otherwise it's a bit messy.
One thing I did think of, though - if someone *is* working at the raw IO level, they have to be prepared for the new "buffer too small to return a full character" error. That's OK. But what if they request reading 7 bytes, but the input consists of 6 characters that encode to 1 byte each in UTF-8, followed by a character that encodes to 2 bytes? You can return 6 bytes, that's fine - but you'll presumably still need to read the extra character before you can determine that it won't fit - so you're still going to have to buffer to some degree, surely? I guess this is implementation detail, though - I'll try to find some time to read the patch in order to understand this. It's not something that matters in terms of the PEP anyway; it's an implementation detail.
If you do raw.read(7), we internally do "7 / 4" and decide to only read one wchar_t from the console. So the returned bytes will be between 1 and 4 bytes long, and there will be more info waiting for next time you ask.

The only case we can reasonably handle at the raw layer is when "n / 4" is zero but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t) and then return those in future calls. If we try to cache any more than that, we're substituting for the buffered reader, which I don't want to do.

Does caching up to one (Unicode) character at a time sound reasonable? I think that won't be much trouble, since there's no interference between system calls in that case, and it will be consistent with POSIX behaviour.

Cheers,
Steve
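The caching scheme described above can be sketched roughly as follows. This is an illustration only, not the actual WinConsoleIO implementation: the hypothetical read_wchars() callable stands in for ReadConsoleW, and Python str stands in for the wchar_t buffer.

```python
# Hypothetical sketch of the "cache at most one character" scheme for
# small raw reads (n < 4). Names here are illustrative, not CPython's.
class CachingRawReader:
    def __init__(self, read_wchars):
        self._read_wchars = read_wchars  # reads N console characters
        self._cache = b""                # at most one encoded character

    def read(self, n):
        # Serve any leftover bytes from the previous short read first.
        if self._cache:
            data, self._cache = self._cache[:n], self._cache[n:]
            return data
        # n // 4 characters always fit in n bytes; when n < 4, read one
        # character anyway and cache whatever doesn't fit.
        chars = self._read_wchars(max(n // 4, 1))
        data = chars.encode("utf-8")
        if len(data) > n:
            data, self._cache = data[:n], data[n:]
        return data

stream = iter("\u0252x")  # a turned alpha (2 UTF-8 bytes), then 'x'
reader = CachingRawReader(lambda n: "".join(next(stream) for _ in range(n)))
print(reader.read(1))  # b'\xc9' - first byte, second byte is cached
print(reader.read(1))  # b'\x92' - served from the cache
```

A read(1) never fails, and no more than one character is ever held back, which is the trade-off Steve describes.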
On 5 September 2016 at 20:30, Steve Dower
The only case we can reasonably handle at the raw layer is when "n / 4" is zero but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t) and then return those in future calls. If we try to cache any more than that, we're substituting for the buffered reader, which I don't want to do.
Does caching up to one (Unicode) character at a time sound reasonable? I think that won't be much trouble, since there's no interference between system calls in that case and it will be consistent with POSIX behaviour.
Caching a single character sounds perfectly OK. As I noted previously, my use case probably won't need to work at the raw level anyway, so I no longer expect to have code that will break, but I think that a 1-character buffer, ensuring that we avoid surprises for code that was written for POSIX, is a good trade-off.

Paul
On 05Sep2016 1308, Paul Moore wrote:
On 5 September 2016 at 20:30, Steve Dower wrote:
The only case we can reasonably handle at the raw layer is when "n / 4" is zero but n != 0, in which case we can read and cache up to 4 bytes (one wchar_t) and then return those in future calls. If we try to cache any more than that, we're substituting for the buffered reader, which I don't want to do.
Does caching up to one (Unicode) character at a time sound reasonable? I think that won't be much trouble, since there's no interference between system calls in that case and it will be consistent with POSIX behaviour.
Caching a single character sounds perfectly OK. As I noted previously, my use case probably won't need to work at the raw level anyway, so I no longer expect to have code that will break, but I think that a 1-character buffer ensuring that we avoid surprises for code that was written for POSIX is a good trade-off.
So it works, though the behaviour is a little strange when you do it from the interactive prompt:
>>> sys.stdin.buffer.raw.read(1)
ɒprint('hi')
b'\xc9'
hi
>>> sys.stdin.buffer.raw.read(1)
b'\x92'
What happens here is the raw.read(1) rounds one byte up to one character, reads the turned alpha, returns a single byte of the two-byte encoded form, and caches the second byte. Then interactive mode reads from stdin, gets the rest of the characters starting from the print(), and executes that. Finally, the next call to raw.read(1) returns the cached second byte of the turned alpha.

This is basically only a problem because the readline implementation is totally separate from the stdin object and doesn't know about the small cache (and for now, I think it's going to stay that way - merging readline and stdin would be great, but is a fairly significant task that won't make 3.6 at this stage).

I feel like this is an acceptable edge case, as it will only show up when interleaving calls to raw.read(n < 4) with multibyte characters and input()/interactive prompts. We've taken 99% compatibility to 99.99% compatibility, and I feel like going any further is practically certain to introduce bugs (I'm being very careful with the single-character buffering, but even that feels risky). Hopefully others agree with my risk assessment here, but speak up if you think it's worthwhile trying to deal with this final case.

Cheers,
Steve
On Mon, Sep 5, 2016 at 9:45 PM, Steve Dower
So it works, though the behaviour is a little strange when you do it from the interactive prompt:
>>> sys.stdin.buffer.raw.read(1)
ɒprint('hi')
b'\xc9'
hi
>>> sys.stdin.buffer.raw.read(1)
b'\x92'
What happens here is the raw.read(1) rounds one byte up to one character, reads the turned alpha, returns a single byte of the two-byte encoded form, and caches the second byte. Then interactive mode reads from stdin, gets the rest of the characters starting from the print(), and executes that. Finally, the next call to raw.read(1) returns the cached second byte of the turned alpha.
This is basically only a problem because the readline implementation is totally separate from the stdin object and doesn't know about the small cache (and for now, I think it's going to stay that way - merging readline and stdin would be great, but is a fairly significant task that won't make 3.6 at this stage).
It needs to read a minimum of 2 codes in case the first character is a lead surrogate. It can use a length-2 WCHAR buffer and remember how many bytes have been written (for the general case - not specifically for this case). Example failure using your 3rd patch:

>>> _ = write_console_input("\U00010000print('hi')\r\n");\
... raw_read(1)
𐀀print('hi')
b'\xef'
>>>   File "<stdin>", line 1
    �print('hi')
    ^
SyntaxError: invalid character in identifier
>>> raw_read(1)
b'\xbf'
>>> raw_read(1)
b'\xbd'

The raw read captures the first surrogate code and transcodes it as the replacement character b'\xef\xbf\xbd' (U+FFFD). Then PyOS_Readline captures the 2nd surrogate and decodes it as the replacement character.

In the general case in which a lead surrogate is the last code read, but not at index 0, it can use the internal buffer to save the code for the next call. Surrogates that aren't in valid pairs should be allowed to pass through via surrogatepass. This aims for consistency with the filesystem encoding PEP.
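The surrogate behaviour being argued for can be shown directly from Python's codecs. A small illustration (assuming Python's standard error-handler names; this is not the console patch itself):

```python
# A lead surrogate read on its own must not be decoded eagerly: decoded
# with errors='replace' it collapses to U+FFFD, orphaning the trail
# surrogate - the failure mode in the session above.
lead, trail = "\ud800", "\udc00"  # the UTF-16 pair for U+10000

# Kept together (the length-2 WCHAR buffer case), the pair decodes to the
# intended astral character:
pair_bytes = (lead + trail).encode("utf-16-le", "surrogatepass")
print(pair_bytes.decode("utf-16-le") == "\U00010000")  # True

# A lone surrogate passed through via surrogatepass survives a round trip
# instead of becoming the replacement character:
lone = lead.encode("utf-8", "surrogatepass")
print(lone)  # b'\xed\xa0\x80'
print(lone.decode("utf-8", "surrogatepass") == lead)  # True
```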
participants (3):
- eryk sun
- Paul Moore
- Steve Dower