On Mon, Jan 25, 2021, 4:25 AM Steven D'Aprano <steve@pearwood.info> wrote:
> On Sun, Jan 24, 2021 at 10:43:54PM -0500, Matt Wozniski wrote:
> > And
> > `f.read(1)` needs to pick one of those and return it immediately. It can't
> > wait for more information. The contract of `read` is "Read from underlying
> > buffer until we have n characters or we hit EOF."
>
> In text mode, reads are always buffered:
>
>
> so `f.read(1)` will read as much as needed, so long as it only returns a
> single character.

Text mode files are always backed by a buffer, yes, but that's not relevant. My point is that `f.read(1)` must immediately return a character if one exists in the buffer. It can't wait for more data to get buffered if there is already a buffered character, as that would be a backwards-incompatible change that would badly break line-based protocols like FTP, SMTP, and POP.

Up until now, `f.read(1)` has always read bytes from the underlying file descriptor into the buffer until it has one full character, and then immediately returned it. This is user-facing behavior. Imagine an echo server that reads one character at a time and echoes it back, forever. The client only ever sends one character at a time, so if an eight-bit locale encoding is in use, the client sends one byte and then waits for a response. As things stand today, this works. If encoding detection were added, and the server's call to `f.read(1)` could decide that it doesn't know how to decode the first byte it gets and block until more data comes in, that would be a deadlock, since the client isn't sending more.
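To make the echo-server case concrete, here is a rough sketch of the behavior I'm describing. It uses an anonymous pipe as a stand-in for the socket and assumes Latin-1 as the eight-bit encoding; both are my choices for illustration. The read runs in a thread with a timeout only so that the script can't hang if the behavior ever changed.

```python
import os
import threading

# Anonymous pipe standing in for the client connection (an assumption for
# this sketch). The write end stays open, so the reader never sees EOF.
r_fd, w_fd = os.pipe()
reader = open(r_fd, "r", encoding="latin-1", newline="")

result = []
worker = threading.Thread(
    target=lambda: result.append(reader.read(1)),
    daemon=True,  # don't keep the process alive if the read were to block
)

# The "client" sends exactly one byte and then waits for a reply.
os.write(w_fd, b"\xe9")
worker.start()
worker.join(timeout=5)

# True if read(1) came back without waiting for more bytes.
returned_promptly = not worker.is_alive()
print(returned_promptly, result)

os.close(w_fd)
reader.close()
```

If `read(1)` had to wait for more bytes before committing to a decoding, the worker thread would still be blocked when the timeout expires.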

> A typical buffer size is 4096 bytes, or more.

Sure, but that doesn't mean that much data is always available. If something has written less than that, it's not reasonable to block until more data can be buffered in places where up until now no blocking would have occurred, not least because more data may never come.

And if it were instead to make its decisions based on what has been buffered already, without ever blocking, then the behavior becomes nondeterministic: it could return a different character depending on how much data the OS returned in the first read syscall.
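A small sketch of that nondeterminism. The detector here, `first_char_with_guessing`, is purely hypothetical (try UTF-8, fall back to Latin-1) and isn't meant to represent any actual proposal; it only shows how the same stream can yield a different first character depending on where the first read happened to stop.

```python
def first_char_with_guessing(buffered: bytes) -> str:
    """Hypothetical detector: decode as UTF-8 if valid, else as Latin-1."""
    try:
        return buffered.decode("utf-8")[0]
    except UnicodeDecodeError:
        return buffered.decode("latin-1")[0]


payload = "é".encode("utf-8")  # two bytes: b"\xc3\xa9"

# Case 1: the first read syscall returned only the first byte.
short_read = first_char_with_guessing(payload[:1])

# Case 2: the first read syscall returned both bytes.
full_read = first_char_with_guessing(payload)

print(repr(short_read), repr(full_read))
```

The sender and the bytes on the wire are identical in both cases; only the timing of the read differs.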

> In any case, I believe the intention of this proposal is for *open*, not
> read, to perform the detection.

If that's the case, named pipes are a perfect example of why that's impossible. It's perfectly normal to open a named pipe that contains no data, and that won't contain any until you trigger some action (say, spawning a child process that will write to it). You can't auto-detect the encoding of an empty pipe, and you can't make open block until data arrives, because it's entirely possible that data will never arrive if open blocks.
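Roughly what that looks like, sketched with an anonymous pipe rather than a real FIFO so that it stays self-contained (that substitution is an assumption of the example, as is running on a POSIX system where `select` works on pipe descriptors). The point is only that `open` succeeds while there is nothing at all to inspect, and the data shows up later, after the caller has done something to trigger it.

```python
import os
import select

# Anonymous pipe used as a stand-in for a named pipe (assumption).
r_fd, w_fd = os.pipe()

# Opening in text mode succeeds even though the pipe is empty.
reader = open(r_fd, "r", encoding="utf-8")

# A zero-timeout poll: nothing is readable at open time, so there are no
# bytes from which an encoding could be detected.
before, _, _ = select.select([r_fd], [], [], 0)

# Only now does the "child process" get around to writing.
os.write(w_fd, "hello\n".encode("utf-8"))
after, _, _ = select.select([r_fd], [], [], 0)

line = reader.readline()
print(before, after, repr(line))

os.close(w_fd)
reader.close()
```

Any detection done inside `open` would have had to either guess from zero bytes or block until the write, and the write here only happens after `open` has returned.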