BufferedIO and detach
There doesn't seem to be a way to safely use detach() on stdin. I'd like to get down to the raw stream, but after calling detach(), the initial BufferedReader is unusable, so you cannot retrieve any buffered content - and unless you detach(), you can't guarantee that the buffer will ever be empty. I presume I'm missing something, but if there were a read([n], buffered_only=False) call, which you could invoke with buffered_only=True, then it would be possible to get out of this situation. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services
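[The failure mode is easy to demonstrate without touching stdin. A minimal sketch against an in-memory stream - the buffer size and payload are illustrative, not anything from the thread:]

```python
import io

raw = io.BytesIO(b"#!magic\nrest of stream")
buffered = io.BufferedReader(raw, buffer_size=8)

first = buffered.read(1)        # triggers one raw read that fills the 8-byte buffer
underlying = buffered.detach()  # the BufferedReader is now unusable

# The raw stream has advanced past bytes that were sitting in the buffer,
# so those bytes are lost to anyone reading from `underlying` from here on.
print(first)                # b'#'
print(underlying.tell())    # 8, not 1 - seven buffered bytes just vanished
```

Once detach() has been called, the seven bytes that were in the buffer are unreachable: the BufferedReader refuses further operations, and the raw stream has already moved past them.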
Robert Collins <robertc@...> writes:
There doesn't seem to be a way to safely use detach() on stdin. I'd like to get down to the raw stream, but after calling detach(), the initial BufferedReader is unusable, so you cannot retrieve any buffered content - and unless you detach(), you can't guarantee that the buffer will ever be empty.
Presumably if you call it before anyone else has had a chance to read from it, you should be okay.
On 4 March 2013 05:16, Benjamin Peterson <benjamin@python.org> wrote:
Robert Collins <robertc@...> writes:
There doesn't seem to be a way to safely use detach() on stdin. I'd like to get down to the raw stream, but after calling detach(), the initial BufferedReader is unusable, so you cannot retrieve any buffered content - and unless you detach(), you can't guarantee that the buffer will ever be empty.
Presumably if you call it before anyone else has had a chance to read from it, you should be okay.
That's hard to guarantee in the general case: consider a library utility that accepts an input stream. To make it concrete, consider dispatching to different processors based on the first few bytes of a stream: you'd have to force raw IO handling everywhere, rather than just the portion of code that needs it... -Rob
On Sun, Mar 3, 2013 at 6:31 PM, Robert Collins <robertc@robertcollins.net> wrote:
On 4 March 2013 05:16, Benjamin Peterson <benjamin@python.org> wrote:
Robert Collins <robertc@...> writes:
There doesn't seem to be a way to safely use detach() on stdin. I'd like to get down to the raw stream, but after calling detach(), the initial BufferedReader is unusable, so you cannot retrieve any buffered content - and unless you detach(), you can't guarantee that the buffer will ever be empty.
Presumably if you call it before anyone else has had a chance to read from it, you should be okay.
That's hard to guarantee in the general case: consider a library utility that accepts an input stream. To make it concrete, consider dispatching to different processors based on the first few bytes of a stream: you'd have to force raw IO handling everywhere, rather than just the portion of code that needs it...
The solution would seem obvious: detach before reading anything from the stream. But apparently you're trying to come up with a reason why that's not enough. I think you're concerned about the situation where you have a stream of uncertain origin, and you want to switch to raw, unbuffered I/O. You realize that some of the bytes you are interested in might already have been read into the buffer. So you want access to the contents of the buffer.

When the io module was originally designed, this was actually one of the (implied) use cases -- one reason I wanted to stop using C stdio was that I didn't like that there is no standard way to get at the data in the buffer, in similar use cases as you're trying to present. (A use case I could think of would be an http server that forks a subprocess after reading e.g. the first line of the http request, or perhaps after the headers.)

It seems that when the io module was rewritten in C for speed (and I am very grateful that it was, the Python version was way too slow) this use case, being pretty rare, was forgotten. In specific use cases it's usually easy enough to just open the file unbuffered, or detach before reading anything.

Can you write C code? If so, perhaps you can come up with a patch. Personally, I'm not sure that your proposed API (a buffered_only flag to read()) is the best way to go about it. Maybe detach() should return the remaining buffered data? (Perhaps only if a new flag is given.)

FWIW I think it's also possible that some of the data has made it into the text wrapper already, so you'll have to be able to extract it from there as well. (Good luck.)

-- --Guido van Rossum (python.org/~guido)
Guido van Rossum <guido@...> writes:
When the io module was originally designed, this was actually one of the (implied) use cases -- one reason I wanted to stop using C stdio was that I didn't like that there is no standard way to get at the data in the buffer, in similar use cases as you're trying to present. (A use case I could think of would be an http server that forks a subprocess after reading e.g. the first line of the http request, or perhaps after the headers.)
What was the API that provided this in the Python version of the io module? (Note it still mostly lives as Lib/_pyio.py)
On Sunday, March 3, 2013, Benjamin Peterson wrote:
Guido van Rossum <guido@...> writes:
When the io module was originally designed, this was actually one of the (implied) use cases -- one reason I wanted to stop using C stdio was that I didn't like that there is no standard way to get at the data in the buffer, in similar use cases as you're trying to present. (A use case I could think of would be an http server that forks a subprocess after reading e.g. the first line of the http request, or perhaps after the headers.)
What was the API that provided this in the Python version of the io module?
I think it may not have been more than accessing private instance variables. :-) (Note it still mostly lives as Lib/_pyio.py)
That won't help a concrete use case though, will it? --Guido
Guido van Rossum <guido@...> writes:
On Sunday, March 3, 2013, Benjamin Peterson wrote: What was the API that provided this in the Python version of the io module?
I think it may not have been more than accessing private instance variables.
It's a bit hard to claim that was ever a "supported" use case then.
That won't help a concrete use case though, will it?
No, I was just pointing that out in case you wanted to reference it.
On Mon, Mar 4, 2013 at 1:41 PM, Benjamin Peterson <benjamin@python.org> wrote:
Guido van Rossum <guido@...> writes:
On Sunday, March 3, 2013, Benjamin Peterson wrote: What was the API that provided this in the Python version of the io module?
I think it may not have been more than accessing private instance variables.
It's a bit hard to claim that was ever a "supported" use case then.
True, it was not supported, but it was *possible* (and I had *meant*) to support it by adding a new API to read what's in the buffer in a completely portable way. This was still a step forward compared to using stdio, where the hacks needed to access the buffer would vary by platform and libc version. And that's all I meant by that comment.
On 4 March 2013 18:50, Guido van Rossum <guido@python.org> wrote:
On Sun, Mar 3, 2013 at 6:31 PM, Robert Collins <robertc@robertcollins.net> wrote:
On 4 March 2013 05:16, Benjamin Peterson <benjamin@python.org> wrote:
Robert Collins <robertc@...> writes:
There doesn't seem to be a way to safely use detach() on stdin. I'd like to get down to the raw stream, but after calling detach(), the initial BufferedReader is unusable, so you cannot retrieve any buffered content - and unless you detach(), you can't guarantee that the buffer will ever be empty.
Presumably if you call it before anyone else has had a chance to read from it, you should be okay.
That's hard to guarantee in the general case: consider a library utility that accepts an input stream. To make it concrete, consider dispatching to different processors based on the first few bytes of a stream: you'd have to force raw IO handling everywhere, rather than just the portion of code that needs it...
The solution would seem obvious: detach before reading anything from the stream.
But apparently you're trying to come up with a reason why that's not enough. I think you're concerned about the situation where you have a stream of uncertain origin, and you want to switch to raw, unbuffered I/O. You realize that some of the bytes you are interested in might already have been read into the buffer. So you want access to the contents of the buffer.
Yes exactly. A little more context on how I came to ask the question. I wanted to accumulate all input on an arbitrary stream within 5ms, without blocking for longer. Using raw IO + select, it's possible to loop, reading one byte at a time. The io module doesn't have an API (that I could find) for putting an existing stream into non-blocking mode, so reading a larger amount and taking what is returned isn't viable. However, without raw I/O, select() will time out because it consults the underlying file descriptor, bypassing the buffer. So - the only reason to want raw I/O is to be able to use select reliably. An alternative would be being able to drain the buffer with no underlying I/O calls at all, then use select + read1, then rinse and repeat.
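[For the record, the one-byte-at-a-time loop described here looks something like the following sketch - read_available is a hypothetical helper, not a proposed API, and note the 5ms budget applies per select() call, so the total wait can run slightly longer:]

```python
import os
import select

def read_available(fd, timeout=0.005):
    """Accumulate whatever arrives on fd, one byte per read so that a
    read can never block, giving up once select() times out."""
    chunks = []
    while select.select([fd], [], [], timeout)[0]:
        chunks.append(os.read(fd, 1))   # one byte: guaranteed not to block
    return b"".join(chunks)

# Demonstration with a pipe standing in for the arbitrary stream:
r, w = os.pipe()
os.write(w, b"hello")
data = read_available(r)    # b'hello'
```

This is exactly the shape of loop that only works against a raw fd: with a BufferedReader in front, select() would consult the empty file descriptor while data sat unseen in the buffer.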
When the io module was originally designed, this was actually one of the (implied) use cases -- one reason I wanted to stop using C stdio was that I didn't like that there is no standard way to get at the data in the buffer, in similar use cases as you're trying to present. (A use case I could think of would be an http server that forks a subprocess after reading e.g. the first line of the http request, or perhaps after the headers.)
That's a very similar case, as it happens - protocol handling is present in my use case too.
It seems that when the io module was rewritten in C for speed (and I am very grateful that it was, the Python version was way too slow) this use case, being pretty rare, was forgotten. In specific use cases it's usually easy enough to just open the file unbuffered, or detach before reading anything.
Can you write C code? If so, perhaps you can come up with a patch. Personally, I'm not sure that your proposed API (a buffered_only flag to read()) is the best way to go about it. Maybe detach() should return the remaining buffered data? (Perhaps only if a new flag is given.)
FWIW I think it's also possible that some of the data has made it into the text wrapper already, so you'll have to be able to extract it from there as well. (Good luck.)
I can write C code, and if evolving the API is acceptable (it sounds like it is) I'll be more than happy to make a patch. Some variations I can think of...

The buffer_only flag I suggested, on read_into, read1, read etc.

Have detach return the buffered data as you suggest - that would be incompatible unless we stash it on the raw object somewhere, or do something along those lines.

A read0 - analogous to read1, returns data from the buffer, but guarantees no underlying calls.

I think exposing the buffer more explicitly is a good principle, independent of whether we change detach or not.
On Mon, Mar 4, 2013 at 4:44 PM, Robert Collins <robertc@robertcollins.net> wrote:
Some variations I can think of...
The buffer_only flag I suggested, on read_into, read1, read etc.
Have detach return the buffered data as you suggest - that would be incompatible unless we stash it on the raw object somewhere, or do something along those lines.
A read0 - analogous to read1, returns data from the buffer, but guarantees no underlying calls.
I think exposing the buffer more explicitly is a good principle, independent of whether we change detach or not.
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer. That's actually one of the interesting problems with supporting a "set_encoding()" method on IO streams (see http://bugs.python.org/issue15216).

How does the following API sound for your purposes? (this is based on what set_encoding() effectively has to do under the hood):

BufferedReader:

    def push_data(binary_data):
        """Prepends contents of 'binary_data' to the internal buffer"""

    def clear_buffer():
        """Clears the internal buffer and returns the previous content
        as a bytes object"""

TextIOWrapper:

    def push_data(char_data, binary_data=b""):
        """Prepends contents of 'char_data' to the internal buffer. If
        binary_data is provided, it is pushed into the underlying IO
        buffered reader. Raises UnsupportedOperation if the underlying
        stream has no "push_data" method."""

    def clear_buffer():
        """Clears the internal buffers and returns the previous content
        as a (char_data, binary_data) pair. The binary data includes any
        data that was queued inside the codec, as well as the contents
        of the underlying IO buffer"""

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
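[Since push_data()/clear_buffer() exist only as a proposal, a toy pure-Python model of the BufferedReader half illustrates the intended round-trip - PushableBuffer and its feed() method are inventions of this sketch, not real io classes:]

```python
class PushableBuffer:
    """Toy model of the proposed BufferedReader buffer-access methods."""

    def __init__(self):
        self._buf = b""

    def feed(self, data):
        # Stand-in for data arriving from the raw stream into the buffer.
        self._buf += data

    def push_data(self, binary_data):
        # Prepend data to the internal buffer (proposed API).
        self._buf = binary_data + self._buf

    def clear_buffer(self):
        # Drain and return the buffer with no raw I/O (proposed API).
        data, self._buf = self._buf, b""
        return data

buf = PushableBuffer()
buf.feed(b"GET / HTTP/1.1\r\n")
head = buf.clear_buffer()      # inspect everything buffered so far
buf.push_data(head[4:])        # hand back what we did not consume
```

clear_buffer() drains with no underlying I/O, and push_data() lets the caller return whatever it did not consume - the same property Rob's read0 idea relies on.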
On 4 March 2013 22:12, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Mon, Mar 4, 2013 at 4:44 PM, Robert Collins <robertc@robertcollins.net> wrote:
Some variations I can think of...
The buffer_only flag I suggested, on read_into, read1, read etc.
Have detach return the buffered data as you suggest - that would be incompatible unless we stash it on the raw object somewhere, or do something along those lines.
A read0 - analogous to read1, returns data from the buffer, but guarantees no underlying calls.
I think exposing the buffer more explicitly is a good principle, independent of whether we change detach or not.
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer. That's actually one of the interesting problems with supporting a "set_encoding()" method on IO streams (see http://bugs.python.org/issue15216).
Indeed. Fun! Caches are useful but add complexity :)
How does the following API sound for your purposes? (this is based on what set_encoding() effectively has to do under the hood):
BufferedReader:
def push_data(binary_data): """Prepends contents of 'data' to the internal buffer"""
def clear_buffer(): """Clears the internal buffer and returns the previous content as a bytes object"""
TextIOWrapper:
def push_data(char_data, binary_data=b""): """Prepends contents of 'data' to the internal buffer. If binary_data is provided, it is pushed into the underlying IO buffered reader. Raises UnsupportedOperation if the underlying stream has no "push_data" method."""
def clear_buffer(): """Clears the internal buffers and returns the previous content as a (char_data, binary_data) pair. The binary data includes any data that was queued inside the codec, as well as the contents of the underlying IO buffer"""
That would make the story of 'get me back to raw IO' straightforward, though the TextIOWrapper's clear_buffer semantics are a little unclear to me from just the docstring. I think having TextIOWrapper only return bytes from clear_buffer and only accept bytes in push_data would be simpler to reason about, if a little more complex on the internals.

Now, one could implement 'read0' manually using read1 + clear_buffer + push_data:

    # first, unwrap back to a bytes layer
    buffer = textstream.buffer
    buffer.push_data(textstream.clear_buffer()[1])

    def read0(n):
        data = buffer.clear_buffer()
        result = data[:n]
        buffer.push_data(data[n:])
        return result

But it might be more efficient to define read0 directly on BufferedReader.

-Rob
On 4 Mar 2013 19:19, "Robert Collins" <robertc@robertcollins.net> wrote:
On 4 March 2013 22:12, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Mon, Mar 4, 2013 at 4:44 PM, Robert Collins <robertc@robertcollins.net> wrote:
Some variations I can think of...
The buffer_only flag I suggested, on read_into, read1, read etc.
Have detach return the buffered data as you suggest - that would be incompatible unless we stash it on the raw object somewhere, or do something along those lines.
A read0 - analogous to read1, returns data from the buffer, but guarantees no underlying calls.
I think exposing the buffer more explicitly is a good principle, independent of whether we change detach or not.
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer. That's actually one of the interesting problems with supporting a "set_encoding()" method on IO streams (see http://bugs.python.org/issue15216).
Indeed. Fun! Caches are useful but add complexity :)
How does the following API sound for your purposes? (this is based on what set_encoding() effectively has to do under the hood):
BufferedReader:
def push_data(binary_data): """Prepends contents of 'data' to the internal buffer"""
def clear_buffer(): """Clears the internal buffer and returns the previous content as a bytes object"""
TextIOWrapper:
def push_data(char_data, binary_data=b""): """Prepends contents of 'data' to the internal buffer. If binary_data is provided, it is pushed into the underlying IO buffered reader. Raises UnsupportedOperation if the underlying stream has no "push_data" method."""
def clear_buffer(): """Clears the internal buffers and returns the previous content as a (char_data, binary_data) pair. The binary data includes any data that was queued inside the codec, as well as the contents of the underlying IO buffer"""
That would make the story of 'get me back to raw IO' straightforward, though the TextIOWrapper's clear_buffer semantics are a little unclear to me from just the docstring. I think having TextIOWrapper only return bytes from clear_buffer and only accept bytes in push_data would be simpler to reason about, if a little more complex on the internals.
I originally had it defined that way, but as Victor points out in the set_encoding issue, decoding is potentially lossy in the general case, so we can't reliably convert already decoded characters back to bytes. The appropriate way to handle that is going to be application specific, so I changed the proposed API to produce a (str, bytes) 2-tuple. Cheers, Nick.
Now, one could implement 'read0' manually using read1 + clear_buffer + push_data:

    # first, unwrap back to a bytes layer
    buffer = textstream.buffer
    buffer.push_data(textstream.clear_buffer()[1])

    def read0(n):
        data = buffer.clear_buffer()
        result = data[:n]
        buffer.push_data(data[n:])
        return result

But it might be more efficient to define read0 directly on BufferedReader.
-Rob
On 4 March 2013 22:52, Nick Coghlan <ncoghlan@gmail.com> wrote:
I originally had it defined that way, but as Victor points out in the set_encoding issue, decoding is potentially lossy in the general case, so we can't reliably convert already decoded characters back to bytes. The appropriate way to handle that is going to be application specific, so I changed the proposed API to produce a (str, bytes) 2-tuple.
I don't quite follow - why would we need to convert decoded characters to bytes? While it is lossy, we know the original bytes. If we keep the original bytes around until their characters are out of the buffer, there is no loss window - and the buffer size in TextIOWrapper is quite small by default, isn't it? If we need to be strictly minimal then yes, I can see why your tweaked API would be better.

However - two bits of feedback: it should say more clearly that there is no overlap between the text and binary segments: any bytes that have been decoded are in the text segment and only in the text segment. push_data has a wart though; consider a TextIOWrapper with the following buffer:

    text="foo" binary=b"bar"

When you call push_data("quux", b"baz"), should you end up with

    text="quuxfoo" binary=b"bazbar"

or

    text="quux" + b"baz".decode(self.encoding) + "foo" binary=b"bar"

The latter is clearly the intent, but the docstring implies the former behaviour. (The latter case does depend on the bytestring being decodable on its own when there is content in the text buffer - but even a complex buffer that is a sequence of text or byte regions would still have that requirement, due to not being able to recode reliably.)

-Rob
Le Mon, 4 Mar 2013 19:12:03 +1000, Nick Coghlan <ncoghlan@gmail.com> a écrit :
On Mon, Mar 4, 2013 at 4:44 PM, Robert Collins <robertc@robertcollins.net> wrote:
Some variations I can think of...
The buffer_only flag I suggested, on read_into, read1, read etc.
Have detach return the buffered data as you suggest - that would be incompatible unless we stash it on the raw object somewhere, or do something along those lines.
A read0 - analogous to read1, returns data from the buffer, but guarantees no underlying calls.
I think exposing the buffer more explicitly is a good principle, independent of whether we change detach or not.
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer.
I'd prefer if TextIOWrapper was totally unsupported in that context. Regards Antoine.
On 4 March 2013 23:47, Antoine Pitrou <solipsis@pitrou.net> wrote:
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer.
I'd prefer if TextIOWrapper was totally unsupported in that context.
The problem is that sys.stdin and sys.stdout default to TextIOWrappers, and handling protocols requires bytes, so having a way to drop down to bytes is very convenient. Doing it by command line arguments to Python works as long as a command is always byte oriented (or never) - but that's a very big hammer. -Rob
On Tue, 5 Mar 2013 10:11:12 +1300 Robert Collins <robertc@robertcollins.net> wrote:
On 4 March 2013 23:47, Antoine Pitrou <solipsis@pitrou.net> wrote:
As Guido noted, you actually have multiple layers of buffering to contend with - for a text stream, you may have already decoded characters and partially decoded data in the codec's internal buffer, in addition to any data in the IO buffer.
I'd prefer if TextIOWrapper was totally unsupported in that context.
The problem is that sys.stdin and sys.stdout default to TextIOWrappers, and handling protocols requires bytes, so having a way to drop down to bytes is very convenient.
Why do you want to drop to bytes *after* having already buffered stuff in sys.{stdin,stdout}? Regards Antoine.
On 5 March 2013 10:12, Antoine Pitrou <solipsis@pitrou.net> wrote:
The problem is that sys.stdin and sys.stdout default to TextIOWrappers, and handling protocols requires bytes, so having a way to drop down to bytes is very convenient.
Why do you want to drop to bytes *after* having already buffered stuff in sys.{stdin,stdout}?
I don't (when reading), and for my purposes having the drop-down process error when reads have been done at the text layer would be fine. Writing is more ambiguous (for me, not as a problem statement), but also works fine today, so nothing is needed from my perspective. -Rob
Le Mon, 4 Mar 2013 19:44:27 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
Yes exactly. A little more context on how I came to ask the question. I wanted to accumulate all input on an arbitrary stream within 5ms, without blocking for longer. Using raw IO + select, it's possible to loop, reading one byte at a time. The io module doesn't have an API (that I could find) for putting an existing stream into non-blocking mode, so reading a larger amount and taking what is returned isn't viable.
What do you mean exactly by that?
However, without raw I/O, select() will time out because it consults the underlying file descriptor, bypassing the buffer. So - the only reason to want raw I/O is to be able to use select reliably.
That's a pretty good reason actually. Raw I/O is exactly for those cases. Non-blocking buffered I/O is a hard conceptual problem: http://bugs.python.org/issue13322
An alternative would be being able to drain the buffer with no underlying I/O calls at all, then use select + read1, then rinse and repeat.
Have you tried peek()? Regards Antoine.
On 4 March 2013 22:59, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 4 Mar 2013 19:44:27 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
Yes exactly. A little more context on how I came to ask the question. I wanted to accumulate all input on an arbitrary stream within 5ms, without blocking for longer. Using raw IO + select, it's possible to loop, reading one byte at a time. The io module doesn't have an API (that I could find) for putting an existing stream into non-blocking mode, so reading a larger amount and taking what is returned isn't viable.
What do you mean exactly by that?
Just what I said. I'll happily try to rephrase. What bit was unclear?
However, without raw I/O, select() will time out because it consults the underlying file descriptor, bypassing the buffer. So - the only reason to want raw I/O is to be able to use select reliably.
That's a pretty good reason actually. Raw I/O is exactly for those cases. Non-blocking buffered I/O is a hard conceptual problem: http://bugs.python.org/issue13322
Sure, it can get tricky to reason about. But - the whole point of libraries like io is to encapsulate common solutions to tricky things, so that we don't have a hundred incompatible not-quite-the-same layers sitting on top. Right now select + BufferedReader is plain buggy, non-blocking or not; I'd like to fix that - for instance, if select consulted the buffer somehow and returned immediately if the buffer had data, that would be an improvement (as select doesn't say *how much* data can be read).
An alternative would be being able to drain the buffer with no underlying I/O calls at all, then use select + read1, then rinse and repeat.
Have you tried peek()?
Per http://docs.python.org/3.2/library/io.html#io.BufferedReader.peek, peek may cause I/O - only one call, but still you cannot control it. -Rob
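[For reference, peek()'s behaviour is easy to observe with an in-memory raw stream - a sketch with an illustrative buffer size, showing that the peek itself may pull bytes off the raw stream even though the buffered position does not move:]

```python
import io

raw = io.BytesIO(b"abcdef")
br = io.BufferedReader(raw, buffer_size=4)

peeked = br.peek(2)   # at most one raw read happens here to fill the buffer
moved = raw.tell()    # the raw stream has advanced, even though br has not
data = br.read(2)     # b'ab' - the peek consumed nothing from br's view
```

Note that peek(2) may return more than two bytes (it reports what is buffered), and that single buffer-filling read is exactly the uncontrollable I/O Rob is pointing at: with select on the raw fd, those bytes are now invisible.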
Le Mon, 4 Mar 2013 23:15:36 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
On 4 March 2013 22:59, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 4 Mar 2013 19:44:27 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
Yes exactly. A little more context on how I came to ask the question. I wanted to accumulate all input on an arbitrary stream within 5ms, without blocking for longer. Using raw IO + select, it's possible to loop, reading one byte at a time. The io module doesn't have an API (that I could find) for putting an existing stream into non-blocking mode, so reading a larger amount and taking what is returned isn't viable.
What do you mean exactly by that?
Just what I said. I'll happily try to rephrase. What bit was unclear?
I don't understand what you mean by "putting an existing stream into non-blocking mode"? What stream exactly is it? And why is reading a larger amount not viable? Regards Antoine.
On 4 March 2013 23:45, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 4 Mar 2013 23:15:36 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
On 4 March 2013 22:59, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Mon, 4 Mar 2013 19:44:27 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
Yes exactly. A little more context on how I came to ask the question. I wanted to accumulate all input on an arbitrary stream within 5ms, without blocking for longer. Using raw IO + select, it's possible to loop, reading one byte at a time. The io module doesn't have an API (that I could find) for putting an existing stream into non-blocking mode, so reading a larger amount and taking what is returned isn't viable.
What do you mean exactly by that?
Just what I said. I'll happily try to rephrase. What bit was unclear?
I don't understand what you mean by "putting an existing stream into non-blocking mode"? What stream exactly is it? And why is reading a larger amount not viable?
sys.stdin - starts in blocking mode. How do you convert it to non-blocking mode? Portably? Now, how do you convert it to non-blocking mode when you don't know that it is fd 0, and instead you just have a stream (TextIOWrapper or BufferedReader or even a RawIO instance)?

If you have an fd in blocking mode, and select indicates it is readable, reading one byte won't block. Reading two bytes may block. In non-blocking mode, reading will never block, and select tells you whether you can expect any content at all to be available. So reading more than one byte isn't viable when the fd is in blocking mode and you don't want to block in your program.

The reason I run into this is that I have a program that deals with both interactive and bulk traffic on the same file descriptor, and there doesn't seem to be a portable way (where portable means Linux/BSD/MacOSX/Windows) to flip a stream to non-blocking mode (in Python, going by the io module docs).

-Rob
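[On POSIX systems the conversion can be done by hand through the stream's fileno() - a sketch of the fcntl dance; set_nonblocking is a hypothetical helper, nothing here is a documented io-module API, and it does not work on Windows because the fcntl module does not exist there:]

```python
import fcntl
import io
import os

def set_nonblocking(stream):
    """Put an existing stream's file descriptor into non-blocking mode
    (POSIX only - this is what the io module does not expose)."""
    fd = stream.fileno()
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Demonstration with a pipe's read end wrapped as a raw stream:
r, w = os.pipe()
reader = io.FileIO(r)
set_nonblocking(reader)
result = reader.read(100)   # empty pipe: returns None instead of blocking
os.write(w, b"x")
got = reader.read(100)      # data is available now, so we get it
```

In non-blocking mode a raw read() returns None rather than hanging when nothing is available, which is what makes the select loop safe with reads larger than one byte.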
On Tue, 5 Mar 2013 10:17:11 +1300 Robert Collins <robertc@robertcollins.net> wrote:
sys.stdin - starts in blocking mode. How do you convert it to non-blocking mode? Portably? Now, how do you convert it to non-blocking mode when you don't know that it is fd 0, and instead you just have a stream (TextIOWrapper or BufferedReader or even a RawIO instance)?
How about the fileno() method?
On 5 March 2013 10:15, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:17:11 +1300 Robert Collins <robertc@robertcollins.net> wrote:
sys.stdin - starts in blocking mode. How do you convert it to non-blocking mode? Portably? Now, how do you convert it to non-blocking mode when you don't know that it is fd 0, and instead you just have a stream (TextIOWrapper or BufferedReader or even a RawIO instance)?
How about the fileno() method?
What about it? Do you mean 'non-blocking mode is entirely defined by the OS-level read() behaviour and there is no tracking of that state higher up'? If so, cool (and we should document that somewhere). I'll need to go look up the Windows equivalent to fcntl, and I still think the current hidden buffer status is problematic. -Rob
On Tue, 5 Mar 2013 10:29:35 +1300 Robert Collins <robertc@robertcollins.net> wrote:
How about the fileno() method?
What about it? Do you mean 'non-blocking mode is entirely defined by the OS-level read() behaviour and there is no tracking of that state higher up'? If so, cool (and we should document that somewhere).
Yes, I mean that :-) You're right, it should be documented.
I'll need to go look up the Windows equivalent to fcntl, and I still think the current hidden buffer status is problematic.
Windows has no notion of non-blocking streams, except for sockets. Regards Antoine.
On 5 March 2013 10:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:29:35 +1300 Robert Collins <robertc@robertcollins.net> wrote:
How about the fileno() method?
What about it? Do you mean 'non-blocking mode is entirely defined by the OS level read() behaviour and there is no tracking of that state higher up' ? If so cool (and we should document that somewhere).
Yes, I mean that :-) You're right, it should be documented.
I'll need to go lookup the windows equivalent to FCNTL, and I still think the current hidden buffer status is problematic.
Windows has no notion of non-blocking streams, except for sockets.
Hmm, I know the libc emulation layer doesn't - but http://msdn.microsoft.com/en-us/library/ms684961%28VS.85%29.aspx does non-blocking IO (for stdin specifically) - we should be able to hook that in, in principle... and disk files can do non-blocking with overlapped IO (though that is a wholly different beast and clearly off-topic :)). -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services
On Tue, 5 Mar 2013 10:50:01 +1300 Robert Collins <robertc@robertcollins.net> wrote:
On 5 March 2013 10:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:29:35 +1300 Robert Collins <robertc@robertcollins.net> wrote:
How about the fileno() method?
What about it? Do you mean 'non-blocking mode is entirely defined by the OS level read() behaviour and there is no tracking of that state higher up' ? If so cool (and we should document that somewhere).
Yes, I mean that :-) You're right, it should be documented.
I'll need to go lookup the windows equivalent to FCNTL, and I still think the current hidden buffer status is problematic.
Windows has no notion of non-blocking streams, except for sockets.
Hmm, I know the libc emulation layer doesn't - but http://msdn.microsoft.com/en-us/library/ms684961%28VS.85%29.aspx does non-blocking IO (for stdin specifically) - we should be able to hook that in, in principle...
I didn't know about that. I wonder, what happens if the standard input is redirected? Also, is it able to read actual raw bytes? INPUT_RECORD looks rather specialized: http://msdn.microsoft.com/en-us/library/ms683499%28v=vs.85%29.aspx
and disk files can do nonblocking with overlapped IO (though that is a wholly different beast and clearly offtopic :)).
It's not non-blocking then, it's asynchronous (it's blocking but in another thread ;-)). Regards Antoine.
On 5 March 2013 20:31, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:50:01 +1300
I didn't know about that. I wonder, what happens if the standard input is redirected? Also, is it able to read actual raw bytes? INPUT_RECORD looks rather specialized: http://msdn.microsoft.com/en-us/library/ms683499%28v=vs.85%29.aspx
I don't know; cygwin's source may tell us, or we could get someone with a Windows machine to do some testing.
and disk files can do nonblocking with overlapped IO (though that is a wholly different beast and clearly offtopic :)).
It's not non-blocking then, it's asynchronous (it's blocking but in another thread ;-)).
Well... it's not in another userspace thread - it's near-identical in implementation to Linux AIO: the kernel takes care of it. The delivery mechanism is however very different (you sleep and the kernel calls you back). -Rob
Le Tue, 5 Mar 2013 20:39:55 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
On 5 March 2013 20:31, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:50:01 +1300
I didn't know about that. I wonder, what happens if the standard input is redirected? Also, is it able to read actual raw bytes? INPUT_RECORD looks rather specialized: http://msdn.microsoft.com/en-us/library/ms683499%28v=vs.85%29.aspx
I don't know; cygwin's source may, or we could get someone with a Windows machine to do some testing.
Apparently you need ReadConsole to read bytes, not ReadConsoleInput: http://msdn.microsoft.com/en-us/library/ms684958%28v=vs.85%29.aspx However, none of those functions is technically non-blocking. You can poll the console using one of the wait functions, but there is an important caveat for ReadConsole: « If the input buffer contains input events other than keyboard events (such as mouse events or window-resizing events), they are discarded. Those events can only be read by using the ReadConsoleInput function. » So it seems ReadConsole can block even though you think some data is available.
It's not non-blocking then, it's asynchronous (it's blocking but in another thread ;-)).
Well... its not in another userspace thread - its near-identical in implementation to Linux AIO : the kernel takes care of it. The deliver mechanism is however very different (you sleep and the kernel calls you back).
It's still not non-blocking. On a non-blocking stream, a read fails when no data is available; you have to try reading again later. With asynchronous I/O, the blocking read is scheduled in the background, and it will call you back when finished. It's a different mode of operation. Regards Antoine.
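The distinction is visible directly at the OS level: a non-blocking read on an empty pipe fails immediately with EAGAIN (surfaced as BlockingIOError in Python 3) rather than scheduling anything in the background:

```python
import fcntl
import os

# Create a pipe and put the read end into non-blocking mode.
r, w = os.pipe()
flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    os.read(r, 1)                # no data yet: fails immediately
    outcome = "data"
except BlockingIOError:          # EAGAIN: the caller must retry later
    outcome = "would block"
```

With asynchronous I/O there is no failing call at all; the request is queued and completion is delivered later.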
On 5 March 2013 22:22, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le Tue, 5 Mar 2013 20:39:55 +1300, Robert Collins <robertc@robertcollins.net> a écrit :
On 5 March 2013 20:31, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Tue, 5 Mar 2013 10:50:01 +1300
I didn't know about that. I wonder, what happens if the standard input is redirected? Also, is it able to read actual raw bytes? INPUT_RECORD looks rather specialized: http://msdn.microsoft.com/en-us/library/ms683499%28v=vs.85%29.aspx
I don't know; cygwin's source may, or we could get someone with a Windows machine to do some testing.
Apparently you need ReadConsole to read bytes, not ReadConsoleInput: http://msdn.microsoft.com/en-us/library/ms684958%28v=vs.85%29.aspx
However, none of those functions is technically non-blocking. You can poll the console using one of the wait functions, but there is an important caveat for ReadConsole:
« If the input buffer contains input events other than keyboard events (such as mouse events or window-resizing events), they are discarded. Those events can only be read by using the ReadConsoleInput function. »
So it seems ReadConsole can block even though you think some data is available.
http://msdn.microsoft.com/en-us/library/ms685035%28v=vs.85%29.aspx suggests you can indeed get key events from ReadConsoleInput. I don't know what redirected input does in that case. Any which way, it's some future work that doesn't affect what can be done now. Thanks for the extended discussion; I think the next stage is for me to make a timeslice to put a patch together. -Rob -- Robert Collins <rbtcollins@hp.com> Distinguished Technologist HP Cloud Services
On Tue, Mar 5, 2013, at 5:03, Robert Collins wrote:
Suggests you can indeed get key events from ReadConsoleInput. I don't know what redirected input does in that case.
Redirected I/O does not, in general, work with console functions. (I haven't tried ReadConsoleInput, but even ReadConsole [which returns characters] and WriteConsole don't work). You would have to detect whether the standard input(/output/etc) handle is a console and behave differently depending on if it is or not.
On Tue, Mar 5, 2013, at 4:22, Antoine Pitrou wrote:
Apparently you need ReadConsole to read bytes, not ReadConsoleInput: http://msdn.microsoft.com/en-us/library/ms684958%28v=vs.85%29.aspx
ReadConsole reads characters. Using ReadConsoleA to get bytes is almost certainly not what you want 90% of the time. Unfortunately, Python does it (or, more likely, uses ReadFile which does the same thing) now, at least in version 2.7. I may post to this list later this week suggesting improvements to the console streams on win32.
Antoine Pitrou wrote:
Raw I/O is exactly for those cases. Non-blocking buffered I/O is a hard conceptual problem:
I don't think it needs to be all that hard as long as you're willing to give each layer of the protocol stack its own non-blocking I/O calls. Trying to take shortcuts by skipping layers of the stack is asking for pain, though. -- Greg
Guido van Rossum wrote:
Personally, I'm not sure that your proposed API (a buffered_only flag to read()) is the best way to go about it. Maybe detach() should return the remaining buffered data?
Maybe you could be allowed to read() from the buffered stream after detaching the underlying source, which would then return any data remaining in the buffer. -- Greg
Le Tue, 05 Mar 2013 10:14:08 +1300, Greg Ewing <greg.ewing@canterbury.ac.nz> a écrit :
Guido van Rossum wrote:
Personally, I'm not sure that your proposed API (a buffered_only flag to read()) is the best way to go about it. Maybe detach() should return the remaining buffered data?
Maybe you could be allowed to read() from the buffered stream after detaching the underlying source, which would then return any data remaining in the buffer.
Perhaps detach() can take an optional argument for that indeed. Regards Antoine.
Antoine Pitrou wrote:
Le Tue, 05 Mar 2013 10:14:08 +1300, Greg Ewing <greg.ewing@canterbury.ac.nz> a écrit :
Maybe you could be allowed to read() from the buffered stream after detaching the underlying source, which would then return any data remaining in the buffer.
Perhaps detach() can take an optional argument for that indeed.
Does it need to be optional? Is there likely to be any code around that relies on read() *not* working on a detached stream? -- Greg
On Wed, 06 Mar 2013 10:49:30 +1300 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Antoine Pitrou wrote:
Le Tue, 05 Mar 2013 10:14:08 +1300, Greg Ewing <greg.ewing@canterbury.ac.nz> a écrit :
Maybe you could be allowed to read() from the buffered stream after detatching the underlying source, which would then return any data remaining in the buffer.
Perhaps detach() can take an optional argument for that indeed.
Does it need to be optional? Is there likely to be any code around that relies on read() *not* working on a detached stream?
detach() closes the buffered stream by default, which is a piece of behaviour you can't change willy-nilly. Regards Antoine.
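For reference, the closest approximation possible with the current io API is to peek() the buffer just before detaching. Note that peek() itself may perform one raw read when the buffer is empty, and that raw read can block, which is precisely the gap this thread is about; the helper name below is illustrative, not an existing API.

```python
import io

def drain_and_detach(buffered):
    """Return (leftover_buffered_bytes, raw_stream).

    peek() returns what is already buffered without consuming it;
    on CPython it does at most one raw read, and none when the
    buffer is non-empty - so this is a workaround, not a solution.
    """
    leftover = buffered.peek()
    return leftover, buffered.detach()

# Demonstration with an in-memory stream:
raw = io.BytesIO(b"hello world")
buf = io.BufferedReader(raw, buffer_size=8)
first = buf.read(1)              # read-ahead pulls extra bytes into the buffer
leftover, detached = drain_and_detach(buf)
```

Concatenating what was consumed, what was buffered, and what remains in the raw stream reconstructs the original data, regardless of how much read-ahead occurred.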
participants (7)
- Antoine Pitrou
- Benjamin Peterson
- Greg Ewing
- Guido van Rossum
- Nick Coghlan
- random832@fastmail.us
- Robert Collins