io.BufferedReader.peek() Behaviour in python3.1
Greetings,

As I'm sure you all know, there are currently two implementations of the io module: one in Python and one much faster implementation in C. As I recall, the Python version was used in python3.0 and the C version is now used by default in python3.1. The behavior of the two differs in some ways, especially regarding io.BufferedReader.peek(). I wrote an email to the authors of the new C code last Friday and also sent a copy of it to the python list for comments. Antoine Pitrou directed me to bring up what I had asked there either here or as a bug report. I elected to write here because I am not sure it constitutes a bug. In my former email I stated I was willing to submit patches if the old behavior was desired back and the code author was fine with the changes but didn't want to implement them. Antoine said this: "If people need more sophisticated semantics, I'm open to changing peek() to accommodate it." Antoine: if I quote you wrongly, you are free to chastise me.

So my basic question is: the behavior of io.BufferedReader.peek() has changed; is that change something that should stay as is, revert, or be different entirely? Here are the two behaviors:

The Python version of io.BufferedReader.peek() behaves as: if the buffer holds less than requested (up to the buffer size), read the difference (or up to EOF) from the raw stream into the buffer, then return the requested number of bytes from the start of the buffer. This may advance the raw stream but not the logical stream position. This version can guarantee a peek of one chunk (4096 bytes here).

The C version behaves as: if the buffer holds 0 bytes, fill it from the raw stream (or up to EOF), then return what is in the buffer. This may advance the raw stream but not the logical stream position. This version cannot guarantee a peek of more than 1 byte if variable-length reads are being used at all and not tracked.
Neither case limits what is possible, though, in my opinion, one makes it easier to accomplish certain things and is more efficient in those cases. Take the following two parser examples, where s = an io.BufferedReader-wrapped stream with no negative seek in most cases, and f = an output file handle or such.

python version work flow:

    are = re.compile(b'(\r\n|\r|\n)')
    while True:
        d = s.peek(4096)  # chunk size or so.
        found = are.search(d)
        if found:
            w = d[:found.start()]
            s.seek(f.write(w))
            p = s.peek(74)
            if p.startswith(multipart_boundary):
                s.seek(len(multipart_boundary))
                # other code containing more possible splits
                # across boundaries
                continue
            w = d[found.start():found.end()]
            s.seek(f.write(w))
            continue
        f.write(d)
        # more code
        continue

C version work flow:

    old = b''
    are = re.compile(b'(\r\n|\r|\n)')
    while True:
        d = old if old != b'' else s.read1(4096)
        found = are.search(d)
        if found:
            w = d[:found.start()]
            f.write(w)
            w = d[found.start():]
            p = w if len(w) >= 74 else w + s.read(73)
            if p.startswith(multipart_boundary):
                # Other code containing more possible splits
                # across boundaries and joins to p.
                old = ???
                continue
            f.write(d[found.start():found.end()])
            old = d[found.end():] + p
            continue
        old = b''
        f.write(d)
        # more code
        continue

These two examples are not real code but they get the point across, and are based on code I put into a multipart parser, the former written for python3. I later tried running that parser on 3.1 after the new io layer and found it broken, then rewrote it to the new interface; that rewrite is somewhat represented in the latter. This is only one example; others may vary, of course. Peek seems to me to have little use outside of parsers, thus I used parsers as an example. My opinion is that it would be better to have a peek function in C similar to the python implementation, as follows:

peek(n): If n is less than 0, None, or not set: return the buffer contents without advancing the stream position; if the buffer is empty, read a full chunk and return the buffer.
Otherwise return exactly n bytes, up to _chunk size_ (not buffer contents), without advancing the stream position. If the buffer contents are less than n, buffer an additional chunk from the "raw" stream beforehand. If EOF is encountered during any raw read, return as much as we can up to n. (Maybe I should write that in code form??)

This allows us to obtain the behavior of the current C io implementation easily and would give us the old python implementation's behavior when n is given. The basis for this is:

1. Code reduction and simplicity. Looking at the examples, the code reduction should be obvious. The logic needed to maintain a bytestring of the variously required lengths, so that it may be checked, would not be necessary. The need to hold a bytestring over to the next iteration would be done away with as well. Other pieces of data handling would also be simpler.

2. Speed. It would require less handling in the "slower" interpreter if we used the buffer in the buffered reader. Also, all the logic mentioned in 1 is moved to the faster C code or done away with. There is very little necessity for peek outside of parsers, so speed in read-through and random reads would not have to be affected.

I have other reasons and arguments, but I want to know what everyone else thinks. This will most likely show me what I have missed or am not seeing, if anything. I have babbled enough; thanks so much for the consideration.

Frederick Reeve
Greetings,

I feel the need to point out that I made a mistake. When I wrote my last email I said the behavior had changed between python3 and 3.1. This seems not to be the case. I had made that assumption because I had written code based on looking at the code in _pyio.py as well as the python3 documentation (http://docs.python.org/3.0/library/io.html#io.BufferedReader), which seems to be wrong on that point, or I misunderstand. Anyway, I'm sorry about that.

The other point still stands though. I would like to see peek changed. I am willing to write and submit changes but don't want to unless others agree this is a good idea. So I put forth the implementation at the bottom of this email. If it's bad or you don't see the point I may try to clarify, but if nobody thinks it's good, please just tell me I'm wasting your time, and I will go away. I also apologize that my last email was so long.

peek(n): If n is less than 0, None, or not set: return the buffer contents without advancing the stream position; if the buffer is empty, read a full chunk and return the buffer. Otherwise return exactly n bytes, up to _chunk size_ (not buffer contents), without advancing the stream position. If the buffer contents are less than n, buffer an additional chunk from the "raw" stream beforehand. If EOF is encountered during any raw read, return as much as we can up to n. (Maybe I should write that in code form??)

Thanks
Frederick Reeve
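Since the description above asks whether it should be written in code form, here is a minimal sketch of the proposed semantics. The class, its attributes and the CHUNK constant are invented for illustration; this is not the real io.BufferedReader implementation.

```python
import io

CHUNK = 4096  # stand-in for the buffered reader's chunk size

class ProposedPeek:
    """Minimal sketch of the proposed peek() semantics (hypothetical)."""

    def __init__(self, raw):
        self.raw = raw   # underlying raw stream
        self.buf = b''   # bytes buffered but not yet consumed

    def _fill(self, want):
        # The proposal needs at most one extra chunk read; the loop is
        # only there to tolerate short reads from the raw stream.
        while len(self.buf) < want:
            chunk = self.raw.read(CHUNK)
            if not chunk:      # EOF: return as much as we can
                break
            self.buf += chunk

    def peek(self, n=None):
        if n is None or n < 0:
            # Return buffer contents; if empty, read one full chunk.
            if not self.buf:
                self._fill(1)
            return self.buf
        n = min(n, CHUNK)      # cap at chunk size, not buffer contents
        self._fill(n)
        return self.buf[:n]    # never advances the logical position

    def read(self, n):
        self._fill(n)
        out, self.buf = self.buf[:n], self.buf[n:]
        return out
```

With these semantics, the `s.peek(74)` in the parser example of the first email is guaranteed to return 74 bytes unless EOF intervenes.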
Frederick Reeve wrote:
The other point still stands though. I would like to see peek changed. I am willing to write and submit changes but don't want to unless others agree this is a good idea. So I put forth the implementation at the bottom of this email. If it's bad or you don't see the point I may try to clarify, but if nobody thinks it's good, please just tell me I'm wasting your time, and I will go away. I also apologize that my last email was so long.
peek(n): If n is less than 0, None, or not set: return the buffer contents without advancing the stream position; if the buffer is empty, read a full chunk and return the buffer. Otherwise return exactly n bytes, up to _chunk size_ (not buffer contents), without advancing the stream position. If the buffer contents are less than n, buffer an additional chunk from the "raw" stream beforehand. If EOF is encountered during any raw read, return as much as we can up to n. (Maybe I should write that in code form??)
I would phrase this suggestion as users having a reasonable expectation that the following invariant should hold for a buffered stream:

    f.peek(n) == f.read(n)

Since the latter method will perform as many reads of the underlying stream as necessary to reach the requested number of bytes (or EOF), then so should the former. However, the default value of n for peek() should stay at 1, to remain consistent with the current documented behaviour. If this invariant were implemented, I would also suggest adding a "peek1" method to parallel "read1". Note that the current behaviour I get from Python 3.1 is for it to return the *entire* buffer, no matter what number I pass to it: (current Py3k head)
    >>> f = open('setup.py', 'rb')
    >>> len(f.peek(10))
    4096
    >>> len(f.peek(1))
    4096
    >>> len(f.peek(4095))
    4096
    >>> len(f.peek(10095))
    4096
That's an outright bug - I've promoted an existing issue about this [1] to a release blocker and sent it to Benjamin to have another look at. Cheers, Nick. [1] http://bugs.python.org/issue5811 -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
On 13Jun2009 12:24, Nick Coghlan <ncoghlan@gmail.com> wrote:
| Frederick Reeve wrote:
| > The other point still stands though. I would like to see peek
| > changed. I am willing to write and submit changes but don't want to
| > unless others agree this is a good idea. So I put forth the
| > implementation at the bottom of this email. If it's bad or you don't
| > see the point I may try to clarify, but if nobody thinks it's good,
| > please just tell me I'm wasting your time, and I will go away. I
| > also apologize that my last email was so long.
| >
| > peek(n): If n is less than 0, None, or not set: return the buffer
| > contents without advancing the stream position; if the buffer is
| > empty, read a full chunk and return the buffer. Otherwise return
| > exactly n bytes, up to _chunk size_ (not buffer contents), without
| > advancing the stream position. If the buffer contents are less than
| > n, buffer an additional chunk from the "raw" stream beforehand. If
| > EOF is encountered during any raw read, return as much as we can up
| > to n. (Maybe I should write that in code form??)
|
| I would phrase this suggestion as users having a reasonable expectation
| that the following invariant should hold for a buffered stream:
|
|     f.peek(n) == f.read(n)
|
| Since the latter method will perform as many reads of the underlying
| stream as necessary to reach the requested number of bytes (or EOF),
| then so should the former.

I disagree. If that were the case, why have peek() at all? I realise that it doesn't move the logical position, but it does mean that peek(huge_number) imposes a requirement to grow the stream buffer arbitrarily.

A peek that does at most one raw read has the advantage that it can pick up data outside the buffer but lurking in the OS buffer, yet to be obtained. Those data are free, if they're present. (Of course, if they're absent peek() will still block.)
Since (if the OS buffer is also empty) even a peek(1) can block, maybe we should canvass peek()'s common use cases. Naively (never having used peek()), my own desire would normally be for a peek(n, block=False), a bit like Queue.get(). Then I could be sure not to block if I wanted to avoid it, even on a blocking stream, yet still obtain unread buffered data if present. So: what do people use peek() for, mostly? Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ We're in the business of putting goo on a substrate. - overheard by WIRED at the Intelligent Printing conference Oct2006
Cameron Simpson wrote:
On 13Jun2009 12:24, Nick Coghlan <ncoghlan@gmail.com> wrote:
| I would phrase this suggestion as users having a reasonable expectation
| that the following invariant should hold for a buffered stream:
|
|     f.peek(n) == f.read(n)
|
| Since the latter method will perform as many reads of the underlying
| stream as necessary to reach the requested number of bytes (or EOF),
| then so should the former.
I disagree. If that were the case, why have peek() at all? I realise that it doesn't move the logical position, but it does mean that peek(huge_number) imposes a requirement to grow the stream buffer arbitrarily.
A peek that does at most one raw read has the advantage that it can pick up data outside the buffer but lurking in the OS buffer, yet to be obtained. Those data are free, if they're present. (Of course, if they're absent peek() will still block.)
Note my suggestion later that if the above invariant were to be adopted then a peek1() method should be added to parallel read1(). However, from what Benjamin has said, a more likely invariant is going to be:

    preview = f.peek(n)
    f.read(n).startswith(preview)

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------
Nick Coghlan <ncoghlan <at> gmail.com> writes:
Since the latter method will perform as many reads of the underlying stream as necessary to reach the requested number of bytes (or EOF), then so should the former.
How do you propose to implement this while staying compatible with 1) unseekable raw streams 2) the expectation that peek() doesn't advance the logical file pointer?
Note that the current behaviour I get from Python 3.1 is for it to return the *entire* buffer, no matter what number I pass to it:
[...]
That's an outright bug - I've promoted an existing issue about this [1] to a release blocker and sent it to Benjamin to have another look at.
The original docstring for peek() says:

    """Returns buffered bytes without advancing the position.

    The argument indicates a desired minimal number of bytes; we
    do at most one raw read to satisfy it.  We never return more
    than self.buffer_size.
    """

In that light, I'm not sure it's a bug -- although it can certainly look unexpected at first sight. Regards Antoine.
Antoine Pitrou wrote:
The original docstring for peek() says:
...we do at most one raw read to satisfy it.
In that light, I'm not sure it's a bug
It may be behaving according to the docs, but is that behaviour useful? Seems to me that if you're asking for n bytes, then it's because you're doing some kind of parsing that requires lookahead, and nothing less than n bytes will do. I think it would be more useful if the "at most one raw read" part were dropped. That would give it the kind of deterministic behaviour generally expected when dealing with buffered streams. -- Greg
Greg Ewing <greg.ewing <at> canterbury.ac.nz> writes:
I think it would be more useful if the "at most one raw read" part were dropped. That would give it the kind of deterministic behaviour generally expected when dealing with buffered streams.
As I already told Nick: please propose an implementation scheme. Antoine.
On 14Jun2009 12:33, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Antoine Pitrou wrote:
The original docstring for peek() says:
...we do at most one raw read to satisfy it.
In that light, I'm not sure it's a bug
It may be behaving according to the docs, but is that behaviour useful?
Seems to me that if you're asking for n bytes, then it's because you're doing some kind of parsing that requires lookahead, and nothing less than n bytes will do.
I think it would be more useful if the "at most one raw read" part were dropped. That would give it the kind of deterministic behaviour generally expected when dealing with buffered streams.
Is it possible to access the buffer? I see nothing in the docs. People seem to want peek() to be "read() without moving the read offset", which it almost seems to be. Nick and Greg both want it to really be that, and thus do enough raw reads to get "n" bytes; Nick wants a peek1() like read1(), too. It has a pleasing feel to me, too. But ... For myself, I'd expect more often to want to see if there's stuff in the buffer _without_ doing any raw reads at all. A peek0(n), if you will: Read and return up to n bytes without calling on the raw stream. It feels like peek is trying to span both extremes and doesn't satisfy either really well. If peek gets enhanced to act like read in terms of the amount of data returned, should there not be a facility to examine buffered data without raw reads? Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Being on a Beemer and not having a wave returned by a Sportster is like having a clipper ship's hailing not returned by an orphaned New Jersey solid waste barge. - OTL
On 14Jun2009 15:16, I wrote: | Is it possible to access the buffer? I see nothing in the docs. I've just found getvalue() in IOBase. Forget I said anything. It seems to be my day for that kind of post:-( -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ These are but a few examples of what can happen when the human mind is employed to learn, to probe, to question as opposed to merely keeping the ears from touching. - rec.humor.funny 90.07.16
2009/6/14 Cameron Simpson <cs@zip.com.au>:
On 14Jun2009 15:16, I wrote:
| Is it possible to access the buffer? I see nothing in the docs.
I've just found getvalue() in IOBase. Forget I said anything. It seems to be my day for that kind of post:-(
Where are you seeing this? Only BytesIO and StringIO have a getvalue() method. -- Regards, Benjamin
On 14Jun2009 09:21, Benjamin Peterson <benjamin@python.org> wrote:
| 2009/6/14 Cameron Simpson <cs@zip.com.au>:
| > On 14Jun2009 15:16, I wrote:
| > | Is it possible to access the buffer? I see nothing in the docs.
| >
| > I've just found getvalue() in IOBase. Forget I said anything.
| > It seems to be my day for that kind of post:-(
|
| Where are you seeing this? Only BytesIO and StringIO have a getvalue() method.

I had thought I'd traced it by class inheritance. But I got BytesIO and IOBase confused. So: no getvalue then.

So probably there is a case for peek0(), which never does a raw read. Thoughts? -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ I was gratified to be able to answer promptly and I did. I said I didn't know. - Mark Twain
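For concreteness, the peek0/peek distinction could be sketched over a toy buffer (all names and internals here are invented; the real io classes don't expose their buffer like this):

```python
import io

class ToyBuffered:
    """Toy stand-in for a buffered reader, contrasting a hypothetical
    peek0() with a peek() that may hit the raw stream."""

    def __init__(self, raw, chunk=4096):
        self.raw = raw
        self.chunk = chunk
        self.buf = b''

    def peek0(self, n):
        # Never touches the raw stream: only already-buffered bytes.
        return self.buf[:n]

    def peek(self, n):
        # May do raw reads to satisfy the request (the "enhanced" peek).
        while len(self.buf) < n:
            data = self.raw.read(self.chunk)
            if not data:
                break
            self.buf += data
        return self.buf[:n]

b = ToyBuffered(io.BytesIO(b'abc'))
print(b.peek0(2))   # b'' -- nothing buffered yet, no raw read done
print(b.peek(2))    # b'ab' -- a raw read fills the buffer
print(b.peek0(2))   # b'ab' -- now it is buffered
```

The point of peek0() is exactly the first call: it can return nothing even though data is available one raw read away, and that is by design.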
On 15Jun2009 11:48, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
For myself, I'd expect more often to want to see if there's stuff in the buffer _without_ doing any raw reads at all.
What uses do you have in mind for that?
It seems like whenever I want to do some kind of opportunistic but non-blocking stuff with a remote service (eg something I have a packetising connection to, such as a TCP stream) I want to be able to see if there's "more stuff" to gather up before issuing a "flush" operation. And I want to be able to do that in a non-blocking way, much as a Queue has a non-blocking get() method.

As an example, I've got the occasional protocol handler where it has to make a remote query. To avoid deadlock, the stream must be flushed after write()ing the query packet. However, flushing on every such packet is horribly wasteful if you know you have a bunch of them (for example, the caller is asking a bunch of questions). It is far far more efficient to write() each packet without flushes, keep the knowledge that a flush is needed, and flush when there's nothing more pending. That way the lower layer has maximum opportunity to pack data into packets. All that presumes another thread reading responses, which is how I generally write this stuff anyway, otherwise a full buffer will deadlock too. So your dispatch thread inner loop looks like this:

    # single consumer, so Q.empty() implies ok to Q.get()
    needFlush = False
    while not Q.empty():
        P = Q.get()
        if P.needFlush:
            needFlush = True
        out.write(P.serialise())
    if needFlush:
        out.flush()

In this scheme, there _are_ packets that don't need a flush, because nobody is waiting on their response. Anyway, if I were reading from an IO object instead of a Queue I'd want to poll for "buffer empty". If there isn't an empty buffer I know there will be a packet worth of data coming immediately and I can pick it up with regular read()s, just as I'm doing with Q.get() above. But if the buffer is empty I can drop out of the "pick it all up now" loop and flush().
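A self-contained, runnable version of that inner loop (the Packet class, Q and out are stand-ins invented for the sketch):

```python
import io
import queue

class Packet:
    """Stand-in packet type; serialise() and needFlush mirror the sketch."""
    def __init__(self, payload, need_flush=False):
        self.payload = payload
        self.needFlush = need_flush

    def serialise(self):
        return self.payload

Q = queue.Queue()
Q.put(Packet(b'query-1', need_flush=True))   # somebody awaits a response
Q.put(Packet(b'notify'))                     # fire-and-forget
Q.put(Packet(b'query-2', need_flush=True))

out = io.BufferedWriter(io.BytesIO())

# single consumer, so Q.empty() implies ok to Q.get()
needFlush = False
while not Q.empty():
    P = Q.get()
    if P.needFlush:
        needFlush = True
    out.write(P.serialise())
if needFlush:
    out.flush()   # one flush for the whole batch

print(out.raw.getvalue())   # b'query-1notifyquery-2'
```

The three packets cost a single flush instead of two, which is the efficiency argument being made above.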
Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ If you take something apart and put it back together again enough times, you will eventually have enough parts left over to build a second one. - Ayse Sercan <ayse@netcom.com>
Cameron Simpson wrote:
It seems like whenever I want to do some kind of opportunistic but non-blocking stuff with a remote service
Do you actually do this with buffered streams? I find it's better to steer well clear of buffered I/O objects when doing non-blocking stuff, because they don't play well with other things like select(). Anyhow, I wouldn't be opposed to having a way of looking into the buffer, but it should be a separate API -- preferably with a better name than peek0(), which gives no clue at all about what it does differently from peek(). -- Greg
On 16Jun2009 11:21, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Cameron Simpson wrote:
It seems like whenever I want to do some kind of opportunistic but non-blocking stuff with a remote service
Do you actually do this with buffered streams?
Sure, in C, python and perl quite happily. In some circumstances. Provided you're careful about when to fflush() it can all go quite smoothly. It's certainly not applicable to everything.
I find it's better to steer well clear of buffered I/O objects when doing non-blocking stuff, because they don't play well with other things like select().
Ah, the non-blockingness. Well, that's the rub. I normally avoid non-blocking requirements by using threads, so that the thread gathering from the stream can block. Then the consumer can poll a Queue from the worker thread, etc. I really don't like select (/poll/epoll etc) much; aside from select's scaling issues with lots of files (hence poll/epoll), there are high-performance situations where having an event loop using select is a good way to go, but I generally prefer using threads and/or generators to make the code clear (single flow of control, single task for the block of code, etc) if there's no reason not to.
Anyhow, I wouldn't be opposed to having a way of looking into the buffer, but it should be a separate API -- preferably with a better name than peek0(), which gives no clue at all about what it does differently from peek().
Indeed, though arguably read1() is a lousy name too, on the same basis. My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ You can't wait for inspiration. You have to go after it with a club. - Jack London
Cameron Simpson wrote:
On 16Jun2009 11:21, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Cameron Simpson wrote:
It seems like whenever I want to do some kind of opportunistic but non-blocking stuff with a remote service

Do you actually do this with buffered streams?
Sure, in C, python and perl quite happily. In some circumstances. Provided you're careful about when to fflush() it can all go quite smoothly. It's certainly not applicable to everything.
I find it's better to steer well clear of buffered I/O objects when doing non-blocking stuff, because they don't play well with other things like select().
Ah, the non-blockingness. Well that's the rub. I normally avoid non-blocking requirements by using threads, so that the thread gathering from the stream can block. Then the consumer can poll a Queue from the worker thread, etc.
I really don't like select(/poll/epoll etc) much; aside from select's scaling issues with lots of files (hence poll/epoll) there are high performance things where having an event loop using select is a good way to go, but I generally prefer using threads and/or generators to make the code clear (single flow of control, single task for the block of code, etc) if there's no reason not to.
Anyhow, I wouldn't be opposed to having a way of looking into the buffer, but it should be a separate API -- preferably with a better name than peek0(), which gives no clue at all about what it does differently from peek().
Indeed, though arguably read1() is a lousy name too, on the same basis. My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
Can block, but not if you don't want it to. You might just want to see what, if anything, is currently available, up to n bytes.
On 16Jun2009 02:18, MRAB <python@mrabarnett.plus.com> wrote:
My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
Can block, but not if you don't want it to. You might just want to see what, if anything, is currently available, up to n bytes.
Am I missing something? In the face of an _empty_ buffer (which I can't tell from outside) how do I prevent peek() blocking? More generally, if I go peek(n) and if n > bytes_in_buffer_right_now and the raw stream would block if a raw read is done? My concerns would go away if I could probe the buffer content size; then I could ensure peek(n) chose n <= the content size. If that's not enough, my problem - I can choose to read-and-block or go away and come back later. -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ If all around you is darkness and you feel you're contending in vain, then the light at the end of the tunnel is the front of an oncoming train.
Cameron Simpson wrote:
On 16Jun2009 02:18, MRAB <python@mrabarnett.plus.com> wrote:
My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
Can block, but not if you don't want it too. You might just want to see what, if anything, is currently available, up to n bytes.
Am I missing something?
In the face of an _empty_ buffer (which I can't tell from outside) how do I prevent peek() blocking? More generally, if I go peek(n) and if n > bytes_in_buffer_right_now and the raw stream would block if a raw read is done?
My concerns would go away if I could probe the buffer content size; then I could ensure peek(n) chose n <= the content size. If that's not enough, my problem - I can choose to read-and-block or go away and come back later.
I was thinking along the lines of:

    def peek(self, size=None, block=True)

If 'block' is True then return 'size' bytes, unless the end of the file/stream is reached; if 'block' is False then return up to 'size' bytes, without blocking. The blocking form might impose a limit to how much can be peeked (the maximum size of the buffer), or it might enlarge the buffer as necessary.
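Those semantics might be sketched like this (hypothetical; the non-blocking branch is simplified here to return only what is already buffered, and the blocking form enlarges the buffer as necessary):

```python
import io

class PeekBuffer:
    """Toy illustration of the proposed peek(size, block) signature."""

    def __init__(self, raw):
        self.raw = raw
        self.buf = b''

    def peek(self, size=None, block=True):
        if size is None:
            return self.buf          # whatever is buffered right now
        if block:
            # Blocking form: read the raw stream until we hold `size`
            # bytes or hit EOF (the buffer grows as necessary).
            while len(self.buf) < size:
                data = self.raw.read(size - len(self.buf))
                if not data:
                    break
                self.buf += data
        # Non-blocking form falls through: only already-buffered bytes.
        return self.buf[:size]

p = PeekBuffer(io.BytesIO(b'abcdef'))
print(p.peek(2, block=False))   # b'' -- nothing buffered, won't read
print(p.peek(4))                # b'abcd' -- blocking form reads raw
print(p.peek(2, block=False))   # b'ab' -- served from the buffer
```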
MRAB wrote:
I was thinking along the lines of:
    def peek(self, size=None, block=True)

I think this is fine too. :)
If 'block' is True then return 'size' bytes, unless the end of the file/stream is reached; if 'block' is False then return up to 'size' bytes, without blocking. The blocking form might impose a limit to how much can be peeked (the maximum size of the buffer), or it might enlarge the buffer as necessary.
I guess the limit wouldn't be a problem for someone who chose to block further reads.
MRAB wrote:
I was thinking along the lines of:

    def peek(self, size=None, block=True)

If 'block' is True then return 'size' bytes, unless the end of the file/stream is reached; if 'block' is False then return up to 'size' bytes, without blocking....
I tend to prefer zero-ish defaults, how about:

    def peek(self, size=None, nonblocking=False)

We still don't have "at most one read" code, but something a bit like

    data = obj.peek(size=desired, nonblocking=True)
    if len(data) < desired:
        data = obj.peek(size=wanted, nonblocking=False)

might suffice. --Scott David Daniels Scott.Daniels@Acm.Org
On approximately 6/16/2009 11:20 AM, came the following characters from the keyboard of Scott David Daniels:
MRAB wrote:
I was thinking along the lines of:

    def peek(self, size=None, block=True)

If 'block' is True then return 'size' bytes, unless the end of the file/stream is reached; if 'block' is False then return up to 'size' bytes, without blocking....
I tend to prefer zero-ish defaults, how about:

    def peek(self, size=None, nonblocking=False)
No, no, no! Double negatives are extremely easy to not code correctly. The lack of ease of not understanding of code containing double negatives quadruples, at least! Not so differently, I'm sure my sentences here are not easy to understand because I put the inverse logic in them in the places that are not the usual. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Scott David Daniels <Scott.Daniels <at> Acm.Org> writes:
MRAB wrote:
I was thinking along the lines of:

    def peek(self, size=None, block=True)

If 'block' is True then return 'size' bytes, unless the end of the file/stream is reached; if 'block' is False then return up to 'size' bytes, without blocking....
I tend to prefer zero-ish defaults, how about:

    def peek(self, size=None, nonblocking=False)
Since blocking and non-blocking are already used to refer to different types of raw streams, another verb should be found for this option. Antoine.
Cameron Simpson wrote:
Indeed, though arguably read1() is a lousy name too, on the same basis. My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
I guess all the buffer operations should be transparent to the user if that's what he wants, since not many people want tight control over this kind of detail. I think of peek() as an operation that allows me to peek at what's going to show up in the future without affecting further read()s. This kind of behavior is expected by users without prior knowledge of the inner workings of buffered IO. So, if a user _really_ wants to take a look at what's to come without affecting the buffer, we could allow that by doing something like this:

    peek(5, change_buffer=False)

This is an alternative to peek0(). But I am ok with peek0() too.
Cameron Simpson wrote:
I normally avoid non-blocking requirements by using threads, so that the thread gathering from the stream can block.
If you have a thread dedicated to reading from that stream, then I don't see why you need to peek into the buffer. Just have it loop reading a packet at a time and put each completed packet in the queue. If several packets arrive at once, it'll loop around that many times before blocking.
arguably read1() is a lousy name too, on the same basis.
Certainly.
My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
My problem with the idea of looking into the buffer is that it crosses levels of abstraction. A buffered stream is supposed to behave the same way as a deterministic non-buffered stream, with the buffer being an internal optimisation detail that doesn't exist as far as the outside world is concerned. Sometimes it's pragmatic to break the abstraction, but it should be made very obvious when you're doing that. So I'd prefer something like peek_buffer() to make it perfectly clear what's being done. Anything else such as peek() that doesn't explicitly mention the buffer should fit into the abstraction properly. -- Greg
Greg Ewing <greg.ewing <at> canterbury.ac.nz> writes:
Anything else such as peek() that doesn't explicitly mention the buffer should fit into the abstraction properly.
peek() doesn't "fit into the abstraction" since it doesn't even exist on raw streams. While buffered and non-buffered streams have a reasonably similar API, expecting them to behave the same in all circumstances is IMO unrealistic. Antoine.
On 17Jun2009 10:55, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Cameron Simpson wrote:
I normally avoid non-blocking requirements by using threads, so that the thread gathering from the stream can block.
If you have a thread dedicated to reading from that stream, then I don't see why you need to peek into the buffer. Just have it loop reading a packet at a time and put each completed packet in the queue. If several packets arrive at once, it'll loop around that many times before blocking.
Yes, this is true. But people not using threads, or at any rate not dedicating a thread to the reading task, don't have such luxury. Are we disputing the utility of being able to ask "how much might I read/peek without blocking"? Or disputing the purpose of peek, which feels to me like it should/might ask that question, but doesn't. [...]
My itch is that peek() _feels_ like it should be "look into the buffer" but actually can block and/or change the buffer.
My problem with the idea of looking into the buffer is that it crosses levels of abstraction. A buffered stream is supposed to behave the same way as a deterministic non-buffered stream, with the buffer being an internal optimisation detail that doesn't exist as far as the outside world is concerned.
Sometimes it's pragmatic to break the abstraction, but it should be made very obvious when you're doing that. So I'd prefer something like peek_buffer() to make it perfectly clear what's being done.
Anything else such as peek() that doesn't explicitly mention the buffer should fit into the abstraction properly.
It's perfectly possible, even reasonable, to avoid talking about the buffer at all. I'd be happy not to mention the buffer. For example, one can readily imagine the buffered stream being capable of querying its input raw stream for "available now" data. The raw stream can sometimes know whether a read of a given size would block, or can be asked what size read will not block. As a concrete example, the UNIX FIONREAD ioctl can generally query a file descriptor for instantly-available data (== in the OS buffer). I've also used UNIXen where you can fstat() a pipe and use the st_size field to test for available unread data in the pipe buffer. Raw streams which can't do that would return 0 (== can't guarantee any non-blocking data) unless the stream itself also had a buffer of its own and it wasn't empty. So I would _want_ the spec for available_data() (new lousy name) to talk about "data available without blocking", allowing the implementation to use data in the IO buffer and/or to query the raw stream, etc. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ For those who understand, NO explanation is needed, for those who don't understand, NO explanation will be given! - Davey D <decoster@vnet.ibm.com>
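The FIONREAD query Cameron mentions can be done from Python with the fcntl module. This is a UNIX-only sketch (termios.FIONREAD is not available on Windows, and not every descriptor type supports the ioctl); the available_bytes name is just for illustration:

```python
import array
import fcntl
import os
import termios

def available_bytes(fd):
    """Ask the OS how many bytes can be read right now without blocking.
    Works for pipes, sockets and ttys on most UNIXen."""
    buf = array.array('i', [0])
    fcntl.ioctl(fd, termios.FIONREAD, buf)  # mutates buf in place
    return buf[0]

r, w = os.pipe()
os.write(w, b"hello")
n = available_bytes(r)       # 5 bytes are sitting in the pipe buffer
os.read(r, n)                # drain them
empty = available_bytes(r)   # now 0: a further read would block
```

A raw stream class could expose exactly this, falling back to 0 when the descriptor doesn't support the query, which matches the "can't guarantee any non-blocking data" default described above.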
Cameron Simpson wrote:
But people not using threads, or at any rate not dedicating a thread to the reading task, don't have such luxury.
But without a dedicated thread you need to use select() or poll(), and then buffering causes other headaches.
Are we disputing the utility of being able to ask "how much might I read/peek without blocking"?
I'm saying that I don't see how I would make use of such a thing, so I probably wouldn't mind if it didn't exist.
Or disputing the purpose of peek, which feels to me like it should/might ask that question, but doesn't.
I think what I'm saying is that there are two distinct use cases being talked about for a peek-like operation, and different people seem to have different ideas on which one should be mapped to the name "peek". So perhaps they should both be given more-explicit names.
It's perfectly possible, even reasonable, to avoid talking about the buffer at all. I'd be happy not to mention the buffer.
Even if you don't mention it explicitly, its existence shows through in the fact that there is an arbitrary limit on the amount you can peek ahead, and that limit needs to be documented so that people can write correct programs. This is true of both kinds of peeking, so I concede that they both break the abstraction. However I think the non-blocking peek breaks it more than the blocking one, because it also brings non-deterministic behaviour. -- Greg
Greg Ewing wrote:
Even if you don't mention it explicitly, its existence shows through in the fact that there is an arbitrary limit on the amount you can peek ahead, and that limit needs to be documented so that people can write correct programs.
This is true of both kinds of peeking, so I concede that they both break the abstraction.
However I think the non-blocking peek breaks it more than the blocking one, because it also brings non-deterministic behaviour.
It depends on the point of view. For example, someone is writing a program that must read from any kind of file descriptor and generate the derivation tree of the text read, based on some context-free grammar. The problem is that the chosen method to accomplish it would read 2 symbols (bytes) ahead, and this guy is using peek() to grab these 2 bytes. The program will seem to work correctly most of the time, but on the 4095th byte read, he would grab 1 byte at most using peek() (despite the existence of, say, 10k bytes ahead). The blocking definition of peek() would create this hard-to-spot bug; thus, for him, this behavior would seem non-deterministic. On the other hand, if someone is conscious of the number of raw reads, he would definitely be willing to look in the documentation for any parameters that match his special needs. That's why the non-blocking behavior should be the default one, while the blocking behavior should be accessible by choice.
Lucas P Melo wrote:
The problem is that the chosen method to accomplish it would read 2 symbols (bytes) ahead and this guy is using peek() to grab these 2 bytes. The program will seem to work correctly most of the time, but on the 4095th byte read, he would grab 1 byte at most using peek()
That's exactly why I think the blocking version should keep reading until the requested number of bytes is available (or the buffer is full or EOF occurs). In other words, the blocking version should be fully deterministic given knowledge of the buffer size. -- Greg
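The fully deterministic blocking peek Greg describes could be sketched like this (a simplified stand-in class, not the real io.BufferedReader; the 4096-byte buffer size is an assumption):

```python
import io

class DeterministicPeeker:
    """Sketch of the proposed blocking peek(): keep reading from the raw
    stream until n bytes are buffered, the buffer is full, or EOF."""

    def __init__(self, raw, buffer_size=4096):
        self._raw = raw
        self._size = buffer_size
        self._buf = b""

    def peek(self, n):
        want = min(n, self._size)  # deterministic, given the buffer size
        while len(self._buf) < want:
            chunk = self._raw.read(want - len(self._buf))
            if not chunk:          # EOF: return what we have
                break
            self._buf += chunk
        # The logical stream position never advances.
        return self._buf[:want]

p = DeterministicPeeker(io.BytesIO(b"abcdef"))
assert p.peek(2) == b"ab"    # always 2 bytes when 2 are available
assert p.peek(4) == b"abcd"  # repeated peeks never advance the position
```

With these semantics the result of peek(n) depends only on the stream contents and the buffer size, never on the history of previous reads, which is exactly the determinism being argued for.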
Greg Ewing wrote:
That's exactly why I think the blocking version should keep reading until the requested number of bytes is available (or the buffer is full or EOF occurs).
Do you mean that the blocking version should keep waiting for new bytes until they show up? This would not be acceptable, since the program would hang forever most of the time (no changes to the buffer would ever occur in this situation when there's only the main thread running).
Am I understanding this correctly:
* The blocking version would not do any raw reads.
* The non-blocking version would.
Frederick Reeve <cylix <at> solace.info> writes:
peek(n): If n is less than 0, None, or not set, return the buffer contents without advancing the stream position; if the buffer is empty, read a full chunk and return the buffer. Otherwise return exactly n bytes, up to the _chunk size_ (not the buffer contents), without advancing the stream position. If the buffer holds fewer than n bytes, buffer an additional chunk from the "raw" stream beforehand. If EOF is encountered during any raw read, return as much as we can, up to n. (maybe I should write that in code form??)
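Written in code form, the proposal might look like the following sketch. The ProposedReader class, its 4096-byte chunk size, and the fallback behaviour are illustrative assumptions, not the real io.BufferedReader implementation:

```python
import io

CHUNK = 4096  # assumed chunk/buffer size

class ProposedReader:
    """Sketch of the proposed peek() semantics."""

    def __init__(self, raw):
        self._raw = raw
        self._buf = b""

    def peek(self, n=None):
        if n is None or n < 0:
            # Return buffer contents; if empty, buffer one full chunk first.
            if not self._buf:
                self._buf = self._raw.read(CHUNK)
            return self._buf
        n = min(n, CHUNK)  # capped at the chunk size, not buffer contents
        if len(self._buf) < n:
            # Buffer an additional chunk from the raw stream beforehand.
            self._buf += self._raw.read(CHUNK)
        # May return fewer than n bytes only at EOF; position never advances.
        return self._buf[:n]

r = ProposedReader(io.BytesIO(b"x" * 10000))
assert len(r.peek(74)) == 74   # a 74-byte peek is guaranteed mid-stream
assert len(r.peek()) >= 74     # whole buffer, stream position unchanged
```

Under these semantics the multipart-boundary check from the earlier parser example (s.peek(74) followed by a conditional seek) works regardless of how much happens to be left in the buffer.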
This proposal looks reasonable to me. Please note that it's too late for 3.1 anyway - we're in release candidate phase. Once you have a patch, you can post it on the bug tracker. Regards Antoine.
On Sat, 13 Jun 2009 12:33:46 +0000 (UTC) Antoine Pitrou <solipsis@pitrou.net> wrote:
This proposal looks reasonable to me. Please note that it's too late for 3.1 anyway - we're in release candidate phase. Once you have a patch, you can post it on the bug tracker.
Thanks I will do that. Sometime in the next couple of weeks. Gratefully Frederick
participants (10)
- Antoine Pitrou
- Benjamin Peterson
- Cameron Simpson
- Frederick Reeve
- Glenn Linderman
- Greg Ewing
- Lucas P Melo
- MRAB
- Nick Coghlan
- Scott David Daniels