
Greetings,

As I'm sure you all know, there are currently two implementations of the io module: one in Python and one much faster implementation in C. As I recall, the Python version was used in Python 3.0 and the C version is now used by default in 3.1.x. The behavior of the two differs in some ways, especially regarding io.BufferedReader.peek().

I wrote an email to the authors of the new C code last Friday and sent a copy to the python list for comments. Antoine Pitrou suggested I bring up what I had asked there either here or as a bug report. I elected to write here because I am not sure it constitutes a bug. In my earlier email I stated I was willing to submit patches if the old behavior was wanted back and the code's author was fine with the changes but did not want to implement them himself. Antoine said this: "If people need more sophisticated semantics, I'm open to changing peek() to accommodate it." (Antoine: if I have quoted you wrongly, you are free to chastise me.)

So my basic question is: the behavior of io.BufferedReader.peek() has changed; should that change stay as is, be reverted, or be something different entirely? Here are the two behaviors.

The Python version of io.BufferedReader.peek() behaves as: if the buffer holds less than the requested amount (up to the buffer size), read the difference from the raw stream, or up to EOF, into the buffer, then return the requested number of bytes from the start of the buffer. This may advance the raw stream but not the local stream position. This version can guarantee a peek of one chunk (4096 bytes here).

The C version behaves as: if the buffer holds 0 bytes, fill it from the raw stream, or up to EOF, then return whatever is in the buffer. This may also advance the raw stream but not the local stream position. This version cannot guarantee a peek of more than 1 byte if reads of arbitrary length are being used at all and not tracked. (A short illustration of this follows the two examples below.)

Neither case limits what is possible, though in my opinion one makes it easier to accomplish certain things and is more efficient in those cases. Take the following two parser examples, where s is a stream wrapped in io.BufferedReader (with no negative seeks in most cases) and f is an output file handle or such.

Python version work flow:

    import re

    are = re.compile(b'(\r\n|\r|\n)')
    while True:
        d = s.peek(4096)  # chunk size or so
        if not d:
            break  # EOF
        found = are.search(d)
        if found:
            w = d[:found.start()]
            s.seek(f.write(w), 1)  # advance past what was written (relative seek)
            p = s.peek(74)
            if p.startswith(multipart_boundary):
                s.seek(len(multipart_boundary), 1)
                # other code containing more possible splits
                # across boundaries
                continue
            w = d[found.start():found.end()]
            s.seek(f.write(w), 1)
            continue
        s.seek(f.write(d), 1)
        # more code
        continue

C version work flow:

    import re

    old = b''
    are = re.compile(b'(\r\n|\r|\n)')
    while True:
        d = old if old != b'' else s.read1(4096)
        if not d:
            break  # EOF
        found = are.search(d)
        if found:
            w = d[:found.start()]
            f.write(w)
            w = d[found.start():]
            # make sure there are at least 74 bytes to test against
            p = w if len(w) >= 74 else w + s.read(73)
            if p.startswith(multipart_boundary):
                # other code containing more possible splits
                # across boundaries and joins to p
                old = ???
                continue
            f.write(d[found.start():found.end()])
            old = p[found.end() - found.start():]  # carry the unconsumed tail over
            continue
        old = b''
        f.write(d)
        # more code
        continue

These two examples are not real code, but they get the point across; they are based on code I put into a multipart parser. The former was written for Python 3.0. I later tried running that parser on 3.1, after the new io layer, and found it broken, then rewrote it against the new interface; that rewrite is somewhat represented in the latter. This is only one example, and others may vary, of course. peek() seems to me to have little use outside of parsers.
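To make the C behavior concrete, here is a small illustration I put together (the BytesIO contents and the sizes are arbitrary, purely for demonstration):

    import io

    raw = io.BytesIO(b'x' * 10000)
    s = io.BufferedReader(raw, buffer_size=4096)

    s.read(4095)              # one raw read fills the buffer; 1 byte is left in it
    print(len(s.peek(4096)))  # prints 1 -- only what was already buffered

Under the old Python implementation, as described above, the same peek(4096) call would have topped the buffer up from the raw stream and returned the full 4096 bytes.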
Thus I used parsers as my example. My opinion is that it would be better to have a peek function in C similar to the old Python implementation, behaving as follows:

peek(n):
    If n is less than 0, None, or not given: return the buffer contents
    without advancing the stream position; if the buffer is empty, read
    a full chunk into it first and return that.
    Otherwise: return exactly n bytes, up to the *chunk size* (not the
    current buffer contents), without advancing the stream position. If
    the buffer holds fewer than n bytes, buffer an additional chunk from
    the "raw" stream beforehand. If EOF is encountered during any raw
    read, return as much as we can, up to n.

(I have written that in code form in the postscript below.) This allows us to obtain the behavior of the current C io implementation easily (call peek() with no argument) and gives us the old Python implementation's behavior when n is given. The basis for this is:

1. Code reduction and simplicity. Looking at the examples, the code reduction should be obvious. The logic needed to maintain a bytestring of the variously required lengths, so that it can be checked, would not be necessary. The need to carry a bytestring over to the next iteration would be done away with as well, and other pieces of data handling would also be simpler.

2. Speed. There would be less handling in the "slower" interpreter if we used the buffer already inside the buffered reader. All the logic mentioned in point 1 would move into the faster C code or be done away with. There is very little need for peek outside of parsers, so the speed of straight read-through and random reads would not have to be affected.

I have other reasons and arguments, but I want to know what everyone else thinks; that will most likely show me what I have missed or am not seeing, if anything. I have babbled enough. Thanks so much for the consideration.

Frederick Reeve
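P.S. Here is the proposed peek() in rough code form. It is a sketch of the idea, not a tested patch, and it borrows the internal names used by the pure Python implementation (_read_buf, _read_pos, raw, buffer_size):

    def peek(self, n=None):
        have = len(self._read_buf) - self._read_pos
        if n is None or n < 0:
            # No size given: keep the current C behavior -- fill the
            # buffer only if it is empty, then return its contents.
            if have <= 0:
                self._read_buf = self.raw.read(self.buffer_size) or b''
                self._read_pos = 0
            return self._read_buf[self._read_pos:]
        # Size given: top the buffer up with one more chunk if needed,
        # so a peek of up to buffer_size bytes can be guaranteed.
        want = min(n, self.buffer_size)
        if have < want:
            chunk = self.raw.read(self.buffer_size)
            if chunk:
                self._read_buf = self._read_buf[self._read_pos:] + chunk
                self._read_pos = 0
        # At EOF we may still hold fewer than `want` bytes; return as
        # much as we can.  The stream position never advances.
        return self._read_buf[self._read_pos:self._read_pos + want]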