[Python-ideas] BufferedIO and detach

Mon Mar 4 07:44:27 CET 2013

On 4 March 2013 18:50, Guido van Rossum <guido at python.org> wrote:
> On Sun, Mar 3, 2013 at 6:31 PM, Robert Collins
> <robertc at robertcollins.net> wrote:
>> On 4 March 2013 05:16, Benjamin Peterson <benjamin at python.org> wrote:
>>> Robert Collins <robertc at ...> writes:
>>>
>>>>
>>>> There doesn't seem to be a way to safely use detach() on stdin - I'd
>>>> like to get down to the raw stream, but after calling detach(), the
>>>> initial BufferedReader is unusable (so you cannot retrieve any
>>>> buffered content) - and unless you detach(), you can't guarantee that
>>>> the buffer will ever be empty.
>>>
>>> Presumably if you call it before anyone else has had a chance to read from it,
>>> you should be okay.
>>
>> That's hard to guarantee in the general case: consider a library
>> utility that accepts an input stream. To make it concrete, consider
>> dispatching to different processors based on the first few bytes of a
>> stream: you'd have to force raw IO handling everywhere, rather than
>> just the portion of code that needs it...
>
> The solution would seem obvious: detach before reading anything from the stream.
>
> But apparently you're trying to come up with a reason why that's not
> enough. I think you're concerned about the situation where you have a
> stream of uncertain origin, and you want to switch to raw, unbuffered
> I/O. You realize that some of the bytes you are interested in might
> already have been read into the buffer. So you want access to the
> contents of the buffer.

Yes, exactly. A little more context on how I came to ask the question:
I wanted to accumulate all input on an arbitrary stream within 5ms,
without blocking for longer. Using raw IO + select, it's possible to
loop, reading one byte at a time. The io module doesn't have an API
(that I could find) for putting an existing stream into non-blocking
mode, so reading a larger amount and taking what is returned isn't
viable.
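
A rough sketch of that raw IO + select loop, assuming a Unix-like
platform and a plain file descriptor. The name read_within and the
default timeout are only illustrative, not an existing API:

```python
import os
import select
import time


def read_within(fd, timeout=0.005):
    """Collect whatever arrives on fd before `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # Wait only as long as the overall deadline still allows.
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            break  # nothing more arrived in time
        byte = os.read(fd, 1)  # one byte at a time, straight from the fd
        if not byte:
            break  # EOF
        chunks.append(byte)
    return b"".join(chunks)
```

This only behaves as intended when nothing has already been pulled
into a Python-level buffer, which is the whole problem.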

However, without raw I/O, select() will time out even when data is
available, because it consults the underlying file descriptor and
bypasses the buffer. So the only reason to want raw I/O is to be able
to use select reliably. An alternative would be being able to drain
the buffer with no underlying I/O calls at all, then use select +
read1, then rinse and repeat.
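
A small demonstration of the select problem, assuming a Unix-like
platform and CPython's usual behaviour of filling the buffer on the
first read (the pipe and the variable names are just for illustration):

```python
import os
import select

r, w = os.pipe()
os.write(w, b"0123456789")

buffered = open(r, "rb")   # io.BufferedReader over the pipe's read end
first = buffered.read(1)   # returns 1 byte, but the buffer takes all 10

# The pipe itself is now empty, so select() reports nothing readable...
ready, _, _ = select.select([buffered], [], [], 0)

# ...even though 9 bytes are still sitting in the Python-level buffer.
rest = buffered.read(9)

buffered.close()
os.close(w)
```

Here ready comes back empty while rest still holds the remaining
bytes, so a select-based loop would give up too early.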

> When the io module was originally designed, this was actually one of
> the (implied) use cases -- one reason I wanted to stop using C stdio
> was that I didn't like that there is no standard way to get at the
> data in the buffer, in similar use cases as you're trying to present.
> (A use case I could think of would be an http server that forks a
> subprocess after reading e.g. the first line of the http request, or
> perhaps after the headers.)

That's a very similar case, as it happens - protocol handling is
present in my use case too.

> It seems that when the io module was rewritten in C for speed (and
> I am very grateful that it was, the Python version was way too slow)
> this use case, being pretty rare, was forgotten. In specific use cases
> it's usually easy enough to just open the file unbuffered, or detach
> before reading anything.
>
> Can you write C code? If so, perhaps you can come up with a patch.
> Personally, I'm not sure that your proposed API (a buffered_only flag
> to read()) is the best way to go about it. Maybe detach() should
> return the remaining buffered data? (Perhaps only if a new flag is
> given.)
>
> FWIW I think it's also possible that some of the data has made it into
> the text wrapper already, so you'll have to be able to extract it from
> there as well. (Good luck.)

I can write C code, and if evolving the API is acceptable (it sounds
like it is) I'll be more than happy to make a patch.

Some variations I can think of...

The buffer_only flag I suggested, on readinto, read1, read etc.

Have detach return the buffered data as you suggest - that would be
incompatible unless we stash it on the raw object somewhere, or do
something along those lines.

A read0 - analogous to read1, returns data from the buffer, but
guarantees no underlying calls.

I think exposing the buffer more explicitly is a good principle,
independent of whether we change detach or not.

> --
> --Guido van Rossum (python.org/~guido)

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services