[Python-ideas] Hooks into the IO system to intercept raw file reads/writes

Mon Feb 2 18:10:12 CET 2015

On 2 February 2015 at 16:31, Guido van Rossum <guido at python.org> wrote:
> I'm all for flexible I/O processing, but I worry that the idea brought up
> here feels a little half-baked. First of all, it seems to mention two
> separate cases of subclassing (both io.RawIOBase and subprocess.Popen).
> These days, subclassing(*) is often an anti-pattern: unless done with
> considerable foresight, every detail of the base class implementation
> essentially becomes part of the interface that the subclass relies upon, and
> now the base class becomes too constrained in its evolution. In my
> experience, a well-done API is usually much easier to evolve than even a
> very-well-done base class.
>
> The other thing is that I can't actually imagine the details of your
> proposal. Is the idea that you subclass RawIOBase to implement "tee"
> behavior? Why can't you do that at the receiving end? Is perhaps the
> proposal to assign the base object a work-around for an interface design in
> the Popen class? (I'm sure that class is far from perfect -- but it's also
> super constrained by the need to support Windows process creation.)

The idea is certainly a little half-baked :-( And you're absolutely
right that it's strongly linked to a fight to work around limitations
of subprocess.Popen. The suggestion originally came out of a couple of
things I've been working on, one of which was trying to make a Popen
call that captured the stdout/stderr streams while still displaying
them (as you say, a "tee" type of mechanism).

It's certainly possible to do the "tee" at the receiving end, but
(because of the aforementioned Popen limitations) doing so requires
ignoring the convenience of communicate() and writing your own capture
code. That's not *too* hard using threads, but Popen avoids threads on
Unix and uses a select loop instead. I'm not clear why it does that, or
whether my solution will break in the situations the Popen code is
covering via the select loop. Also, getting corner cases in the capture code
right (around encodings in particular) is something I'd prefer to
leave to subprocess :-) The original issue was for a PR for a project
that works on a lot of platforms I don't have access to, so I may well
have been worrying too much about "not breaking stuff" :-)
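
For concreteness, the "tee at the receiving end using threads" approach
might look roughly like the sketch below. This is only an illustration
under stated assumptions, not the code from the PR: the names tee_pipe and
run_and_tee are made up for this example, it works on raw bytes (so it
sidesteps the encoding corner cases rather than solving them), and it has
not been checked against the situations the select loop in subprocess is
meant to cover.

```python
import subprocess
import sys
import threading


def tee_pipe(pipe, sink, chunks):
    """Copy bytes from pipe to sink as they arrive, keeping a copy in chunks."""
    # read1() returns as soon as some data is available, so output is
    # echoed promptly instead of waiting for a full 4096-byte block.
    for data in iter(lambda: pipe.read1(4096), b""):
        chunks.append(data)
        sink.write(data)
        sink.flush()
    pipe.close()


def run_and_tee(args, stdout_sink=None, stderr_sink=None):
    """Run args, echoing stdout/stderr while also capturing them.

    Returns (returncode, captured_stdout, captured_stderr) as bytes.
    The sinks default to this process's own stdout/stderr.
    """
    if stdout_sink is None:
        stdout_sink = sys.stdout.buffer
    if stderr_sink is None:
        stderr_sink = sys.stderr.buffer

    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_chunks, err_chunks = [], []
    # One thread per pipe, so a full stderr pipe cannot block stdout
    # (or the other way round).
    threads = [
        threading.Thread(target=tee_pipe, args=(proc.stdout, stdout_sink, out_chunks)),
        threading.Thread(target=tee_pipe, args=(proc.stderr, stderr_sink, err_chunks)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    returncode = proc.wait()
    return returncode, b"".join(out_chunks), b"".join(err_chunks)
```

Calling run_and_tee([sys.executable, "-c", "print('hi')"]) would echo the
child's output and also return it. Everything communicate() normally
handles (stdin input, timeouts, text decoding) is left out here, which is
the part I'd rather not have to reimplement.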

This proposal basically came from a feeling that if only I could "see"
the data as it flows through the buffers of an existing io stream, I
wouldn't have all these problems. Originally I was going to suggest a
"buffer filled" type of callback. With such a hook, I was thinking I
could do something like:

import os, sys
from subprocess import PIPE, Popen

p = Popen(..., stdout=PIPE, stderr=PIPE)
# Not sure if these need to be at the raw IO level or the buffered IO
# level. Should be called every time an OS read happens.
# (add_buffer_watcher is the proposed hook; it does not exist today.)
p.stdout.buffer.add_buffer_watcher(
    lambda buf: os.write(sys.stdout.fileno(), buf))
p.stderr.buffer.add_buffer_watcher(
    lambda buf: os.write(sys.stderr.fileno(), buf))

I guess that's a cleaner proposal, although I pretty much assumed that
the overhead of such a hook being checked for on every buffer read
would be unacceptable. So I came up with a clumsier approach based on
trying to make it so you only paid the cost if you used the feature.
Overall, that was probably a mistake :-(
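
To illustrate the "only pay if you use it" idea, here is a minimal sketch
of a raw stream wrapper that reports every chunk it reads to a callback.
The class name WatchedRaw and its watcher argument are hypothetical,
invented for this example. It uses only the existing io.RawIOBase
interface and says nothing about how Popen could be persuaded to use such
an object, which is the part that remains unsolved.

```python
import io


class WatchedRaw(io.RawIOBase):
    """Hypothetical wrapper: forwards reads to raw and reports each chunk to watcher."""

    def __init__(self, raw, watcher):
        super().__init__()
        self._raw = raw
        self._watcher = watcher

    def readable(self):
        return True

    def readinto(self, b):
        n = self._raw.readinto(b)
        # n is None when a non-blocking stream has no data and 0 at EOF;
        # in both cases there is nothing to report.
        if n:
            self._watcher(bytes(b[:n]))
        return n

    def close(self):
        if not self.closed:
            self._raw.close()
        super().close()


# Usage: io.BytesIO stands in for a real raw stream here.
seen = []
reader = io.BufferedReader(WatchedRaw(io.BytesIO(b"some output\n"), seen.append))
data = reader.read()
```

Streams that are not wrapped pay nothing, and the watcher sees the data at
the point where it comes off the underlying stream rather than when the
buffered layer hands it on.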

I hope it's clearer now.

Paul