[Python-ideas] Prefetching on buffered IO files

Antoine Pitrou solipsis at pitrou.net
Tue Sep 28 16:32:49 CEST 2010


On Tuesday, September 28, 2010 at 07:08 -0700, Guido van Rossum wrote:
> 
> Thanks for the long explanation. I have some further questions:
> 
> It seems this won't make any difference for a truly unbuffered stream,
> right? A truly unbuffered stream would not have a buffer where it
> could save the bytes that were prefetched past the stream position, so
> it wouldn't return any optional extra bytes, so there would be no
> speedup.

Indeed. But you can trivially wrap an unbuffered stream inside a
BufferedReader, and get peek() even when the raw stream is unseekable.
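
For instance (a quick sketch; the OneWayStream class below is made up
to stand in for any unseekable raw stream, such as a pipe or socket):

    import io

    class OneWayStream(io.RawIOBase):
        """A raw, unseekable stream: bytes read are gone for good."""
        def __init__(self, data):
            self._data = data

        def readable(self):
            return True

        def seekable(self):
            return False                  # like a pipe or a socket

        def readinto(self, b):
            n = min(len(b), len(self._data))
            b[:n] = self._data[:n]
            self._data = self._data[n:]
            return n

    buf = io.BufferedReader(OneWayStream(b"hello world"))
    print(buf.peek(5))    # look ahead without consuming (may return
                          # more than 5 bytes, e.g. b'hello world')
    print(buf.read(5))    # b'hello' -- now the bytes are consumed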

> And for a buffered stream, it would be much simpler to just
> read ahead in large chunks and seek back once you've found the end.

Well, no: that only works if your stream is seekable and seek() is fast
enough. It wouldn't work on a SocketIO, for example (even one wrapped
inside a BufferedReader, since BufferedReader refuses to seek() if the
raw stream's seekable() returns False).
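
Continuing the sketch above, the refusal is easy to see:

    try:
        buf.seek(-3, io.SEEK_CUR)    # try to step back 3 bytes
    except io.UnsupportedOperation:
        pass                         # raised: the raw stream is unseekable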

> I
> wonder if it wouldn't be better to add an extra buffer to GzipFile so
> small seek() and read() calls can be made more efficient?

The problem is that, since the unpickler's buffer and the GzipFile's
buffer are not aware of each other, the unpickler could easily seek()
backwards past the start of the current GzipFile buffer, forcing
GzipFile to fall back on its slow seek algorithm.
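
A sketch of the slow path (with a hypothetical "data.gz"):

    import gzip

    f = gzip.GzipFile("data.gz", "rb")
    f.read(1000000)            # decompress the first megabyte
    pos = f.tell()
    f.seek(pos - 10)           # a 10-byte step backwards, but...
    # GzipFile cannot step back inside the compressed stream: it rewinds
    # the underlying file to the start and re-decompresses roughly a
    # megabyte just to land 10 bytes earlier.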

The "extra buffer" can trivially consist in wrapping the GzipFile inside
a BufferedReader (which is actually recommended if you want e.g. very
fast readlines()), but it doesn't solve the above issue.
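
That is, something like (again with a hypothetical "data.gz";
process() is just a placeholder):

    import gzip
    import io

    raw = gzip.GzipFile("data.gz", "rb")
    f = io.BufferedReader(raw)     # adds peek(), batches small reads
    for line in f:                 # readline() is served from the
        process(line)              # buffer instead of the decompressor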

> In fact, this makes me curious as to the use that unpickling can make
> of the prefetch() call -- I suppose you had to implement some kind of
> layer on top of prefetch() that behaves more like a plain unbuffered
> file?

I didn't implement prefetch() at all. That would be premature :)
But, if the stream had prefetch(), the unpickling code would be simpler: I
would only have to call prefetch() once when refilling the buffer,
rather than making two read()s followed by a peek().
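
In pseudo-code (nothing here exists today: prefetch() is hypothetical,
and body_size() is a made-up helper standing in for the unpickler's
framing logic):

    # Today's refill, simplified: two read()s followed by a peek().
    def refill_today(stream, header_size):
        header = stream.read(header_size)
        body = stream.read(body_size(header))   # body_size() is made up
        extra = stream.peek()                   # opportunistic look-ahead
        return header, body, extra

    # With a hypothetical prefetch(buffer, skip, minimum): skip `skip`
    # already-consumed bytes, then write at least `minimum` bytes into
    # `buffer` -- and opportunistically more, if cheaply available.
    def refill_with_prefetch(stream, buffer, skip, minimum):
        return stream.prefetch(buffer, skip, minimum)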

(I could try to coalesce the two reads, but it would complicate the code
a bit more...)

> I want to push back on this more, primarily because a new primitive
> I/O operation has high costs: it can never be removed, it has to be
> added to every stream implementation, developers need to learn to use
> the new operation, and so on.

I agree with this (except that most developers wouldn't really need to
learn to use it: common uses of readable files are content with read()
and readline(), and need neither peek() nor prefetch()). I don't intend
to push this for 3.2; I'm floating the idea now, with a hypothetical
landing in 3.3 if it proves useful.

> Also, if you can believe the multi-core crowd, a very different
> possible future development might be to run the gunzip algorithm and
> the unpickle algorithm in parallel, on separate cores. Truly such a
> solution would require totally *different* new I/O primitives, which
> might have a higher chance of being reusable outside the context of
> pickle.

Well, that's a bit of a pie-in-the-sky perspective :)
Furthermore, such a solution wouldn't improve CPU efficiency, so if your
workload is already able to saturate all CPU cores (which it can easily
do if you are in a VM, or have several busy daemons), it wouldn't buy
you anything.

Regards

Antoine.
