[Python-ideas] Prefetching on buffered IO files

Tue Sep 28 22:33:39 CEST 2010

On Tue, 28 Sep 2010 09:44:38 -0700
Guido van Rossum <guido at python.org> wrote:
> 
> But AFAICT unpickle doesn't use seek()?
> 
> [...]
> > But, if the stream had prefetch(), the unpickling would be simplified: I
> > would only have to call prefetch() once when refilling the buffer,
> > rather than two read()'s followed by a peek().
> >
> > (I could try to coalesce the two reads, but it would complicate the code
> > a bit more...)
> 
> Where exactly would the peek be used? (I must be confused because I
> can't find either peek or seek in _pickle.c.)

Neither peek() nor seek() is used currently (in SVN). Each of them is
used in one of the two prefetching approaches proposed to solve the
unpickling performance problem.

(the first approach uses seek() and read(), the second approach uses
read() and peek(); as already explained, I tend to consider the second
approach much better, and the prefetch() proposal comes in part from the
experience gathered on that approach)
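
To illustrate the second approach, here is a rough Python sketch. It is
not the proposed patch: the real code is C in _pickle.c and differs in
detail, and PrefetchReader, _refill() and finish() are hypothetical names
made up for this example. It assumes the underlying stream is an
io.BufferedReader. Refilling takes two read() calls followed by a peek(),
and the stream position never moves past bytes the consumer has not asked
for:

```python
import io


class PrefetchReader:
    """Hypothetical helper: serve small reads from a peeked buffer."""

    def __init__(self, stream):
        self._f = stream   # assumed to be an io.BufferedReader
        self._buf = b""    # bytes peeked, not yet consumed from the stream
        self._pos = 0      # how many bytes of _buf were handed out

    def read(self, n):
        end = self._pos + n
        if end <= len(self._buf):
            # Fast path: no method call on the underlying stream.
            data = self._buf[self._pos:end]
            self._pos = end
            return data
        return self._refill(n)

    def _refill(self, n):
        leftover = self._buf[self._pos:]
        if self._buf:
            # First read(): advance the stream past the peeked bytes.
            self._f.read(len(self._buf))
        # Second read(): fetch the part of the request not in the buffer.
        data = leftover + self._f.read(n - len(leftover))
        # peek(): prefetch without moving the stream position.
        self._buf = self._f.peek()
        self._pos = 0
        return data

    def finish(self):
        # Consume only what was handed out, so that data following the
        # pickle stays readable from the underlying stream.
        if self._pos:
            self._f.read(self._pos)
        self._buf = b""
        self._pos = 0


if __name__ == "__main__":
    f = io.BufferedReader(io.BytesIO(b"abcdefghij"))
    r = PrefetchReader(f)
    print(r.read(2), r.read(3))
    r.finish()
    print(f.read())
```

A single prefetch() call, as proposed, would replace the three calls in
_refill().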

> It still seems to me that the "right" way to solve this would be to
> insert a transparent extra buffer somewhere, probably in the GzipFile
> code, and work on reducing the call overhead.

No, because if you don't have any buffering on the unpickling side
(rather than the GzipFile or the BufferedReader side), then you still
have the method call overhead no matter what. And this overhead is
rather big when you're reading data byte by byte, or word by word
(which unpickling very frequently does).

(for the record, GzipFile already has an internal buffer. But calling
GzipFile.read() still has a large overhead compared to reading
data directly from a prefetch buffer inside the unpickler object)
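
The per-call overhead can be observed with a small pure-Python
experiment. This is only an illustration under stated assumptions: the
unpickler itself is written in C, the function names and the sample
payload below are made up for the example, and the timings will vary
between machines. The first function pays for one read() call per byte
even though the stream is buffered; the second one reads large chunks
and walks them locally:

```python
import io
import timeit

DATA = bytes(range(256)) * 400  # arbitrary sample payload


def sum_via_read_calls(data):
    """Read one byte per read() call on a buffered stream."""
    f = io.BufferedReader(io.BytesIO(data))
    total = 0
    while True:
        b = f.read(1)
        if not b:
            break
        total += b[0]
    return total


def sum_via_local_buffer(data):
    """Read large chunks, then iterate over each chunk locally."""
    f = io.BufferedReader(io.BytesIO(data))
    total = 0
    while True:
        chunk = f.read(8192)
        if not chunk:
            break
        for byte in chunk:
            total += byte
    return total


if __name__ == "__main__":
    for fn in (sum_via_read_calls, sum_via_local_buffer):
        elapsed = timeit.timeit(lambda: fn(DATA), number=5)
        print(fn.__name__, elapsed)
```

Both functions compute the same sum; only the number of method calls on
the stream differs.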

Regards

Antoine.