[Python-ideas] Prefetching on buffered IO files

Guido van Rossum guido at python.org
Tue Sep 28 16:08:08 CEST 2010


On Tue, Sep 28, 2010 at 5:57 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Mon, 27 Sep 2010 17:39:45 -0700
> Guido van Rossum <guido at python.org> wrote:
>> On Mon, Sep 27, 2010 at 3:41 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> > While trying to solve #3873 (poor performance of pickle on file
>> > objects, due to the overhead of calling read() with very small values),
>> > it occurred to me that the prefetching facilities offered by
>> > BufferedIOBase are not flexible and efficient enough.
>>
>> I haven't read the whole bug but there seem to be lots of different
>> smaller issues there, right?
>
> The bug entry is quite old and at first the slowness had to do with the
> pure Python IO layer. Now the remaining performance difference with
> Python 2 is entirely caused by the following core issue:
>
>> It seems that one (unfortunate)
>> constraint is that reading pickles cannot use buffered I/O (at least
>> not on a non-seekable file) because the API has been documented to
>> leave the file positioned right after the last byte of the pickled
>> data, right?
>
> Right.
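>
> For instance, reading two consecutive pickles from the same stream
> relies on that guarantee:
>
>     import io, pickle
>
>     f = io.BytesIO(pickle.dumps(1) + pickle.dumps(2))
>     assert pickle.load(f) == 1
>     assert pickle.load(f) == 2   # works because load() didn't overshoot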
>
>> > Indeed, if you use seek() and read(), 1) you limit yourself to seekable
>> > files 2) performance can be hampered by very bad seek() performance
>> > (this is true on GzipFile).
>>
>> Ow... I've always assumed that seek() is essentially free, because
>> that's how a typical OS kernel implements it. If seek() is bad on
>> GzipFile, how hard would it be to fix this?
>
> The worst case is backwards seeks. Forward seeks are implemented as a
> simple read(), which makes them O(k) where k is the displacement. For
> buffering applications where k is bounded by the buffer size, this is
> O(1) (still with, of course, a non-trivial multiplier).
>
> Backwards seeks are implemented as rewinding the whole file (seek(0))
> and then reading again up to the requested position, which makes them
> O(n) with n the absolute target position. When your requirement is to
> rewind by a bounded number of bytes in order to undo some readahead,
> this is rather catastrophic.
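>
> As a sketch, the logic in gzip.py is essentially this (simplified):
>
>     def seek(self, offset):              # read mode, whence=0 only
>         if offset < self.offset:
>             self.rewind()                # back to 0, decompressor reset
>         # decompress and throw away data until the target is reached
>         count = offset - self.offset
>         for _ in range(count // 1024):
>             self.read(1024)
>         self.read(count % 1024)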
>
> I don't know how the gzip algorithm works under the hood; my impression
> is that optimizing backwards seeks would mean saving checkpoints of the
> decompressor state and restoring them if needed. It doesn't sound like a
> trivial improvement, and would involve tradeoffs w.r.t. the
> performance of sequential reads.
>
>  (I haven't looked at BZ2File, which has a totally different -- and
>  outdated -- implementation)
>
> That's why I would favour the peek() (or peek()-like, as in the prefetch()
> idea) approach anyway. Not only does it work on unseekable files, but
> implementing peek() when you have an internal buffer is quite simple
> (see GzipFile.peek here: http://bugs.python.org/issue9962).
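>
> In general, the pattern can be expressed as a small wrapper (a minimal
> sketch with invented names, not the actual patch):
>
>     class PeekableReader:
>         # Keep already-fetched bytes in an internal buffer, so that
>         # peek() never advances the logical file position.
>         def __init__(self, raw):
>             self.raw = raw
>             self._buf = b""
>
>         def peek(self, n):
>             while len(self._buf) < n:
>                 chunk = self.raw.read(n - len(self._buf))
>                 if not chunk:            # EOF: may return fewer than n
>                     break
>                 self._buf += chunk
>             return self._buf[:n]
>
>         def read(self, n):
>             # Serve from the buffer first, then from the raw stream.
>             data, self._buf = self._buf[:n], self._buf[n:]
>             if len(data) < n:
>                 data += self.raw.read(n - len(data)) or b""
>             return data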
>
> peek() could also be added to BytesIO even though it claims to
> implement RawIOBase rather than BufferedIOBase.
> (But of course, when you have a BytesIO, you can simply feed its
> getvalue() or getbuffer() directly to pickle.loads)
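>
> For example:
>
>     import io, pickle
>
>     bio = io.BytesIO(pickle.dumps([1, 2, 3]))
>     obj = pickle.loads(bio.getbuffer())  # zero-copy view of the contents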
>
>> How common is the use case where you need to read a gzipped pickle
>> *and* you need to leave the unzipped stream positioned exactly at the
>> end of the pickle?
>
> I really don't know. But I don't think we can break the API for a
> special case without potentially causing nasty surprises for the user.
>
> Also, my intuition is that pickling directly from a stream is partly
> meant for cases where you want to access data following the pickle
> data in the stream.
>
>> > If instead you use peek() and read(), the situation is better, but you
>> > end up doing multiple copies of data; also, you must call read() to
>> > advance the file pointer even though you don't care about the results.
>>
>> Have you measured how bad the situation is if you do implement it this way?
>
> It is actually quite good compared to the status quo (a 3x to 10x
> improvement), and as good as the seek/read solution for regular files
> (and, of course, much better for gzipped files once GzipFile.peek is
> implemented):
> http://bugs.python.org/issue3873#msg117483
>
> So, for solving the unpickle performance issue, it is sufficient.
> Chances are the bottleneck for further improvements would be in the
> unpickling logic itself. Still, the peek()+read() approach feels a bit
> clunky.
>
> Direct timing shows that peek()+read() has a non-trivial cost compared
> to read():
>
> $ ./python -m timeit -s "f=open('Misc/HISTORY', 'rb')" "f.seek(0)" \
>  "while f.read(4096): pass"
> 1000 loops, best of 3: 277 usec per loop
> $ ./python -m timeit -s "f=open('Misc/HISTORY', 'rb')" "f.seek(0)" \
>  "while f.read(4096): f.peek(4096)"
> 1000 loops, best of 3: 361 usec per loop
>
> (that's on a C extension type where peek() is almost a single call to
> PyBytes_FromStringAndSize)
>
>> > So I would propose adding the following method to BufferedIOBase:
>> >
>> > prefetch(self, buffer, skip, minread)
>> >
>> > Skip `skip` bytes from the stream.  Then, try to read at
>> > least `minread` bytes and write them into `buffer`. The file
>> > pointer is advanced by at most `skip + minread`, or less if
>> > the end of file was reached. The total number of bytes written
>> > in `buffer` is returned, which can be more than `minread`
>> > if additional bytes could be prefetched (but, of course,
>> > cannot be more than `len(buffer)`).
>> >
>> > Arguments:
>> > - `buffer`: a writable buffer (e.g. bytearray)
>> > - `skip`: number of bytes to skip (must be >= 0)
>> > - `minread`: number of bytes to read (must be >= 0 and <= len(buffer))
>>
>> I like the idea of an API that combines seek and read into a mutable
>> buffer. However the semantics of this call seem really weird: there is
>> no direct relationship between where it leaves the stream position and
>> how much data it reads into the buffer. Can you explain how exactly
>> this will help solve the gzipped pickle performance problem?
>
> The general idea with buffering is that:
> - you want to skip the previously prefetched bytes (through peek()
>  or prefetch()) which have been consumed -> hence the `skip` argument
> - you want to consume a known number of bytes from the stream (for
>  example a 4-byte little-endian integer) -> hence the `minread`
>  argument
> - you would like to prefetch some more bytes if cheaply possible, so as
>  to avoid calling read() or prefetch() too much; but you don't know
>  yet if you will consume those bytes, so the file pointer shouldn't be
>  advanced for them
>
> If you don't prefetch more than the minimum needed amount of bytes, you
> don't solve the performance problem at all (unpickling needs many tiny
> reads). If you advance the file pointer past the whole prefetched data
> (even though it might not be entirely consumed), you need to seek()
> back at the end: it doesn't work on unseekable files, and is very slow
> on some seekable file types.
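>
> A hypothetical caller reading a 4-byte length prefix followed by a
> payload could look like this (all names invented for illustration;
> `f` stands for an object providing the proposed prefetch()):
>
>     import struct
>
>     buf = bytearray(4096)
>     n = f.prefetch(buf, skip=0, minread=4)    # pointer advances by 4
>     (size,) = struct.unpack("<I", bytes(buf[:4]))
>     avail = n - 4                  # readahead sitting past the pointer
>     if avail >= size:
>         payload = bytes(buf[4:4 + size])
>         next_skip = size           # consumed readahead: skip it next time
>     else:
>         head = bytes(buf[4:4 + avail])        # save before buf is reused
>         f.prefetch(buf, skip=avail, minread=size - avail)
>         payload = head + bytes(buf[:size - avail])   # assumes size <= 4096
>         next_skip = 0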
>
> So, the proposal is like a combination of forward seek() + read() +
> peek() in a single call. With the advantages that:
> - it works on non-seekable files (things like SocketIO)
> - it allows the caller to operate in its own buffer (this is nice in C)
> - it returns the data naturally concatenated, so you don't have to
>  do the concatenation yourself
> - it gives more guarantees than peek() as to the min and max number of
>  bytes returned; peek(), as it is not allowed to advance the file
>  pointer, can return as little as 1 byte (even if you ask for 4096,
>  and even if EOF isn't reached)
>
> I also find it interesting that implementing a single primitive would be
> enough for creating custom buffered types (by deriving the other methods
> from it), but the aesthetics of this can be controversial.
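>
> For instance, read() and peek() could be derived from prefetch() along
> these lines (a sketch only, assuming the semantics above):
>
>     class DerivedMethodsMixin:
>         def read(self, n):
>             buf = bytearray(n)
>             got = self.prefetch(buf, 0, n)    # advances by at most n
>             return bytes(buf[:got])
>
>         def peek(self, n):
>             buf = bytearray(n)
>             got = self.prefetch(buf, 0, 0)    # minread=0: no advance
>             return bytes(buf[:got])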

Thanks for the long explanation. I have some further questions:

It seems this won't make any difference for a truly unbuffered stream,
right? A truly unbuffered stream would not have a buffer where it
could save the bytes that were prefetched past the stream position, so
it wouldn't return any optional extra bytes, so there would be no
speedup. And for a buffered stream, it would be much simpler to just
read ahead in large chunks and seek back once you've found the end.
(Actually for a buffered stream I suppose that many short read() and
small seek() calls aren't actually slow since most of the time they
work within the buffer.) So it seems the API is specifically designed
to improve the situation with GzipFile since it maintains the fiction
of an unbuffered file but in fact has some internal buffer space. I
wonder if it wouldn't be better to add an extra buffer to GzipFile so
small seek() and read() calls can be made more efficient?
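
For example, I would expect something as simple as wrapping the
GzipFile in a BufferedReader to get most of the benefit (a sketch;
filename and buffer size are illustrative):

    import gzip, io

    raw = gzip.GzipFile("data.pkl.gz", "rb")
    f = io.BufferedReader(raw, buffer_size=8192)
    # Small read() calls are now served from the buffer, and small
    # relative seeks that stay inside the buffer need not touch `raw`
    # at all (at least in the C implementation of BufferedReader).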

In fact, this makes me curious as to the use that unpickling can make
of the prefetch() call -- I suppose you had to implement some kind of
layer on top of prefetch() that behaves more like a plain unbuffered
file?

I want to push back on this more, primarily because a new primitive
I/O operation has high costs: it can never be removed, it has to be
added to every stream implementation, developers need to learn to use
the new operation, and so on. A local change that only affects
GzipFile doesn't have any of these problems.

Also, if you can believe the multi-core crowd, a very different
possible future development might be to run the gunzip algorithm and
the unpickle algorithm in parallel, on separate cores. Granted, such a
solution would require totally *different* new I/O primitives, which
might have a higher chance of being reusable outside the context of
pickle.

-- 
--Guido van Rossum (python.org/~guido)


