Re: [Python-ideas] Prefetching on buffered IO files

Hello, (moved to python-ideas)

On Mon, 27 Sep 2010 17:39:45 -0700, Guido van Rossum <guido@python.org> wrote:
The bug entry is quite old and at first the slowness had to do with the pure Python IO layer. Now the remaining performance difference with Python 2 is entirely caused by the following core issue:
Right.
The worst case is backwards seeks. Forward seeks are implemented as a simple read(), which makes them O(k) where k is the displacement. For buffering applications where k is bounded by the buffer size, this is O(1) (still with, of course, a non-trivial multiplier). Backwards seeks are implemented by rewinding the whole file (seek(0)) and then reading again up to the requested position, which makes them O(n) with n the absolute target position. When your requirement is to rewind by a bounded number of bytes in order to undo some readahead, this is rather catastrophic.

I don't know how the gzip algorithm works under the hood; my impression is that optimizing backwards seeks would have us save checkpoints of the decompressor state and restore them if needed. It doesn't sound like a trivial improvement, and it would involve tradeoffs w.r.t. the performance of sequential reads. (I haven't looked at BZ2File, which has a totally different -- and outdated -- implementation.)

That's why I would favour the peek() (or peek()-like, as in the prefetch() idea) approach anyway. Not only does it work on unseekable files, but implementing peek() when you have an internal buffer is quite simple (see GzipFile.peek here: http://bugs.python.org/issue9962). peek() could also be added to BytesIO, even though it claims to implement RawIOBase rather than BufferedIOBase. (But of course, when you have a BytesIO, you can simply feed its getvalue() or getbuffer() directly to pickle.loads.)
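[Editor's note: a minimal sketch of the "peek() is simple when you have an internal buffer" point above. The class and its names are illustrative, not the actual issue 9962 patch; it just shows the buffering idea behind such a peek().]

```python
import io

class BufferedPeekReader:
    """Illustrative wrapper: peek() returns upcoming bytes without
    advancing the logical stream position, using an internal buffer."""

    def __init__(self, raw):
        self._raw = raw
        self._buf = b""          # bytes fetched but not yet consumed

    def peek(self, n=1):
        # Top up the buffer; like BufferedReader.peek(), this may
        # return more (or at EOF, fewer) than n bytes.
        if len(self._buf) < n:
            self._buf += self._raw.read(n - len(self._buf))
        return self._buf         # not consumed: position unchanged

    def read(self, n=-1):
        if n < 0:
            data, self._buf = self._buf + self._raw.read(), b""
            return data
        data = self._buf[:n]
        self._buf = self._buf[n:]
        if len(data) < n:
            data += self._raw.read(n - len(data))
        return data

r = BufferedPeekReader(io.BytesIO(b"hello world"))
assert r.peek(5) == b"hello"     # nothing consumed
assert r.read(5) == b"hello"     # the same bytes come back from read()
```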
I really don't know. But I don't think we can break the API for a special case without potentially causing nasty surprises for the user. Also, my intuition is that pickling directly from a stream is partly meant for cases where you want to access data following the pickle data in the stream.
It is actually quite good compared to the status quo (3x to 10x), and as good as the seek/read solution for regular files (and, of course, much better for gzipped files once GzipFile.peek is implemented): http://bugs.python.org/issue3873#msg117483

So, for solving the unpickle performance issue, it is sufficient. Chances are the bottleneck for further improvements would be in the unpickling logic itself. It feels a bit clunky, though. Direct timing shows that peek()+read() has a non-trivial cost compared to read() alone:

$ ./python -m timeit -s "f=open('Misc/HISTORY', 'rb')" "f.seek(0)" \
  "while f.read(4096): pass"
1000 loops, best of 3: 277 usec per loop
$ ./python -m timeit -s "f=open('Misc/HISTORY', 'rb')" "f.seek(0)" \
  "while f.read(4096): f.peek(4096)"
1000 loops, best of 3: 361 usec per loop

(That's on a C extension type where peek() is almost a single call to PyBytes_FromStringAndSize.)
The general idea with buffering is that:

- you want to skip the previously prefetched bytes (through peek() or prefetch()) which have been consumed -> hence the `skip` argument
- you want to consume a known number of bytes from the stream (for example a 4-byte little-endian integer) -> hence the `minread` argument
- you would like to prefetch some more bytes if cheaply possible, so as to avoid calling read() or prefetch() too often; but you don't know yet whether you will consume those bytes, so the file pointer shouldn't be advanced for them

If you don't prefetch more than the minimum needed amount of bytes, you don't solve the performance problem at all (unpickling needs many tiny reads). If you advance the file pointer past the whole prefetched data (even though it might not be entirely consumed), you need to seek() back at the end: that doesn't work on unseekable files, and is very slow on some seekable file types.

So, the proposal is like a combination of forward seek() + read() + peek() in a single call. With the advantages that:

- it works on non-seekable files (things like SocketIO)
- it allows the caller to operate on its own buffer (this is nice in C)
- it returns the data naturally concatenated, so you don't have to do it yourself if needed
- it gives more guarantees than peek() as to the min and max number of bytes returned; peek(), since it is not allowed to advance the file pointer, can return as little as 1 byte (even if you ask for 4096, and even if EOF isn't reached)

I also find it interesting that implementing a single primitive is enough for creating custom buffered types (by deriving the other methods from it), but the aesthetics of this can be controversial.

Regards

Antoine.
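[Editor's note: a toy pure-Python model of the skip/minread semantics described above. The class name, the exact signature, and the readahead size are assumptions for illustration; the thread never settled on a concrete API.]

```python
import io

class PrefetchStream:
    """Toy model of the proposed prefetch() primitive (names and
    signature are assumptions, not a settled API)."""

    def __init__(self, raw, readahead=4096):
        self._raw = raw
        self._peeked = b""          # bytes returned but not yet consumed
        self._readahead = readahead

    def prefetch(self, skip, minread):
        # 1. Drop `skip` bytes: the caller has consumed that much of
        #    what the previous prefetch() returned.
        self._peeked = self._peeked[skip:]
        # 2. Ensure at least `minread` bytes are available, reading a
        #    full readahead chunk in a single raw call when possible.
        while len(self._peeked) < minread:
            chunk = self._raw.read(max(minread - len(self._peeked),
                                       self._readahead))
            if not chunk:
                break               # EOF: may return fewer than minread
            self._peeked += chunk
        # 3. Return minread bytes plus any free readahead, naturally
        #    concatenated; the extra bytes count as "not yet consumed".
        return self._peeked

s = PrefetchStream(io.BytesIO(b"\x04\x00\x00\x00spam and eggs"))
data = s.prefetch(skip=0, minread=4)      # read a 4-byte length field...
assert data[:4] == b"\x04\x00\x00\x00"    # ...plus free readahead after it
assert len(data) >= 4
```

The next call would pass skip equal to however many of those bytes the caller actually consumed, e.g. `s.prefetch(skip=4, minread=4)`.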

On Tue, Sep 28, 2010 at 5:57 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Thanks for the long explanation. I have some further questions:

It seems this won't make any difference for a truly unbuffered stream, right? A truly unbuffered stream would not have a buffer where it could save the bytes that were prefetched past the stream position, so it wouldn't return any optional extra bytes, so there would be no speedup. And for a buffered stream, it would be much simpler to just read ahead in large chunks and seek back once you've found the end. (Actually for a buffered stream I suppose that many short read() and small seek() calls aren't actually slow since most of the time they work within the buffer.)

So it seems the API is specifically designed to improve the situation with GzipFile, since it maintains the fiction of an unbuffered file but in fact has some internal buffer space. I wonder if it wouldn't be better to add an extra buffer to GzipFile so small seek() and read() calls can be made more efficient? In fact, this makes me curious as to the use that unpickling can make of the prefetch() call -- I suppose you had to implement some kind of layer on top of prefetch() that behaves more like a plain unbuffered file?

I want to push back on this more, primarily because a new primitive I/O operation has high costs: it can never be removed, it has to be added to every stream implementation, developers need to learn to use the new operation, and so on. A local change that only affects GzipFile doesn't have any of these problems.

Also, if you can believe the multi-core crowd, a very different possible future development might be to run the gunzip algorithm and the unpickle algorithm in parallel, on separate cores. Surely such a solution would require totally *different* new I/O primitives, which might have a higher chance of being reusable outside the context of pickle.

-- 
--Guido van Rossum (python.org/~guido)

Le mardi 28 septembre 2010 à 07:08 -0700, Guido van Rossum a écrit :
Indeed. But you can trivially wrap an unbuffered stream inside a BufferedReader, and get peek() even when the raw stream is unseekable.
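[Editor's note: a small sketch of the wrapping trick mentioned above. The OneWayStream class is an invented in-memory stand-in for an unseekable raw stream such as SocketIO.]

```python
import io

class OneWayStream(io.RawIOBase):
    """Unseekable raw stream: data can only be read forward."""

    def __init__(self, data):
        self._data = data

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        n = min(len(b), len(self._data))
        b[:n], self._data = self._data[:n], self._data[n:]
        return n              # 0 signals EOF

# Wrapping in BufferedReader grants peek() even though seek() is refused.
f = io.BufferedReader(OneWayStream(b"pickle payload"))
assert not f.seekable()
assert f.peek(6)[:6] == b"pickle"   # look ahead without consuming
assert f.read(6) == b"pickle"       # the same bytes are still there
```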
And for a buffered stream, it would be much simpler to just read ahead in large chunks and seek back once you've found the end.
Well, no, only if your stream is seekable and seek() is fast enough. So, it wouldn't work on SocketIO for example (even wrapped inside a BufferedReader, since BufferedReader will refuse to seek() if seekable() returns False).
The problem is that, since the buffer of the unpickler and the buffer of the GzipFile are not aware of each other, the unpickler could easily ask to seek() backwards past the current GzipFile buffer, and fall back on the slow algorithm. The "extra buffer" can trivially consist of wrapping the GzipFile inside a BufferedReader (which is actually recommended if you want e.g. very fast readlines()), but it doesn't solve the above issue.
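[Editor's note: the recommended wrapping pattern from the paragraph above, shown concretely. The in-memory payload is illustrative.]

```python
import gzip
import io

# Build a small gzipped payload in memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as g:
    g.write(b"line one\nline two\n")
buf.seek(0)

# Wrapping GzipFile in a BufferedReader gives fast readline()/peek(),
# but its buffer is still separate from any buffer the unpickler keeps.
f = io.BufferedReader(gzip.GzipFile(fileobj=buf, mode="rb"))
assert f.readline() == b"line one\n"
assert f.peek(1)[:4] == b"line"     # readahead without consuming
```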
I didn't implement prefetch() at all. It would be premature :) But, if the stream had prefetch(), the unpickling would be simplified: I would only have to call prefetch() once when refilling the buffer, rather than two read()'s followed by a peek(). (I could try to coalesce the two reads, but it would complicate the code a bit more...)
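[Editor's note: a sketch of the refill pattern being compared above. The helper name and buffer-management details are invented for illustration; the real logic lives in _pickle.c.]

```python
import io

def refill(f, skip, minread):
    """Refill under the read()+peek() approach: up to three calls."""
    if skip:
        f.read(skip)            # 1. discard previously peeked bytes
    data = f.read(minread)      # 2. bytes we are committed to consume
    extra = f.peek(4096)        # 3. free readahead, pointer unmoved
    return data + extra

# With a prefetch() primitive, the three calls would collapse into one:
#     data = f.prefetch(skip, minread)

f = io.BufferedReader(io.BytesIO(b"0123456789"))
out = refill(f, 0, 4)
assert out.startswith(b"0123")  # guaranteed bytes, plus cheap readahead
```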
I agree with this (except that most developers don't really need to learn to use it: common uses of readable files are content with read() and readline(), and need neither peek() nor prefetch()). I don't intend to push this for 3.2; I'm throwing the idea around with a hypothetical 3.3 landing if it seems useful.
Well, it's a bit of a pie-in-the-sky perspective :) Furthermore, such a solution won't improve CPU efficiency, so if your workload is already able to utilize all CPU cores (which it can easily do if you are in a VM, or have multiple busy daemons), it doesn't bring anything. Regards Antoine.

On Tue, Sep 28, 2010 at 7:32 AM, Antoine Pitrou <solipsis@pitrou.net> wrote: [Guido]
But AFAICT unpickle doesn't use seek()? [...]
Where exactly would the peek be used? (I must be confused because I can't find either peek or seek in _pickle.c.) It still seems to me that the "right" way to solve this would be to insert a transparent extra buffer somewhere, probably in the GzipFile code, and work in reducing the call overhead.
So far it seems more awkward than useful.
Agreed it's pie in the sky... Though the interface between the two CPUs might actually be designed to be faster than the current buffered I/O. I have (mostly :-) fond memories of async I/O on a mainframe I used in the '70s which worked this way. -- --Guido van Rossum (python.org/~guido)

On Tue, 28 Sep 2010 09:44:38 -0700 Guido van Rossum <guido@python.org> wrote:
peek/seek are not used currently (in SVN). Each of them is used in one of the prefetching approaches proposed to solve the unpickling performance problem. (the first approach uses seek() and read(), the second approach uses read() and peek(); as already explained, I tend to consider the second approach much better, and the prefetch() proposal comes in part from the experience gathered on that approach)
No, because if you don't have any buffering on the unpickling side (rather than on the GzipFile or the BufferedReader side), then you still have the method call overhead no matter what. And this overhead is rather big when you're reading data byte by byte, or word by word (which unpickling very frequently does).

(For the record, GzipFile already has an internal buffer. But calling GzipFile.read() still has a large overhead compared to reading data directly from a prefetch buffer inside the unpickler object.)

Regards

Antoine.
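[Editor's note: a toy illustration of the per-call overhead argument above. The class and helper names are invented; the real unpickler does the equivalent in C inside _pickle.c.]

```python
import io
import struct

def read_u32_slow(f):
    # One method call per 4-byte field: the overhead unpickling pays
    # when it reads directly from the file object.
    return struct.unpack("<I", f.read(4))[0]

class BufferedFields:
    """Serve small fields from a local prefetch buffer: one big read()
    up front, then cheap in-process slicing per field."""

    def __init__(self, f, chunk=8192):
        self._data = f.read(chunk)
        self._pos = 0

    def read_u32(self):
        p = self._pos
        self._pos = p + 4
        return struct.unpack_from("<I", self._data, p)[0]

raw = struct.pack("<3I", 1, 2, 3)
assert read_u32_slow(io.BytesIO(raw)) == 1
b = BufferedFields(io.BytesIO(raw))
assert [b.read_u32() for _ in range(3)] == [1, 2, 3]
```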

participants (2):
- Antoine Pitrou
- Guido van Rossum