A fast way to read last line of gzip archive ?
David Bolen
db3l.net at gmail.com
Mon May 25 20:56:02 EDT 2009
"Barak, Ron" <Ron.Barak at lsi.com> writes:
> I couldn't really go with the shell utilities approach, as I have no
> say in my user environment, and thus cannot assume which binaries
> are installed on the user's machine.
I suppose if you knew your target you could just supply the external
binaries to go with your application, but I agree that would probably
be more of a pain than it's worth for the performance gain in real
world time.
> I'll try and implement your last suggestion, and see if the
> performance is acceptable to (human) users.
In terms of tuning the third option a bit, I'd play with the tracking
of the final two chunks (as mentioned in my first response), perhaps
shrinking the chunk size or only processing a smaller portion of the
final chunk for lines (assuming a reasonable line size) to minimize
the final loop.
You could also try using splitlines() on the final buffer rather than
a StringIO wrapper; that builds a list, which has a memory hit, but
only applying it to a small portion of the buffer keeps that cost
down.
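Roughly, as a sketch (buf and the sample data here stand in for that
final buffer, and the MAX_LINE value is just a placeholder):

import cStringIO

MAX_LINE = 255
buf = 'first line\nsecond line\nlast line\n'   # stands in for the final chunk

# StringIO-style approach: wrap the buffer and iterate to the last line
last = None
for line in cStringIO.StringIO(buf):
    last = line

# splitlines() on just the tail of the buffer, so the temporary list
# (and its memory hit) stays small
last = buf[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', last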
I was curious what I could actually achieve, so here are three variants
that I came up with.
First, this just fine-tunes the chunk tracking slightly, and then only
processes enough final data based on an anticipated maximum line length
(so if the final line is longer than that, you'll only get the final
MAX_LINE bytes of that line). I also found I got better performance
using a smaller 1024-byte chunk size with GzipFile.read() than 1 MB -
not entirely sure why, although perhaps it matches the internal buffer
size better:
# last-chunk-2.py

import gzip
import sys

CHUNK_SIZE = 1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1],'r')

chunk = prior_chunk = ''
while 1:
    prior_chunk = chunk
    # Note that CHUNK_SIZE here is in terms of decompressed data
    chunk = in_file.read(CHUNK_SIZE)
    if len(chunk) < CHUNK_SIZE:
        break

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk.splitlines(True)[-1]

print 'Last:', line
On the same test set as my last post, this reduced the last-chunk
timing from about 2.7s to about 2.3s.
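If you want to compare the variants on your own data, even a simple
wall-clock check along these lines is enough (just a sketch - the
script and file names are placeholders):

import os
import time

script = 'last-chunk-2.py'      # placeholder: whichever variant to time
log_file = 'test_log.gz'        # placeholder: a real gzipped test file

start = time.time()
os.system('python %s %s' % (script, log_file))
print 'Elapsed: %.2fs' % (time.time() - start)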
Now, if you're willing to play a little looser with the gzip module,
you can gain quite a bit more. If you directly call the internal _read()
method you can bypass some of the unnecessary processing read() does, and
go back to larger I/O chunks:
# last-gzip.py

import gzip
import sys

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1],'r')

chunk = prior_chunk = ''
while 1:
    try:
        # Note that CHUNK_SIZE here is raw data size, not decompressed
        in_file._read(CHUNK_SIZE)
    except EOFError:
        if in_file.extrasize < MAX_LINE:
            chunk = chunk + in_file.extrabuf
        else:
            chunk = in_file.extrabuf
        break

    chunk = in_file.extrabuf
    in_file.extrabuf = ''
    in_file.extrasize = 0

line = chunk[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', line
Note that in this case since I was able to bump up CHUNK_SIZE, I take
a slice to limit the work splitlines() has to do and the size of the
resulting list. Using the larger CHUNK_SIZE (and it being raw size) will
use more memory, so could be tuned down if necessary.
Of course, the risk here is that you are dependent on the _read()
method, and the internal use of the extrabuf/extrasize attributes,
which is where _read() places the decompressed data. In looking back
I'm pretty sure this code is safe at least for Python 2.4 through 3.0,
but you'd have to accept some risk in the future.
This approach got me down to 1.48s.
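If that dependency worries you, one option (just a sketch, not one of
the variants I timed) is to probe for the internals up front and fall
back to the plain read() loop when they aren't available:

import gzip
import sys

in_file = gzip.open(sys.argv[1], 'r')

# Only take the fast path if this GzipFile still exposes the internals
# that last-gzip.py pokes at; otherwise fall back to plain read()s.
if (hasattr(in_file, '_read') and
        hasattr(in_file, 'extrabuf') and
        hasattr(in_file, 'extrasize')):
    print 'internal _read()/extrabuf path available'
else:
    print 'falling back to GzipFile.read() loop'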
Then, just for the fun of it: once you're playing a little looser with
the gzip module, note that it also spends time computing a CRC of the
decompressed data to compare against the checksum stored in the
archive. If you don't mind losing that check (depends on what you're
using the line for), you can do your own raw decompression with the
zlib module, as in the following code, although I still start with a
GzipFile() object to avoid having to rewrite the header processing:
# last-decompress.py

import gzip
import sys
import zlib

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

decompress = zlib.decompressobj(-zlib.MAX_WBITS)

in_file = gzip.open(sys.argv[1],'r')
in_file._read_gzip_header()

chunk = prior_chunk = ''
while 1:
    buf = in_file.fileobj.read(CHUNK_SIZE)
    if not buf:
        break
    d_buf = decompress.decompress(buf)
    # We might not have been at EOF in the read() but still have no
    # decompressed data if the only remaining data was not original data
    if d_buf:
        prior_chunk = chunk
        chunk = d_buf

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', line
This version got me down to 1.15s.
So in summary, the choices when tested on my system ended up at
(times in seconds):

    last               26
    last-chunk          2.7
    last-chunk-2        2.3
    last-popen          1.7
    last-gzip           1.48
    last-decompress     1.12
So by being willing to mix in some more direct code with the GzipFile
object, I was able to beat the overhead of shelling out to the faster
utilities, while remaining in pure Python.
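I haven't shown the last-popen code here - it was the shell-utility
approach discussed earlier. For anyone curious, a variant in that
spirit might look something like this (the zcat/tail pipeline is my
assumption rather than the exact command, and it of course depends on
those binaries being installed):

import os
import sys

# A sketch only: decompress with zcat and let tail grab the last line.
pipe = os.popen('zcat %s | tail -1' % sys.argv[1])
line = pipe.read()
pipe.close()

print 'Last:', line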
-- David