A fast way to read last line of gzip archive ?
Barak, Ron
Ron.Barak at lsi.com
Tue May 26 04:10:24 EDT 2009
Hi David,
Thanks for the below solutions: most illuminating.
I implemented the suggestions from your previous message, and the processing time (on my datasets) is already acceptable for human users.
I'll try your suggestion below.
Thanks again.
Ron.
> -----Original Message-----
> From: David Bolen [mailto:db3l.net at gmail.com]
> Sent: Tuesday, May 26, 2009 03:56
> To: python-list at python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> "Barak, Ron" <Ron.Barak at lsi.com> writes:
>
> > I couldn't really go with the shell utilities approach, as I have
> > no say in my user environment, and thus cannot assume which
> > binaries are installed on the user's machine.
>
> I suppose if you knew your target you could just supply the
> external binaries to go with your application, but I agree
> that would probably be more of a pain than it's worth for the
> performance gain in real-world time.
>
> > I'll try and implement your last suggestion, and see if the
> > performance is acceptable to (human) users.
>
> In terms of tuning the third option a bit, I'd play with the
> tracking of the final two chunks (as mentioned in my first
> response), perhaps shrinking the chunk size or only
> processing a smaller portion of it for lines (assuming a
> reasonable line size) to minimize the final loop.
> You could also try using splitlines() on the final buffer
> rather than a StringIO wrapper; that'll have a memory hit
> for the constructed list, but processing only a small portion
> of the buffer would minimize it.
>
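> To make the difference concrete, here's a minimal sketch (purely
> illustrative, not one of the timed variants; the function names are
> just for this example) of the two ways of pulling the last line out
> of a decompressed text buffer:
>
> from cStringIO import StringIO
>
> def last_line_stringio(buf):
>     # File-like wrapper: readlines() builds a list of every line.
>     return StringIO(buf).readlines()[-1]
>
> def last_line_splitlines(buf, max_line=255):
>     # splitlines() on just the tail slice, assuming no line is
>     # longer than max_line bytes.
>     return buf[-max_line:].splitlines(True)[-1]
>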
> I was curious what I could actually achieve, so here are
> three variants that I came up with.
>
> First, this just fine-tunes the chunk tracking slightly and
> then only processes enough final data based on an anticipated
> maximum line length (so if the final line is longer than
> that you'll only get the final MAX_LINE bytes of that line).
> I also found I got better performance using a smaller 1024-byte
> chunk size with GzipFile.read() than 1 MB - not entirely sure
> why, although it perhaps matches the internal buffer size
> better:
>
> # last-chunk-2.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1],'r')
>
> chunk = prior_chunk = ''
> while 1:
>     prior_chunk = chunk
>     # Note that CHUNK_SIZE here is in terms of decompressed data
>     chunk = in_file.read(CHUNK_SIZE)
>     if len(chunk) < CHUNK_SIZE:
>         break
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk.splitlines(True)[-1]
> print 'Last:', line
>
>
> On the same test set as my last post, this reduced the
> last-chunk timing from about 2.7s to about 2.3s.
>
> Now, if you're willing to play a little looser with the gzip
> module, you can gain quite a bit more. If you directly call
> the internal _read() method you can bypass some of the
> unnecessary processing read() does, and go back to larger I/O chunks:
>
> # last-gzip.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024*1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1],'r')
>
> chunk = prior_chunk = ''
> while 1:
>     try:
>         # Note that CHUNK_SIZE here is raw data size, not decompressed
>         in_file._read(CHUNK_SIZE)
>     except EOFError:
>         if in_file.extrasize < MAX_LINE:
>             chunk = chunk + in_file.extrabuf
>         else:
>             chunk = in_file.extrabuf
>         break
>
>     chunk = in_file.extrabuf
>     in_file.extrabuf = ''
>     in_file.extrasize = 0
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> Note that in this case since I was able to bump up
> CHUNK_SIZE, I take a slice to limit the work splitlines() has
> to do and the size of the resulting list. Using the larger
> CHUNK_SIZE (and it being raw compressed size) will use more
> memory, so it could be tuned down if necessary.
>
> Of course, the risk here is that you are dependent on the
> _read() method, and the internal use of the
> extrabuf/extrasize attributes, which is where _read() places
> the decompressed data. In looking back I'm pretty sure this
> code is safe at least for Python 2.4 through 3.0, but you'd
> have to accept some risk in the future.
>
> This approach got me down to 1.48s.
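>
> If you want some protection against that dependency, one cheap guard
> (hypothetical, not part of the timed code above) is to check for those
> internals up front and fall back to the plain read() loop when they're
> missing:
>
> def can_use_gzip_internals(gz):
>     # The shortcut relies on these private attributes; if a future
>     # gzip module renames or drops them, stick with the public API.
>     return all(hasattr(gz, name)
>                for name in ('_read', 'extrabuf', 'extrasize'))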
>
> Then, just for the fun of it: once you're playing a little
> looser with the gzip module, note that it's also doing work to
> compute the CRC of the original data for comparison with the
> decompressed data. If you don't mind so much about that
> (depends on what you're using the line for) you can just do
> your own raw decompression with the zlib module, as in the
> following code, although I still start with a GzipFile()
> object to avoid having to rewrite the header processing:
>
> # last-decompress.py
>
> import gzip
> import sys
> import zlib
>
> CHUNK_SIZE = 1024*1024
> MAX_LINE = 255
>
> decompress = zlib.decompressobj(-zlib.MAX_WBITS)
>
> in_file = gzip.open(sys.argv[1],'r')
> in_file._read_gzip_header()
>
> chunk = prior_chunk = ''
> while 1:
>     buf = in_file.fileobj.read(CHUNK_SIZE)
>     if not buf:
>         break
>     d_buf = decompress.decompress(buf)
>     # We might not have been at EOF in the read() but still have no
>     # decompressed data if the only remaining data was not original data
>     if d_buf:
>         prior_chunk = chunk
>         chunk = d_buf
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> This version got me down to 1.15s.
>
> So in summary, the choices when tested on my system ended up at
> (times in seconds):
>
> last              26
> last-chunk         2.7
> last-chunk-2       2.3
> last-popen         1.7
> last-gzip          1.48
> last-decompress    1.12
>
> So by being willing to mix in some more direct code with the
> GzipFile object, I was able to beat the overhead of shelling
> out to the faster utilities, while remaining in pure Python.
>
> -- David
>
>