A fast way to read last line of gzip archive ?
Barak, Ron
Ron.Barak at lsi.com
Tue May 26 04:10:24 EDT 2009
Hi David,
Thanks for the below solutions: most illuminating.
I implemented the suggestions from your previous message, and the processing time (on my datasets) is already acceptable for human users.
I'll try your suggestion below.
Thanks again.
Ron.
> -----Original Message-----
> From: David Bolen [mailto:db3l.net at gmail.com]
> Sent: Tuesday, May 26, 2009 03:56
> To: python-list at python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> "Barak, Ron" <Ron.Barak at lsi.com> writes:
>
> > I couldn't really go with the shell utilities approach, as I have
> > no say in my user environment, and thus cannot assume which
> > binaries are installed on the user's machine.
>
> I suppose if you knew your target you could just supply the
> external binaries to go with your application, but I agree
> that would probably be more of a pain than it's worth for the
> performance gain in real-world time.
>
> > I'll try and implement your last suggestion, and see if the
> > performance is acceptable to (human) users.
>
> In terms of tuning the third option a bit, I'd play with the
> tracking of the final two chunks (as mentioned in my first
> response), perhaps shrinking the chunk size or only
> processing a smaller portion of it for lines (assuming a
> reasonable line size) to minimize the final loop.
> You could also try using splitlines() on the final buffer
> rather than a StringIO wrapper; that'll have a memory hit
> for the constructed list, but processing only a small portion
> of the buffer would minimize it.
>
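> To make the difference concrete, here's a minimal sketch (purely
> illustrative, not one of the timed variants; the function names are
> just for this example) of the two ways of pulling the last line out
> of a decompressed text buffer:
>
> from cStringIO import StringIO
>
> def last_line_stringio(buf):
>     # File-like wrapper: readlines() builds a list of every line.
>     return StringIO(buf).readlines()[-1]
>
> def last_line_splitlines(buf, max_line=255):
>     # splitlines() on just the tail slice, assuming no line is
>     # longer than max_line bytes.
>     return buf[-max_line:].splitlines(True)[-1]
>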
> I was curious what I could actually achieve, so here are
> three variants that I came up with.
>
> First, this just fine-tunes the chunk tracking slightly and
> then only processes enough final data based on an anticipated
> maximum line length (so if the final line is longer than
> that you'll only get the final MAX_LINE bytes of that line).
> I also found I got better performance using a smaller 1024-byte
> chunk size with GzipFile.read() than 1 MB - not entirely sure
> why, although it perhaps matches the internal buffer size
> better:
>
> # last-chunk-2.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1],'r')
>
> chunk = prior_chunk = ''
> while 1:
>     prior_chunk = chunk
>     # Note that CHUNK_SIZE here is in terms of decompressed data
>     chunk = in_file.read(CHUNK_SIZE)
>     if len(chunk) < CHUNK_SIZE:
>         break
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk.splitlines(True)[-1]
> print 'Last:', line
>
>
> On the same test set as my last post, this reduced the
> last-chunk timing from about 2.7s to about 2.3s.
>
> Now, if you're willing to play a little looser with the gzip
> module, you can gain quite a bit more. If you directly call
> the internal _read() method you can bypass some of the
> unnecessary processing read() does, and go back to larger I/O chunks:
>
> # last-gzip.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024*1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1],'r')
>
> chunk = prior_chunk = ''
> while 1:
>     try:
>         # Note that CHUNK_SIZE here is raw data size, not decompressed
>         in_file._read(CHUNK_SIZE)
>     except EOFError:
>         if in_file.extrasize < MAX_LINE:
>             chunk = chunk + in_file.extrabuf
>         else:
>             chunk = in_file.extrabuf
>         break
>
>     chunk = in_file.extrabuf
>     in_file.extrabuf = ''
>     in_file.extrasize = 0
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> Note that in this case since I was able to bump up
> CHUNK_SIZE, I take a slice to limit the work splitlines() has
> to do and the size of the resulting list. Using the larger
> CHUNK_SIZE (and it being raw compressed size) will use more
> memory, so it could be tuned down if necessary.
>
> Of course, the risk here is that you are dependent on the
> _read() method, and the internal use of the
> extrabuf/extrasize attributes, which is where _read() places
> the decompressed data. In looking back I'm pretty sure this
> code is safe at least for Python 2.4 through 3.0, but you'd
> have to accept some risk in the future.
>
> This approach got me down to 1.48s.
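>
> If you want some protection against that dependency, one cheap guard
> (hypothetical, not part of the timed code above) is to check for those
> internals up front and fall back to the plain read() loop when they're
> missing:
>
> def can_use_gzip_internals(gz):
>     # The shortcut relies on these private attributes; if a future
>     # gzip module renames or drops them, stick with the public API.
>     return all(hasattr(gz, name)
>                for name in ('_read', 'extrabuf', 'extrasize'))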
>
> Then, just for the fun of it: once you're playing a little
> looser with the gzip module, note that it's also doing work to
> compute the CRC of the original data for comparison with the
> decompressed data. If you don't mind so much about that
> (depends on what you're using the line for) you can just do
> your own raw decompression with the zlib module, as in the
> following code, although I still start with a GzipFile()
> object to avoid having to rewrite the header processing:
>
> # last-decompress.py
>
> import gzip
> import sys
> import zlib
>
> CHUNK_SIZE = 1024*1024
> MAX_LINE = 255
>
> decompress = zlib.decompressobj(-zlib.MAX_WBITS)
>
> in_file = gzip.open(sys.argv[1],'r')
> in_file._read_gzip_header()
>
> chunk = prior_chunk = ''
> while 1:
>     buf = in_file.fileobj.read(CHUNK_SIZE)
>     if not buf:
>         break
>     d_buf = decompress.decompress(buf)
>     # We might not have been at EOF in the read() but still have no
>     # decompressed data if the only remaining data was not original data
>     if d_buf:
>         prior_chunk = chunk
>         chunk = d_buf
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> This version got me down to 1.15s.
>
> So in summary, the choices when tested on my system ended up at
> (times in seconds):
>
> last              26
> last-chunk         2.7
> last-chunk-2       2.3
> last-popen         1.7
> last-gzip          1.48
> last-decompress    1.12
>
> So by being willing to mix in some more direct code with the
> GzipFile object, I was able to beat the overhead of shelling
> out to the faster utilities, while remaining in pure Python.
>
> -- David
>
>