A fast way to read last line of gzip archive ?

David Bolen db3l.net at gmail.com
Mon May 25 00:58:03 CEST 2009


"Barak, Ron" <Ron.Barak at lsi.com> writes:

> I thought maybe someone has a way to unzip just the end portion of
> the archive (instead of the whole archive), as only the last part is
> needed for reading the last line.

The problem is that gzip compressed output has no reliable
intermediate break points that you can jump to and just start
decompressing without having worked through the prior data.

In your specific code, using readlines() is probably not ideal as it
will create the full list containing all of the decoded file contents
in memory only to let you pick the last one.  So a small optimization
would be to just iterate through the file (directly or by calling
readline()) until you reach the last line.

However, since you don't care about the bulk of the file, but only
need to work with the final line in Python, this is an activity that
could be handled more efficiently with external tools, as you need
not involve much interpreter time to actually decompress/discard the
bulk of the file.

For example, on my system, comparing these two cases:

    # last.py

    import gzip
    import sys

    in_file = gzip.open(sys.argv[1],'r')
    for line in in_file:
        pass
    print 'Last:', line


    # last-popen.py

    import sys
    from subprocess import Popen, PIPE

    # Implement gzip -dc <file> | tail -1
    gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
    tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
    # Drop our copy of the pipe so gzip sees SIGPIPE if tail exits early
    gzip.stdout.close()
    line = tail.communicate()[0]
    print 'Last:', line

with an ~80MB log file compressed to about 8MB resulted in last.py
taking about 26 seconds, while last-popen.py took about 1.7s.  Both
resulted in the same value in "line".  As long as you have local
binaries for gzip/tail (such as Cygwin or MinGW or equivalent) this
works fine on Windows systems too.

If you really want to keep everything in Python, then I'd suggest
working to optimize the "skip" portion of the task, trying to
decompress the bulk of the file as quickly as possible.  For example,
one possibility would be something like:

    # last-chunk.py
    
    import gzip
    import sys
    from cStringIO import StringIO

    in_file = gzip.open(sys.argv[1],'r')

    chunks = ['', '']
    while 1:
        chunk = in_file.read(1024*1024)
        if not chunk:
            break
        del chunks[0]
        chunks.append(chunk)

    data = StringIO(''.join(chunks))
    for line in data:
        pass
    print 'Last:', line

with the idea that you decode about a MB at a time, holding onto the
final two chunks (in case the actual final chunk turns out to be
smaller than one of your lines), and then only process those for
lines.  There's probably some room for tweaking the mechanism for
holding onto just the last two chunks, but I'm not sure it will make
a major difference in performance.
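
As one possible tweak to the two-chunk bookkeeping, here is a minimal
sketch using collections.deque with maxlen=2, which drops the oldest
chunk automatically on each append.  Note the assumptions: it is
written in Python 3 syntax (unlike the listings in this message), the
names last_line and chunk_size are just illustrative, and it returns
the line as bytes without its trailing newline.  I have not timed it.

```python
import gzip
from collections import deque


def last_line(path, chunk_size=1024 * 1024):
    """Return the final line of a gzip file as bytes (hypothetical helper)."""
    # A deque with maxlen=2 keeps only the two most recent chunks;
    # appending a third silently discards the oldest one.
    chunks = deque(maxlen=2)
    with gzip.open(path, 'rb') as in_file:
        while True:
            chunk = in_file.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)

    # Only the retained tail of the file is ever split into lines.
    lines = b''.join(chunks).splitlines()
    return lines[-1] if lines else b''
```

It would be called as last_line(sys.argv[1]).  As with the version
above, a final line longer than what the last two chunks hold would
come back truncated, so chunk_size should comfortably exceed your
longest expected line.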

In the same environment as the earlier tests, the above took about
2.7s.  That is still much slower than the external utilities in
percentage terms, but in absolute terms an extra second or so may not
be critical for you if you want to stay in pure Python.

-- David


