Python vs. Java gzip performance

Andrew MacIntyre andymac at bullseye.apana.org.au
Fri Mar 17 20:50:37 EST 2006


Bill wrote:
> I've written a small program that, in part, reads in a file and parses
> it.  Sometimes, the file is gzipped.  The code that I use to get the
> file object is like so:
> 
> if filename.endswith(".gz"):
>     file = GzipFile(filename)
> else:
>     file = open(filename)
> 
> Then I parse the contents of the file in the usual way (for line in
> file:...)
> 
> The equivalent Java code goes like this:
> 
> if (isZipped(aFile)) {
>     input = new BufferedReader(new InputStreamReader(new
> GZIPInputStream(new FileInputStream(aFile))));
> } else {
>     input = new BufferedReader(new FileReader(aFile));
> }
> 
> Then I parse the contents similarly to the Python version (while
> nextLine = input.readLine...)
> 
> The Java version of this code is roughly 2x-3x faster than the Python
> version.  I can get around this problem by replacing the Python
> GzipFile object with a os.popen call to gzcat, but then I sacrifice
> portability.  Is there something that can be improved in the Python
> version?

The gzip module is implemented in Python on top of the zlib module.  If
you peruse its source (particularly the readline() method of the GzipFile
class) you might get an idea of what's going on.

popen()ing gzcat achieves better performance by shifting the
decompression into an asynchronous execution stream (a separate
process), while allowing the standard Python file object's optimised
readline() implementation (in C) to do the line splitting - which
GzipFile does in Python code.
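
In rough terms that workaround looks something like this (just a sketch
of what you describe, assuming a filename variable is already set):

    import os

    if filename.endswith(".gz"):
        # decompression runs in a separate gzcat process; the pipe we
        # read from is an ordinary file object with C-level readline()
        f = os.popen("gzcat " + filename)
    else:
        f = open(filename)

    for line in f:
        pass    # process line here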

I suspect the Java version probably does something similar under the
covers, using threads.

Short of rewriting the gzip module in C, you may get some better
throughput by using a slightly lower level approach to parsing the file:

    z = GzipFile(filename)
    while 1:
        line = z.readline(size=4096)
        if not line:
            break
        # ... process line here

This is probably only of use for files (such as log files) with lines
longer than the 100 character default in the readline() method.  More
intricate approaches using z.readlines(sizehint=<size>) might also work.
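
For instance (just a sketch - the 64KB sizehint is an arbitrary figure
to tune):

    z = GzipFile(filename)
    while 1:
        # pull in roughly 64KB worth of lines per call
        lines = z.readlines(sizehint=65536)
        if not lines:
            break
        for line in lines:
            pass    # process line here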

If you can afford the memory, approaches that read large chunks from the
gzipped stream and then split the lines in one low level operation (so
that the line splitting is mostly done in C code) are the only way to
really lift performance.
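
Something along these lines might do it (an untested sketch; the
generator name and the 1MB chunk size are just mine):

    from gzip import GzipFile

    def lines_from_gzip(filename, chunksize=1024*1024):
        # Read big chunks from the compressed stream and let
        # str.splitlines() (C code) do the line splitting.
        z = GzipFile(filename)
        leftover = ""
        while 1:
            chunk = z.read(chunksize)
            if not chunk:
                break
            lines = (leftover + chunk).splitlines(True)
            # the last piece may be an incomplete line - carry it over
            if lines and not lines[-1].endswith("\n"):
                leftover = lines.pop()
            else:
                leftover = ""
            for line in lines:
                yield line
        if leftover:
            yield leftover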

To me, if performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable
apart from the dependency on an external gzip (probably better to use
"gzip -dc" rather than "gzcat" to maximise portability though).  gzip
is available for most systems, and the approach is easily modified to
use bzip2 as well (though Python's bz2 module is implemented totally in
C, and so probably doesn't have the performance issues that gzip has).
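
In subprocess terms that might look something like this (a sketch only;
it assumes a gzip executable on the PATH):

    from subprocess import Popen, PIPE

    def open_maybe_compressed(filename):
        if filename.endswith(".gz"):
            # the child process does the decompression; reading its
            # stdout uses the ordinary C-level file object readline()
            return Popen(["gzip", "-dc", filename], stdout=PIPE).stdout
        return open(filename)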

-------------------------------------------------------------------------
Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail: andymac at bullseye.apana.org.au  (pref) | Snail: PO Box 370
        andymac at pcug.org.au             (alt) |        Belconnen ACT 2616
Web:    http://www.andymac.org/               |        Australia


