[ python-Bugs-849046 ] gzip.GzipFile is slow

SourceForge.net noreply at sourceforge.net
Tue Nov 25 17:05:16 EST 2003


Bugs item #849046, was opened at 2003-11-25 10:45
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=849046&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Ronald Oussoren (ronaldoussoren)
Assigned to: Nobody/Anonymous (nobody)
Summary: gzip.GzipFile is slow

Initial Comment:
gzip.GzipFile is significantly (an order of magnitude) 
slower than using the gzip binary. I've been bitten by this 
several times, and have replaced "fd = gzip.open('somefile', 
'r')" with "fd = os.popen('gzcat somefile', 'r')" on several 
occasions.

Would a patch that implemented GzipFile in C have any 
chance of being accepted?

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 17:05

Message:
Logged In: YES 
user_id=764593

In the library, I see a fair amount of work that doesn't really 
do anything except make sure you're getting exactly a line at 
a time.

Would it be an option to just read the whole file in at once, 
split it on newlines, and then loop over the list?  (Or read it 
into a cStringIO, I suppose.)
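
That approach can be sketched as follows (a minimal illustration, assuming 
the whole file fits comfortably in memory; read_all_lines is a hypothetical 
helper, not part of the gzip module):

```python
import gzip

def read_all_lines(path):
    # Decompress the whole file in one bulk read instead of
    # paying per-call overhead in a readline() loop.
    f = gzip.open(path, 'rb')
    try:
        data = f.read()
    finally:
        f.close()
    # splitlines(True) keeps the newline on each line,
    # matching what readline() would have returned.
    return data.splitlines(True)
```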

----------------------------------------------------------------------

Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:12

Message:
Logged In: YES 
user_id=580910

To be more precise:

$ ls -l gzippedfile
-rw-r--r--  1 ronald  admin  354581 18 Nov 10:21 gzippedfile

$ gzip -l gzippedfile
compressed  uncompr. ratio uncompressed_name
   354581   1403838  74.7% gzippedfile

The file contains about 45K lines of text (about 40 characters/line)

$ time gzip -dc gzippedfile >  /dev/null

real    0m0.100s
user    0m0.060s
sys     0m0.000s

$ python read.py gzippedfile > /dev/null

real    0m3.222s
user    0m3.020s
sys     0m0.070s

$ cat read.py
#!/usr/bin/env python

import sys
import gzip

fd = gzip.open(sys.argv[1], 'r')

ln = fd.readline()
while ln:
    sys.stdout.write(ln)
    ln = fd.readline()


The difference is also significant for larger files (i.e. it is 
not caused by differing startup times).



----------------------------------------------------------------------

Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:03

Message:
Logged In: YES 
user_id=580910

The files are created using GzipFile. That speed is acceptable 
because it happens in a batch job; reading back is the problem, 
because it happens on demand while a user is waiting for the 
results.

gzcat is a *decompression* utility (specifically, it is "gzip -dc"); 
the compression level is irrelevant for this discussion. 

The Python code seems to do quite a lot of string manipulation; 
maybe that is causing the slowdown (I'm calling fd.readline() in a 
fairly tight loop). I'll do some profiling to check what is taking so 
much time.
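
A profiling run along those lines might look like this (a sketch using the 
modern cProfile and pstats modules; read_lines mirrors the readline loop 
from read.py, and profile_read is a hypothetical helper):

```python
import cProfile
import gzip
import io
import pstats

def read_lines(path):
    # The same readline() loop as in read.py, counting lines.
    count = 0
    fd = gzip.open(path, 'rb')
    try:
        ln = fd.readline()
        while ln:
            count += 1
            ln = fd.readline()
    finally:
        fd.close()
    return count

def profile_read(path):
    # Run the loop under the profiler and return the top entries,
    # sorted by cumulative time, as text.
    prof = cProfile.Profile()
    prof.enable()
    read_lines(path)
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats('cumulative').print_stats(10)
    return buf.getvalue()
```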

BTW. I'm doing this on Unix systems (Sun Solaris and Mac OS X).

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 12:35

Message:
Logged In: YES 
user_id=764593

Which compression level are you using?

It looks like most of the work is already done by zlib (which is in C), but GzipFile defaults to compression level 9.  Many other zips (including your gzcat?) default to a lower (but much faster) compression level.  
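
For reference, the compression level only affects writing, not reading; a 
quick sketch of the size trade-off, using the modern gzip.compress 
convenience wrapper:

```python
import gzip

# Compressible sample data (repeated byte pattern).
data = bytes(range(256)) * 400

# Level 1 favours speed, level 9 (GzipFile's default) favours size;
# decompression is unaffected by the level used when writing.
fast = gzip.compress(data, compresslevel=1)
best = gzip.compress(data, compresslevel=9)
```

Both outputs decompress to the same bytes; only write-time speed and 
output size differ.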


----------------------------------------------------------------------
