really slow gzip decompress, why?
Jeff McNeil
jeff at jmcneil.net
Mon Jan 26 11:02:55 EST 2009
On Jan 26, 10:51 am, Jeff McNeil <j... at jmcneil.net> wrote:
> On Jan 26, 10:22 am, redbaron <ivanov.ma... at gmail.com> wrote:
>
> > I have one big (6.9 GB) .gz file with text inside it.
> > zcat bigfile.gz > /dev/null does the job in 4 minutes 50 seconds
>
> > Python code has been doing the same job for 25 minutes and still
> > hasn't finished =( The code is the simplest I could imagine:
>
> > def main():
> >     fh = gzip.open(sys.argv[1])
> >     all(fh)
>
> > As far as I understand, most of the time it executes C code, so no
> > Python overhead should be noticeable. Why is it so slow?
>
> Look at what's happening in both operations. The zcat operation is simply
> uncompressing your data and dumping directly to /dev/null. Nothing is
> done with the data as it's uncompressed.
>
> On the other hand, when you call 'all(fh)', you're iterating through
> every line in bigfile.gz. In other words, you're reading the
> file and scanning it for newlines rather than simply running the
> decompression operation.
The File:
----------------------------------------------------
[jeff@marvin ~]$ ls -alh junk.gz
-rw-rw-r-- 1 jeff jeff 113M 2009-01-26 10:42 junk.gz
[jeff@marvin ~]$
The 'zcat' time:
----------------------------------------------------
[jeff@marvin ~]$ time zcat junk.gz > /dev/null
real 0m2.390s
user 0m2.296s
sys 0m0.093s
[jeff@marvin ~]$
Test Script #1:
----------------------------------------------------
import sys
import gzip
# Decompress in fixed 8 KB chunks and copy each chunk straight to stdout.
fs = gzip.open('junk.gz')
data = fs.read(8192)
while data:
    sys.stdout.write(data)
    data = fs.read(8192)
Test Script #1 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test9.py >/dev/null
real 0m3.681s
user 0m3.201s
sys 0m0.478s
[jeff@marvin ~]$
Test Script #2:
----------------------------------------------------
import sys
import gzip
fs = gzip.open('junk.gz')
all(fs)
Test Script #2 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test10.py
real 1m51.764s
user 1m51.475s
sys 0m0.245s
[jeff@marvin ~]$
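Combined Sketch:
----------------------------------------------------
For anyone who wants to reproduce the comparison without a large file on hand, here is a minimal sketch that runs both read patterns against a small generated sample. It assumes Python 3.5 or later, which is not what the scripts above were run on. The helper names (make_sample, read_chunks, read_lines, timed) and the file name sample.gz are made up for illustration. Timings depend heavily on the Python version and the machine, so treat whatever it prints as indicative only.

```python
import gzip
import os
import tempfile
import time


def make_sample(path, lines=200000):
    """Write a small gzip file of text lines (hypothetical test data)."""
    with gzip.open(path, "wb") as out:
        for i in range(lines):
            out.write(b"line %d of sample text\n" % i)


def read_chunks(path, size=8192):
    """Decompress in fixed-size chunks, like Test Script #1; return bytes read."""
    total = 0
    with gzip.open(path, "rb") as fs:
        data = fs.read(size)
        while data:
            total += len(data)
            data = fs.read(size)
    return total


def read_lines(path):
    """Iterate line by line, like all(fs) in Test Script #2; return bytes read."""
    total = 0
    with gzip.open(path, "rb") as fs:
        for line in fs:
            total += len(line)
    return total


def timed(func, path):
    """Run func(path) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.gz")
        make_sample(sample)
        for func in (read_chunks, read_lines):
            nbytes, seconds = timed(func, sample)
            print("%-12s %d bytes in %.3fs" % (func.__name__, nbytes, seconds))
```

Both functions return the same byte count, so any difference in elapsed time comes from how the data is pulled out of the gzip stream, not from how much is decompressed.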