[issue20962] Rather modest chunk size in gzip.GzipFile

Skip Montanaro report at bugs.python.org
Mon Mar 17 19:58:47 CET 2014


New submission from Skip Montanaro:

I've had the opportunity to use the seek() method of the gzip.GzipFile class for the first time in the past few days. Wondering why my processing times seemed so slow, I took a look at the code for seek() and read(). The chunk size for reading (1024 bytes) seems rather small. I created a simple subclass that overrides just seek() and read(), and defined a CHUNK_SIZE of 16 * 8192 bytes (128 KiB). The whole point of compressing files is that they are large, right? It seems like most of the time we will want to seek pretty far through the file.
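To illustrate the idea (this is not the attached gzipseek.py, only a minimal Python 3 sketch), a forward seek in a read-mode GzipFile amounts to decompressing and discarding data until the target offset is reached, so the read size controls how many calls that takes. The helper name seek_by_reading and the sample payload are hypothetical; CHUNK_SIZE uses the value quoted above.

```python
import gzip
import io

CHUNK_SIZE = 16 * 8192  # 128 KiB, the value quoted in the report


def seek_by_reading(f, target, chunk_size=CHUNK_SIZE):
    """Move a read-mode GzipFile to uncompressed offset `target`.

    Hypothetical helper: it reads and discards data in `chunk_size`
    pieces instead of relying on the library's own seek() loop.
    """
    if target < f.tell():
        f.seek(0)  # gzip data can only be decompressed forward, so rewind
    remaining = target - f.tell()
    while remaining > 0:
        data = f.read(min(chunk_size, remaining))
        if not data:
            break  # reached end of file before the target
        remaining -= len(data)
    return f.tell()


# Usage on an in-memory archive (sample data made up for illustration).
payload = bytes(range(256)) * 4096
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as writer:
    writer.write(payload)
raw.seek(0)

reader = gzip.GzipFile(fileobj=raw, mode="rb")
position = seek_by_reading(reader, 500000)
chunk = reader.read(4)  # the four bytes starting at offset 500000
```

Whether a larger chunk helps depends on the Python version and the workload, so the chunk size is left as a parameter to make it easy to compare values.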

Over a small subset of my inputs, I measured about a 2x decrease in run times, from about 54s to 26s. I ran using both gzip.GzipFile and my subclass several times, measuring the last four runs (two using the stdlib implementation, two using my subclass). I measured the total time of the run, the time to process each input record, and the time to execute just the seek() call for each record. The bulk of the per-record time was in the call to seek(), so by reducing that time, I sped up my run times significantly.
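A measurement along these lines could be sketched as follows. This is not the harness used for the numbers above; the function name time_seeks, the sample data, and the offsets are assumptions for illustration, and time.perf_counter() requires Python 3.3 or later.

```python
import gzip
import io
import time


def time_seeks(f, offsets):
    """Seek `f` to each offset and return (total_seconds, per_seek_seconds)."""
    per_seek = []
    for offset in offsets:
        start = time.perf_counter()
        f.seek(offset)
        per_seek.append(time.perf_counter() - start)
    return sum(per_seek), per_seek


# Build a small in-memory gzip file to exercise the timer (made-up data).
sample = b"record\n" * 200000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as writer:
    writer.write(sample)
buf.seek(0)

gz = gzip.GzipFile(fileobj=buf, mode="rb")
offsets = [100000, 700000, 1300000]
total, timings = time_seeks(gz, offsets)
```

Running the same offsets against two implementations and comparing the totals gives the kind of side-by-side figure quoted above.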

I'm still using 2.7, but other than the usual 2.x->3.x changes, the code looks pretty much the same between 2.7 and (at least) 3.3, and the logic involving the read size doesn't seem to have changed at all.

I'll try to produce a patch if I have a few minutes, but in the meantime, I've attached my modified GzipFile class (produced against 2.7).

----------
components: Library (Lib)
files: gzipseek.py
messages: 213883
nosy: skip.montanaro
priority: normal
severity: normal
status: open
title: Rather modest chunk size in gzip.GzipFile
type: performance
versions: Python 3.4, Python 3.5
Added file: http://bugs.python.org/file34466/gzipseek.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20962>
_______________________________________

