[FEEDBACK] Is this script efficient...is there a better way?

Oren Tirosh oren-py-l at hishome.net
Thu Sep 12 06:02:14 EDT 2002

On Thu, Sep 12, 2002 at 02:31:44AM -0700, David LeBlanc wrote:
> <snip>
> > Have you actually tested this? I used to believe that larger buffers are
> > always better for performance.
> >
> > Wrong.
> >
> > I ran some tests on Linux for the effect of buffer size on file reading
> > speed and the results were very interesting. I started with a buffer size
> > of 32 bytes, tested file I/O throughput and increased it logarithmically
> > by about 2% for each step.  As expected, the time to read 1 MB improved
> > as the
> > buffer size increased until it hit a minimum for a buffer size of around
> > 4-8k (the graph is very noisy so it's hard to tell) and then rose back up
> > to a value that is 10-20% worse for buffer sizes of 32-64k and remained
> > more-or-less constant for anything higher.
> >
> > This performance curve may be a result of the CPU cache or the OS
> > architecture.
> >
> > The chunk size used by xreadlines is 8k which is about the optimum
> > value (at least for linux).
> >
> > 	Oren
> Disk cluster size? Fragmentation?

Irrelevant. The file was entirely cached in memory. This means that the
test was affected only by the CPU, caches, main memory bank and the way the 
OS architecture interacts with them.
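
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a buffer-size sweep. It is not the script used for the test
described above. The names time_reads, sweep and make_test_file are made up
for illustration, and it assumes that a file which was just written is still
in the OS page cache, which is likely but not guaranteed. The default step
factor of 1.02 mirrors the 2% steps mentioned earlier.

```python
import os
import tempfile
import time


def time_reads(path, buffer_size, repeats=3):
    """Return the best wall-clock time (seconds) to read path in chunks."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # buffering=0 turns off Python's own buffering, so every read()
        # asks the OS for at most buffer_size bytes.
        with open(path, "rb", buffering=0) as f:
            while f.read(buffer_size):
                pass
        best = min(best, time.perf_counter() - start)
    return best


def sweep(path, start=32, stop=64 * 1024, factor=1.02):
    """Time reads for buffer sizes growing geometrically from start to stop."""
    results = {}
    size = start
    while size <= stop:
        results[size] = time_reads(path, size)
        # Always advance by at least one byte so small sizes do not stall.
        size = max(size + 1, int(size * factor))
    return results


def make_test_file(num_bytes=1024 * 1024):
    """Write num_bytes of random data to a temporary file, return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(num_bytes))
    return path
```

Calling sweep(make_test_file()) returns a dict that maps each buffer size to
its best time. Taking the best of several repeats reduces noise, but the
shape of the curve will still depend on the machine and the OS.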

If the file isn't already in memory there may be some advantage to using
larger buffers - the disk readahead size used by the OS is a compromise
between reading more to improve performance and reading less to avoid
overhead in case the data wasn't needed after all. In this case the
performance curve might not have the dip in the 4-8k area and might keep
improving with buffer size. Even so, it should rapidly come very close to
its asymptotic limit. I doubt that a buffer size over 32-64k would make any
difference. Slurping 30 MB into memory only makes sense if you need to
access it multiple times. For a single-pass operation there will always be
a pretty small buffer size that will have equivalent or better performance.
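
To illustrate the single-pass case, here is a small sketch that counts
newlines in a file two ways: with a fixed 8k buffer and by slurping the
whole file. The function names are hypothetical and the task (counting
lines) is only an example, not the script under discussion. Both return the
same answer; the chunked version never holds more than buffer_size bytes of
file data at once.

```python
def count_lines_chunked(path, buffer_size=8192):
    """Count newline bytes by reading the file in fixed-size chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_lines_slurped(path):
    """Count newline bytes after reading the whole file into memory."""
    with open(path, "rb") as f:
        return f.read().count(b"\n")
```

Counting a single byte works cleanly across chunk boundaries. A pattern
longer than one byte could straddle two chunks and would need extra care.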

