Looking under Python's hood: Will we find a high performance or clunky engine?

Mon Jan 23 04:11:39 EST 2012

On Sun, 22 Jan 2012 07:50:59 -0800, Rick Johnson wrote:

> What does Python do when presented with this code?
> 
> py> [line.strip('\n') for line in f.readlines()]
> 
> If Python reads all the file lines first and THEN iterates AGAIN to do
> the strip; we are driving a Fred flintstone mobile.

Nonsense. File-like objects offer two APIs: there is a lazy iterator 
approach, using the file-like object itself as an iterator, and an eager 
read-it-all-at-once approach, offered by the venerable readlines() 
method. readlines *deliberately* reads the entire file, and if you as a 
developer do so by accident, you have no-one to blame but yourself. Only 
a poor tradesman blames his tools instead of taking responsibility for 
learning how to use them himself.
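
As a rough sketch of the difference (the temporary file and its
contents are made up purely so the example runs on its own), the two
APIs look like this side by side. Both give the same result; they
differ only in whether the full list of raw lines exists first.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("spam\neggs\nham\n")
    path = tmp.name

# Eager: readlines() builds the complete list of lines first, and the
# list comprehension then walks that list a second time.
with open(path) as f:
    eager = [line.strip('\n') for line in f.readlines()]

# Lazy: iterating over the file object itself yields one line at a
# time, so no intermediate list of raw lines is built.
with open(path) as f:
    lazy = [line.strip('\n') for line in f]

os.remove(path)
```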

You should use whichever approach is more appropriate for your situation. 
You might want to consider reading from the file as quickly as possible, 
in one big chunk if you can, so you can close it again and let other 
applications have access to it. Or you might not care. The choice is 
yours.

For small files, readlines() will probably be faster, although for small 
files it won't make much practical difference. Who cares whether it takes 
0.01ms or 0.02ms? For medium-sized files, say, a few thousand lines, it 
could go either way, depending on memory use, the size of the internal 
file buffer, and implementation details. Only for files large enough that 
allocating memory for all the lines at once becomes significant will lazy 
iteration be a clear winner.
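
If you want numbers for your own machine rather than my guesses,
something along these lines should do. Treat it as a sketch: the
helper names (strip_eager, strip_lazy, compare) and the file sizes are
my own choices for illustration, and the timings will vary with disk
cache, buffer size and Python version.

```python
import os
import tempfile
import timeit

def strip_eager(path):
    # Read every line into a list, then strip each one.
    with open(path) as f:
        return [line.strip('\n') for line in f.readlines()]

def strip_lazy(path):
    # Strip lines as the file object hands them over.
    with open(path) as f:
        return [line.strip('\n') for line in f]

def compare(num_lines, repeats=20):
    """Return (eager_seconds, lazy_seconds) for a temp file of num_lines."""
    with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
        tmp.writelines("line %d\n" % i for i in range(num_lines))
        path = tmp.name
    try:
        eager = timeit.timeit(lambda: strip_eager(path), number=repeats)
        lazy = timeit.timeit(lambda: strip_lazy(path), number=repeats)
    finally:
        os.remove(path)
    return eager, lazy

if __name__ == "__main__":
    for n in (10, 5000):
        print(n, compare(n))
```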

But if the file is that big, are you sure that a list comprehension is 
the right tool in the first place?
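
One possible alternative, and this is only my illustration of the
point, is a generator expression that handles each line and then
throws it away, so the full list never exists. The function name and
the sample data here are hypothetical.

```python
import io

def longest_line_length(lines):
    # Generator expression: each stripped line is produced, measured
    # and discarded; no list of lines is ever built.
    return max((len(line.strip('\n')) for line in lines), default=0)

# io.StringIO stands in for a real (possibly huge) file object.
sample = io.StringIO("spam\neggs and ham\n\n")
result = longest_line_length(sample)
```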

In general, you should not care greatly which of the two you use, unless 
profiling your application shows that this is the bottleneck.

But it is extremely unlikely that copying even a few thousand lines 
around in memory will be slower than reading them from disk in the first 
place. Unless you expect to be handling truly large files, you've got 
more important things to optimize before wasting time caring about this.

-- 
Steven