[Python-Dev] RE: [Patches] [Patch #102915] xreadlines : readlines :: xrange : range
Tim Peters
tim.one@home.com
Tue, 2 Jan 2001 23:33:29 -0500
[Guido, writes a timing program]
[Jeff, if you weren't copied on all this stuff, you can play catch-up
by reading the archives, at
http://mail.python.org/pipermail/python-dev/
]
> ...
> I am including the timer program below my signature. The test input
> was the current access_log of dinsdale.python.org, which has about 119
> Mbytes and 1M lines (as counted by the test program).
For a contrast, I cobbled together a large test file out of various chunks
of C source, .py source, HTML source, and email archives. I was shooting
for the same size you used (~119MB), but ended up with more than 3x as many
lines.
> I measure about a factor of 2 between readlines with a sizehint (of 1
> MB) and fileinput;
Factor of 7 here (Jeff, NeilS eventually figured out that Guido was using a
CVS version of Python that has AndrewK's glibc getline patch, a zippier
line-input routine than Python 2.0 has; but it only applies to platforms
using glibc).
> ...
> Output (the first time is realtime seconds, the second CPU seconds):
>
> total 119808333 chars and 1009350 lines
> count_chars_lines     7.944  7.890
> readlines_sizehint    5.375  5.320
> using_fileinput      15.861 15.740
> while_readline        8.648  8.570
>
> This was on a 600 MHz Pentium-III Linux box (RH 6.2).
total 117615824 chars and 3237568 lines
count_chars_lines    14.780 14.772
readlines_sizehint    9.390  9.375
using_fileinput      66.130 66.157
while_readline       30.380 30.337
866 MHz P3 Win98SE, current CVS Python. I have no handy explanation for why
clock() and time() differ on my box (Win98 has no notions of "user time" or
"CPU time" distinct from clock time).
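The two columns in the outputs above are realtime seconds and CPU seconds. The timer program itself is not reproduced in this message, so what follows is only a sketch of how such a pair of numbers can be collected in a current Python. It assumes time.perf_counter() for wall-clock time and time.process_time() for CPU time as stand-ins for the time() and clock() calls mentioned above; the helper names timed and report are hypothetical.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall0 = time.perf_counter()   # wall-clock ("realtime") timer
    cpu0 = time.process_time()    # CPU time charged to this process
    result = func(*args)
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return result, wall, cpu

def report(name, wall, cpu):
    # Hypothetical formatter: test name, realtime seconds, CPU seconds.
    return "%-20s %7.3f %7.3f" % (name, wall, cpu)
```

For example, `_, wall, cpu = timed(sum, range(10**6))` followed by `print(report("sum_range", wall, cpu))` prints one line in the same three-column layout.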
> Note that count_chars_lines and readlines_sizehint use the same
> algorithm -- the difference is that readlines_sizehint uses 'pass' as
> the inner loop body, while count_chars_lines adds two counters.
>
> Given that very light per-line processing (counting lines and
> characters) already increases the time considerably, I'm not sure I
> buy the arguments that the I/O overhead is always considerable.
I disagree that this is "very light processing", although I agree it's hard
to think of lighter processing <wink>: it's a few Python statements per
line, which I'd say is pretty *typical* processing. Read a line, run a
string find or regexp search on it, test the result, sometimes fiddle the
line accordingly and sometimes not. File-crunching apps generally aren't
rocket science! For example, I changed count_chars_lines to tally the
number of lines containing the string "Guido" instead, and the runtime went
up by just 0.8 seconds (BTW, it found 13808 of them <wink>): if you're
thinking in C terms, millions of failing searches for "Guido" may seem like
more work, but the number of Python stmts executed usually counts more than
what the stmts do at the C level.
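A sketch of the two per-line bodies being compared, written for a current Python. This is an assumption about the shape of the loops, not Guido's actual program: both functions read batches of lines with readlines and a size hint of about 1 MB, and differ only in the few statements run per line. The name count_matching_lines and the sizehint and needle parameters are hypothetical.

```python
def count_chars_lines(path, sizehint=1 << 20):
    """Tally characters and lines, reading roughly sizehint bytes per batch."""
    nchars = nlines = 0
    with open(path) as f:
        while True:
            lines = f.readlines(sizehint)  # a batch of whole lines
            if not lines:                  # empty list means end of file
                break
            for line in lines:
                nchars += len(line)
                nlines += 1
    return nchars, nlines

def count_matching_lines(path, needle="Guido", sizehint=1 << 20):
    """Tally the lines that contain needle, using the same batching loop."""
    count = 0
    with open(path) as f:
        while True:
            lines = f.readlines(sizehint)
            if not lines:
                break
            for line in lines:
                if needle in line:         # substring search done at C speed
                    count += 1
    return count
```

Either way the inner loop runs only a couple of Python statements per line, which is why swapping one body for the other would be expected to change the runtime only modestly.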
> ...
> Now what to do? I still don't like xreadlines very much, but I do
> see that it can save some time. But my test doesn't confirm Neel's
> times as posted by Tim:
>
>> Slowest: for line in fileinput.input('foo'):     # Time 100
>>        : while 1: line = file.readline()         # Time 75
>>        : for line in LinesOf(open('foo')):       # Time 25
>> Fastest: for line in file.readlines():           # Time 10
>>          while 1: lines = file.readlines(hint)   # Time 10
>>          for line in xreadlines(file):           # Time 10
>
> I only see a factor of 3 between fastest and slowest, and
> readline is only about 60% slower than readlines_sizehint.
I don't know what Neel used for an input file, or which platform he ran on.
And this is bound to vary a lot across platforms. As above, I saw
a factor of 7 between fastest and slowest and a factor of 3 between readline
and readlines_sizehint.
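For reference, the line-at-a-time idioms being compared can be sketched as below for a current Python. These are assumed shapes of the loops, not the timer program itself, and each body just counts lines. The third function, iterate_file, is hypothetical: it stands in for the lazy xreadlines-style loop by iterating over the file object directly, since the xreadlines module does not exist in current Python.

```python
import fileinput

def while_readline(path):
    """One readline() call per line, stopping at the empty string."""
    n = 0
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:           # '' signals end of file
                break
            n += 1
    return n

def using_fileinput(path):
    """Let the fileinput module hand over the lines one at a time."""
    n = 0
    with fileinput.input(files=[path]) as stream:
        for line in stream:
            n += 1
    return n

def iterate_file(path):
    """Iterate lazily over the file object itself (hypothetical stand-in)."""
    n = 0
    with open(path) as f:
        for line in f:
            n += 1
    return n
```

Relative timings for these loops depend heavily on the platform, the Python version, and the input file, so no expected numbers are given here.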
BTW, on my platform the Perl script (using a recent ActiveState Windows
Perl)
    open(FILE, "ga.txt");
    while (<FILE>) {
        1;
    }
ran in about 6 seconds (I never figured out how to get Perl to compute usable
timings itself), substantially faster than even readlines_sizehint! And
changing the body to
    $nc = $nl = 0;
    while (<FILE>) {
        ++$nl;
        $nc += length;
    }
    print "$nc $nl\n";
boosted that to about 8 seconds. So Perl has gotten zippier too over the
years.