[Python-Dev] RE: [Patches] [Patch #102915] xreadlines : readlines :: xrange : range

Tim Peters tim.one@home.com
Tue, 2 Jan 2001 23:33:29 -0500


[Guido writes a timing program]

[Jeff, if you weren't copied on all this stuff, you can play catch-up
 by reading the archives, at
    http://mail.python.org/pipermail/python-dev/
]

> ...
> I am including the timer program below my signature.  The test input
> was the current access_log of dinsdale.python.org, which has about 119
> Mbytes and 1M lines (as counted by the test program).

For a contrast, I cobbled together a large test file out of various chunks
of C source, .py source, HTML source, and email archives.  I was shooting
for the same size you used (~119Mb), but ended up with more than 3x as many
lines.

> I measure about a factor of 2 between readlines with a sizehint (of 1
> MB) and fileinput;

Factor of 7 here (Jeff, NeilS eventually figured out that Guido was using a
CVS version of Python that has AndrewK's glibc getline patch, a zippier
line-input routine than Python 2.0 has; but it only applies to platforms
using glibc).

> ...
> Output (the first time is realtime seconds, the second CPU seconds):
>
> total 119808333 chars and 1009350 lines
> count_chars_lines     7.944  7.890
> readlines_sizehint    5.375  5.320
> using_fileinput      15.861 15.740
> while_readline        8.648  8.570
>
> This was on a 600 MHz Pentium-III Linux box (RH 6.2).

total 117615824 chars and 3237568 lines
count_chars_lines    14.780 14.772
readlines_sizehint    9.390  9.375
using_fileinput      66.130 66.157
while_readline       30.380 30.337

866 MHz P3 Win98SE, current CVS Python.  I have no handy explanation for why
clock() and time() differ on my box (Win98 has no notions of "user time" or
"CPU time" distinct from clock time).

> Note that count_chars_lines and readlines_sizehint use the same
> algorithm -- the difference is that readlines_sizehint uses 'pass' as
> the inner loop body, while count_chars_lines adds two counters.
>
> Given that very light per-line processing (counting lines and
> characters) already increases the time considerably, I'm not sure I
> buy the arguments that the I/O overhead is always considerable.
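For reference, those two loops are presumably along these lines -- a sketch
reconstructed from the function names and the 1 MB sizehint mentioned above,
not Guido's actual code:

def readlines_sizehint(f):
    while 1:
        lines = f.readlines(1000000)    # ~1 MB worth of lines per call
        if not lines:
            break
        for line in lines:
            pass

def count_chars_lines(f):
    nchars = nlines = 0
    while 1:
        lines = f.readlines(1000000)
        if not lines:
            break
        for line in lines:
            nlines = nlines + 1
            nchars = nchars + len(line)
    print "total", nchars, "chars and", nlines, "lines"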

I disagree that this is "very light processing", although I agree it's hard
to think of lighter processing <wink>:  it's a few Python statements per
line, which I'd say is pretty *typical* processing.  Read a line, run a
string find or regexp search on it, test the result, sometimes fiddle the
line accordingly and sometimes not.  File-crunching apps generally aren't
rocket science!  For example, I changed count_chars_lines to tally the
number of lines containing the string "Guido" instead, and the runtime went
up by just 0.8 seconds (BTW, it found 13808 of them <wink>):  if you're
thinking in C terms, millions of failing searches for "Guido" may seem like
more work, but the number of Python stmts executed usually counts more than
what the stmts do at the C level.
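The changed loop amounted to something like this (a sketch of the kind of
change, not the actual edit; the sizehint is carried over from above):

def count_guido_lines(f):
    nguido = 0
    while 1:
        lines = f.readlines(1000000)
        if not lines:
            break
        for line in lines:
            # string method find returns -1 when the substring isn't there
            if line.find("Guido") >= 0:
                nguido = nguido + 1
    print nguido, "lines containing 'Guido'"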

> ...
> Now what to do?  I still don't like xreadlines very much, but I do
> see that it can save some time.  But my test doesn't confirm Neel's
> times as posted by Tim:
>
>> Slowest: for line in fileinput.input('foo'):     # Time 100
>>        : while 1: line = file.readline()         # Time 75
>>        : for line in LinesOf(open('foo')):       # Time 25
>> Fastest: for line in file.readlines():           # Time 10
>>          while 1: lines = file.readlines(hint)   # Time 10
>>          for line in xreadlines(file):           # Time 10
>
> I only see a factor of 3 between fastest and slowest, and
> readline is only about 60% slower than readlines_sizehint.

I don't know what Neel used for an input file, or which platform he used
either.  And this is bound to vary a lot across platforms.  As above, I saw
a factor of 7 between fastest and slowest and a factor of 3 between readline
and readlines_sizehint.
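For concreteness, the xreadlines idiom in Neel's table would presumably be
spelled about like this once the patch lands (an assumption -- the exact
interface is what's under discussion; "ga.txt" is just my test file):

import xreadlines

f = open("ga.txt")
for line in xreadlines.xreadlines(f):
    pass
f.close()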

BTW, on my platform the Perl script (using a recent ActiveState Windows
Perl)

open(FILE, "ga.txt");
while (<FILE>) {
    1;
}

ran in about 6 seconds (I never figured out how to get Perl to compute usable
timings itself) -- substantially faster than even readlines_sizehint! -- and
changing the body to

$nc = $nl = 0;
while (<FILE>) {
    ++$nl;
    $nc += length;
}
print "$nc $nl\n";

boosted that to about 8 seconds.  So Perl has gotten zippier too over the
years.