[Python-Dev] RE: [Patches] [Patch #102915] xreadlines : readlines :: xrange : range

Guido van Rossum guido@digicool.com
Tue, 02 Jan 2001 09:56:40 -0500


Tim's almost as good at convincing me as he is at channeling me!  The
timings he showed almost convinced me that fileinput is hopeless and
xreadlines should be added.  But then I wrote a little timer of my
own...

I am including the timer program below my signature.  The test input
was the current access_log of dinsdale.python.org, which has about 119
Mbytes and 1M lines (as counted by the test program).

I measure about a factor of 3 between readlines with a sizehint (of
1 MB) and fileinput; a change to fileinput that uses readlines with a
sizehint and in-lines the common case in __getitem__ (as suggested by
Moshe) didn't make a difference.
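
The idea behind that change (and presumably also behind Neel's LinesOf
quoted below) is to hide a readlines(sizehint) buffer behind
__getitem__, so that a plain for loop never makes one readline() call
per line.  A minimal standalone sketch of the technique (the class
name and details are illustrative only, not the actual fileinput
patch):

class BufferedLines:
    # Sketch only: serves lines from a readlines(sizehint) buffer.
    # The for loop drives __getitem__ with 0, 1, 2, ... and stops
    # when IndexError is raised.
    def __init__(self, file, sizehint=1024*1024):
        self.file = file
        self.sizehint = sizehint
        self.buffer = []
        self.index = 0

    def __getitem__(self, i):
        if self.index >= len(self.buffer):
            # Buffer exhausted: do one bulk read.
            self.buffer = self.file.readlines(self.sizehint)
            self.index = 0
            if not self.buffer:
                raise IndexError
        line = self.buffer[self.index]
        self.index = self.index + 1
        return line

Usage would simply be "for line in BufferedLines(open(fn)): ...".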

Output (the first time is realtime seconds, the second CPU seconds):

total 119808333 chars and 1009350 lines
count_chars_lines     7.944  7.890
readlines_sizehint    5.375  5.320
using_fileinput      15.861 15.740
while_readline        8.648  8.570

This was on a 600 MHz Pentium-III Linux box (RH 6.2).

Note that count_chars_lines and readlines_sizehint use the same
algorithm -- the difference is that readlines_sizehint uses 'pass' as
the inner loop body, while count_chars_lines adds two counters.

Given that very light per-line processing (counting lines and
characters) already increases the time considerably (about 2.5 CPU
seconds over a million lines, or roughly 2.5 microseconds of pure
Python work per line), I'm not sure I buy the argument that the I/O
overhead is always the dominant cost.  The
fact that my change to fileinput.py didn't make a difference suggests
that its lack of speed is purely caused by the Python code.

Now what to do?  I still don't like xreadlines very much, but I do see
that it can save some time.  But my test doesn't confirm Neel's times
as posted by Tim:

> Slowest: for line in fileinput.input('foo'):     # Time 100
>        : while 1: line = file.readline()         # Time 75
>        : for line in LinesOf(open('foo')):       # Time 25
> Fastest: for line in file.readlines():           # Time 10
>          while 1: lines = file.readlines(hint)   # Time 10
>          for line in xreadlines(file):           # Time 10

I only see a factor of 3 between fastest and slowest, and
while_readline is only about 60% slower than readlines_sizehint.

--Guido van Rossum (home page: http://www.python.org/~guido/)

import time, fileinput, sys

def timer(func, *args):
    # Run func(*args) once and print wall-clock and CPU seconds used.
    t0 = time.time()
    c0 = time.clock()
    func(*args)
    t1 = time.time()
    c1 = time.clock()
    print "%-20s %6.3f %6.3f" % (func.__name__, t1-t0, c1-c0)

def count_chars_lines(fn, bs=1024*1024):
    # Same readlines(sizehint) loop as readlines_sizehint, but the
    # inner loop counts lines and characters.
    nl = 0
    nc = 0
    f = open(fn, "r")
    while 1:
        buf = f.readlines(bs)
        if not buf:
            break
        for line in buf:
            nl += 1
            nc += len(line)
    f.close()
    print "total", nc, "chars and", nl, "lines"

def readlines_sizehint(fn, bs=1024*1024):
    # Bulk reads with readlines(sizehint); the per-line body is 'pass'.
    f = open(fn, "r")
    while 1:
        buf = f.readlines(bs)
        if not buf:
            break
        for line in buf:
            pass
    f.close()

def using_fileinput(fn):
    # Stock fileinput.FileInput, fetching one line per iteration.
    f = fileinput.FileInput(fn)
    for line in f:
        pass
    f.close()

def while_readline(fn):
    # The classic idiom: one readline() call per line.
    f = open(fn, "r")
    while 1:
        line = f.readline()
        if not line:
            break
        pass
    f.close()

fn = "/home/guido/access_log"
if sys.argv[1:]:
    fn = sys.argv[1]
timer(count_chars_lines, fn)
timer(readlines_sizehint, fn, 1024*1024)
timer(using_fileinput, fn)
timer(while_readline, fn)
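
If the xreadlines patch were applied, a fifth case could be timed
along these lines (a sketch only; it assumes the patch's module-level
interface, an xreadlines() function returning a lazy, buffered
sequence of lines):

def using_xreadlines(fn):
    # Assumes the xreadlines module proposed in patch #102915.
    import xreadlines
    f = open(fn, "r")
    for line in xreadlines.xreadlines(f):
        pass
    f.close()

timer(using_xreadlines, fn)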