xreadlines (was Re: while true: !!!)

Thu Dec 14 23:17:45 EST 2000

On Thu, 14 Dec 2000 11:14:38 +0100, Alex Martelli <aleaxit at yahoo.com> wrote:
>> stdin. I've used fileinput to go through big lists of files (10,000+ email
>> messages) and it works great. It doesn't appear to do any buffering
>> itself--it uses file.readline() to read the files.
>
> If this was a performance problem, it could of course also be fixed
> in a future fileinput version without changing code that uses it (again,
> in-place-rewriting would probably have to inhibit the optimization,
> although that isn't entirely clear).

While it's true that fileinput is somewhat slow, it can easily be
made faster than the usual while 1: loop everyone uses. (Relatively
speaking -- if you have the RAM, everything loses vis-a-vis
readlines().) Concretely, here's a wrapper class that's *faster* than
the traditional while 1: loop, because it uses the readlines sizehint
trick that everyone on c.l.p describes but no one ever seems to use in
practice.

class Pita:
    """Pita(file)

    This class wraps a file-like object in a sequence interface so you
    can loop over a file's lines with a for loop. E.g.:

    >>> for line in Pita(open('foo.txt')):
    ...     print line[:-1]
    ...
    """
    def __init__(self, file, hint=16384):
        self.file = file
        self.current = 0                           # Current line number
        self.chunk = hint                          # Chunk size to read
        self.buf = self.file.readlines(self.chunk) # A buffer of lines
        self.i = 0                                 # index into buffer
    def __getitem__(self, n):
        if n == self.current:
            self.current = self.current + 1
        else:
            raise KeyError, "Attempt to read stream out of order"
        #
        # If we've reached the end of the buffer, grab a new chunk
        #
        if self.i == len(self.buf):
            self.buf = self.file.readlines(self.chunk)
            self.i = 0
        #
        # An empty buffer -> No more lines in the file
        #
        if self.buf: 
            line = self.buf[self.i]
            self.i = self.i + 1
            return line
        else: 
            raise IndexError
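
For comparison, here is roughly what the bare sizehint loop looks like
without the wrapper. This is only a sketch: count_lines is a made-up
name for illustration, and it assumes f is a file object whose
readlines() accepts a size hint.

```python
def count_lines(f, hint=16384):
    """Count the lines in file object f, reading about hint bytes at a time."""
    n = 0
    while 1:
        # readlines(hint) returns a list of whole lines totalling
        # roughly hint bytes, so memory use stays bounded.
        lines = f.readlines(hint)
        if not lines:
            # An empty list -> no more lines in the file
            break
        for line in lines:
            n = n + 1
    return n
```

Something like count_lines(open('foo.txt')) would then walk the file in
chunks of about 16K, the same default Pita uses.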

Some timing results, in seconds, on a roughly 5.5 MB text file
(Python 1.5.2; I haven't upgraded yet):

>>> test('/home/neelk/INBOX')
fileinput = 3.565142
while 1: = 2.653460
Pita = 1.128891
readlines() = 0.337859

The Pita class is about three times as fast as FileInput and more than
twice as fast as while 1:, and readlines() is more than three times as
fast as that. FileInput is the slowest of the lot -- but it looks like
using the sizehint trick could bring it up to Pita speed.

Here's the test driver:

import fileinput, time

def test(filename):
    t1 = time.time()
    for i in fileinput.FileInput(filename):
        pass
    t2 = time.time()
    print "fileinput = %f" % (t2 - t1)
    #
    f = open(filename)
    t1 = time.time()
    while 1:
        line = f.readline()
        if not line: break
    t2 = time.time()
    f.close()
    print "while 1: = %f" % (t2 - t1)
    #
    t1 = time.time()
    for i in Pita(open(filename)):
        pass
    t2 = time.time()
    print "Pita = %f" % (t2 - t1)
    #
    t1 = time.time()
    for i in open(filename).readlines():
        pass
    t2 = time.time()
    print "readlines() = %f" % (t2 - t1)

Neel