xreadlines (was Re: while true: !!!)

Fri Dec 15 07:58:25 EST 2000

"Neelakantan Krishnaswami" <neelk at alum.mit.edu> wrote in message
news:slrn93jagg.fl.neelk at alum.mit.edu...
> On Thu, 14 Dec 2000 11:14:38 +0100, Alex Martelli <aleaxit at yahoo.com> wrote:
> >> stdin. I've used fileinput to go through big lists of files (10,000+
> >> email messages) and it works great. It doesn't appear to do any
> >> buffering itself--it uses file.readline() to read the files.
> >
> > If this was a performance problem, it could of course also be fixed
> > in a future fileinput version without changing code that uses it (again,
> > in-place-rewriting would probably have to inhibit the optimization,
> > although that isn't entirely clear).
>
> While it's true that fileinput is somewhat slow, it can easily be
> made faster than the usual while 1: loop everyone uses. (Relatively

Yep, exactly my point -- and the chunking of readlines was exactly
what I had in mind here as the performance fix... thanks for doing
the actual work, which I lazily skipped!  With a larger buffer and
a more streamlined __getitem__ (no error-checking, optimized for
the most frequent case) I can get to roughly half the speed of
readlines(), i.e. about twice its elapsed time:

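class LinesOf:
    # Sequence-like wrapper for use in a for-loop: reads the file in
    # chunks via readlines(sizehint) instead of line by line.
    # Only sequential access (i = 0, 1, 2, ...) is supported.
    def __init__(self, file, chunkSize=256*1024):
        self.file = file
        self.chunkSize = chunkSize
        self.start = 0          # index of the first line held in self.data
        self.refill()
    def refill(self):
        # read about chunkSize bytes' worth of whole lines
        self.data = self.file.readlines(self.chunkSize)
    def __getitem__(self, i):
        try: return self.data[i-self.start]
        except IndexError:
            # current chunk exhausted: advance the offset and read the next
            self.start += len(self.data)
            self.refill()
            # an empty chunk means end of file; IndexError ends the for-loop
            if not self.data: raise IndexError
            return self.data[i-self.start]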

import time

def withReadlines(file):
    start = time.clock()
    i = 0
    bytes = 0
    for line in file.readlines():
        #i+=1
        #bytes+=len(line)
        pass
    stend = time.clock()
    return i, bytes, stend-start

def withLinesOf(file):
    start = time.clock()
    i = 0
    bytes = 0
    for line in LinesOf(file):
        #i+=1
        #bytes+=len(line)
        pass
    stend = time.clock()
    return i, bytes, stend-start

def test(filename):
    file=open(filename)
    print withReadlines(file)
    file.close()

    file=open(filename)
    print withLinesOf(file)
    file.close()

if __name__=='__main__':
    import sys
    try: filename = sys.argv[1]
    except IndexError: filename = 'aaa.py'
    test(filename)

The operations in the for-loops are commented out to ensure
we're not timing them too -- uncommenting them helps check
that LinesOf is actually working, and gives an idea of the
magnitude of the test:

D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(6072, 444748, 0.075308712333910496)
(6072, 444748, 0.14522679691782142)

The input is the 'Findings of Fact' HTML file from Microsoft's antitrust
case -- at 444748 bytes it is larger than the 256K chunk size, which
ensures that refilling is exercised.
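
For anyone who wants to check the refilling logic without a large file
at hand, here is a minimal sketch, assuming a current Python 3 rather
than the 2.0 used above. The generator lines_of is a hypothetical
stand-in for LinesOf (same readlines-with-size-hint chunking, different
packaging), and the tiny chunk size and synthetic input are chosen only
to force many refills:

```python
import io

def lines_of(file, chunk_size=256 * 1024):
    """Yield lines from `file`, fetched in batches via readlines(hint)."""
    while True:
        # readlines(hint) returns whole lines totalling roughly `hint` in size
        chunk = file.readlines(chunk_size)
        if not chunk:
            # an empty list means end of file
            return
        yield from chunk

# Synthetic stand-in for a large file: 1000 short lines.
text = "".join("line %d\n" % n for n in range(1000))

# A 64-character hint forces the generator to refill many times.
chunked = list(lines_of(io.StringIO(text), chunk_size=64))
whole = io.StringIO(text).readlines()

print(len(chunked), chunked == whole)
```

If the chunked reader is correct, both readers produce the same list of
lines, whatever the chunk size.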

OK, without the operations in the loop, we have:

D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(0, 0, 0.052779039576527305)
(0, 0, 0.10314101285470281)

D:\PySym>python aaa.py \winnt\profiles\martelli\personal\findin~1.htm
(0, 0, 0.052812563380942722)
(0, 0, 0.1092364785925366)

Best to run it twice to guard against cache effects -- the first time
I ran it, LinesOf appeared *FASTER*... because it runs _after_
readlines, so it benefited from OS caching of the file!-).
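
One way to take the cache effect out of the comparison is to do an
untimed warm-up pass before timing anything, then keep the best of a
few timed passes. The following is only a sketch under assumptions: it
targets a current Python 3 (time.perf_counter instead of time.clock),
the helper names time_pass and best_of are hypothetical, and it uses a
small temporary file rather than the HTML file above. Unlike the script
above, it counts lines inside the timed loop, so the loop overhead is
included in both measurements:

```python
import os
import tempfile
import time

def read_all(f):
    """Baseline: slurp every line at once."""
    return f.readlines()

def read_chunked(f, chunk_size=256 * 1024):
    """Chunked variant: fetch lines in batches via readlines(hint)."""
    while True:
        chunk = f.readlines(chunk_size)
        if not chunk:
            return
        yield from chunk

def time_pass(path, reader):
    """Return (line_count, seconds) for one pass over the file."""
    with open(path) as f:
        start = time.perf_counter()
        count = 0
        for _line in reader(f):
            count += 1
        return count, time.perf_counter() - start

def best_of(path, reader, repeats=3):
    """Warm the OS cache with one untimed pass, then keep the fastest pass."""
    time_pass(path, reader)
    passes = [time_pass(path, reader) for _ in range(repeats)]
    return min(passes, key=lambda r: r[1])

# Small synthetic input; substitute a real file path for a meaningful test.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("".join("line %d\n" % n for n in range(5000)))
tmp.close()
try:
    n_all, t_all = best_of(tmp.name, read_all)
    n_chunked, t_chunked = best_of(tmp.name, read_chunked)
    print(n_all, n_chunked)
    print("readlines: %.6f s, chunked: %.6f s" % (t_all, t_chunked))
finally:
    os.remove(tmp.name)
```

With the warm-up pass in place, the order in which the two readers are
timed should matter much less.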

Alex