xreadlines (was Re: while true: !!!)
Neelakantan Krishnaswami
neelk at alum.mit.edu
Thu Dec 14 23:17:45 EST 2000
On Thu, 14 Dec 2000 11:14:38 +0100, Alex Martelli <aleaxit at yahoo.com> wrote:
>> stdin. I've used fileinput to go through big lists of files (10,000+ email
>> messages) and it works great. It doesn't appear to do any buffering
>> itself--it uses file.readline() to read the files.
>
> If this was a performance problem, it could of course also be fixed
> in a future fileinput version without changing code that uses it (again,
> in-place-rewriting would probably have to inhibit the optimization,
> although that isn't entirely clear).
While it's true that fileinput is somewhat slow, it can easily be
made faster than the usual while 1: loop everyone uses. (Relatively
speaking -- if you have the RAM, everything loses vis-a-vis
readlines().) Concretely, here's a wrapper class that's *faster* than
the traditional while 1: loop, because it uses the readlines sizehint
trick that everyone on c.l.p describes but no one ever seems to use in
practice.
class Pita:
    """Pita(file)

    This class wraps an iterator around a file-like object so you can
    loop over a file's lines with a for loop. Eg:

    >>> for line in Pita(open('foo.txt')):
    ...     print line[:-1]
    ...
    """
    def __init__(self, file, hint=16384):
        self.file = file
        self.current = 0      # Current line number
        self.chunk = hint     # Chunk size to read
        self.buf = self.file.readlines(self.chunk)  # A buffer of lines
        self.i = 0            # index into buffer

    def __getitem__(self, n):
        if n == self.current:
            self.current = self.current + 1
        else:
            raise KeyError, "Attempt to read stream out of order"
        #
        # If we've reached the end of the buffer, grab a new chunk
        #
        if self.i == len(self.buf):
            self.buf = self.file.readlines(self.chunk)
            self.i = 0
        #
        # An empty buffer -> No more lines in the file
        #
        if self.buf:
            line = self.buf[self.i]
            self.i = self.i + 1
            return line
        else:
            raise IndexError
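
For anyone on a newer interpreter, the same sizehint idea can be sketched
as a generator. This is only a rough sketch assuming Python 3 (not the
1.5.2 used for the class above); the name chunked_lines is made up for
illustration and is not part of Pita or the standard library.

```python
import io


def chunked_lines(file, hint=16384):
    """Yield lines from a file-like object, one buffered chunk at a time.

    chunked_lines is a hypothetical helper name, not from the original post.
    """
    while True:
        # readlines(hint) returns whole lines totalling roughly `hint` in size
        buf = file.readlines(hint)
        if not buf:
            # An empty list means there are no more lines in the file
            return
        for line in buf:
            yield line


if __name__ == "__main__":
    # Small in-memory file so the sketch runs without touching the disk
    sample = io.StringIO("first\nsecond\nthird\n")
    for line in chunked_lines(sample, hint=8):
        print(line[:-1])
```

The buffer handling is the same as in Pita: ask readlines() for about one
chunk's worth of lines, hand them out one by one, and stop when it comes
back empty.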
Some timing results on a roughly 5.5 MB text file (Python 1.5.2;
I haven't upgraded yet):
>>> test('/home/neelk/INBOX')
fileinput = 3.565142
while 1: = 2.653460
Pita = 1.128891
readlines() = 0.337859
The Pita class is a little over three times as fast as FileInput and
more than twice as fast as while 1:, and readlines() is over three
times as fast as that. FileInput is the slowest of the lot -- but it
looks like using the sizehint trick could bring it up to Pita speed.
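
The ratios can be checked directly from the numbers printed above. A
small sketch, assuming Python 3; the speedup helper is just an
illustrative name, and the timings are copied from the run in this post.

```python
# Timings in seconds, copied from the test run in this post
timings = {
    "fileinput": 3.565142,
    "while 1:": 2.653460,
    "Pita": 1.128891,
    "readlines()": 0.337859,
}


def speedup(slow, fast):
    """Return how many times faster `fast` ran than `slow`."""
    return timings[slow] / timings[fast]


pairs = [
    ("fileinput", "Pita"),
    ("while 1:", "Pita"),
    ("Pita", "readlines()"),
]
for slow, fast in pairs:
    print("%s is %.2fx as fast as %s" % (fast, speedup(slow, fast), slow))
```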
Here's the test driver:
import fileinput
import time

def test(filename):
    # fileinput module
    t1 = time.time()
    for i in fileinput.FileInput(filename):
        pass
    t2 = time.time()
    print "fileinput = %f" % (t2 - t1)
    #
    # Traditional while 1: / readline() loop
    f = open(filename)
    t1 = time.time()
    while 1:
        line = f.readline()
        if not line: break
    t2 = time.time()
    f.close()
    print "while 1: = %f" % (t2 - t1)
    #
    # Pita wrapper (readlines with sizehint)
    t1 = time.time()
    for i in Pita(open(filename)):
        pass
    t2 = time.time()
    print "Pita = %f" % (t2 - t1)
    #
    # Whole file at once with readlines()
    t1 = time.time()
    for i in open(filename).readlines():
        pass
    t2 = time.time()
    print "readlines() = %f" % (t2 - t1)
Neel