Lazy file.readlines()?
Neil Schemenauer
nascheme at ucalgary.ca
Wed Sep 15 19:21:02 EDT 1999
Fredrik Lundh <fredrik at pythonware.com> wrote:
>Hrvoje Niksic <hniksic at srce.hr> wrote:
>> > - reading a file one line at a time (self.__fp.readline())
>>
>> I don't see an alternative to this, except to read the whole file at
>> once, which I am trying to avoid, as the files are large.
>
>note that:
>
>    lines = fp.readlines(16384)
>    if not lines:
>        break
>    for line in lines:
>        ...
>
>is usually much faster than
>
>    line = fp.readline()
>    if not line:
>        break
>    ...
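
The two quoted fragments leave out the enclosing loop. Written out in full for a current Python 3 (an assumption, since the fragments date from 1999), the hinted readlines() loop might look roughly like the sketch below. The function name count_lines_hinted and the in-memory sample are made up for illustration.

```python
import io

def count_lines_hinted(fp, hint=16384):
    """Count lines, pulling them in batches of roughly `hint` characters."""
    count = 0
    while True:
        # readlines(hint) stops collecting lines once their total size
        # reaches about `hint`, so only one batch is in memory at a time.
        lines = fp.readlines(hint)
        if not lines:
            break
        for line in lines:
            count += 1
    return count

# Hypothetical sample data standing in for a large file.
sample = io.StringIO("one\ntwo\nthree\n" * 1000)
total = count_lines_hinted(sample, hint=64)
print(total)
```

The hint is only approximate: each batch ends on a line boundary, so no line is ever split across two batches.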

What would be cool is if readlines() returned a lazy sequence
object (i.e. one that reads only as much as is needed, one block
at a time). This should give the speed advantage of readlines()
without the concern about sucking up a huge file all at once.

I implemented this idea (probably badly) in pure Python
and got about a 2x speedup versus readline(). It is a small
module, so I will post it here.

    being-lazy-has-its-advantages'ly  Neil

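import string

class BlockFile:
    # Lazy line sequence: reads the file one block at a time and
    # hands out the lines of the current block (without the separator).
    def __init__(self, file, blocksize=1024*40, sep='\n'):
        self.file = file
        self.blocksize = blocksize
        self.sep = sep
        self.line = -1    # index of the last line returned from self.lines
        self.lines = []   # complete lines of the current block
        self.end = ''     # partial line carried over from the previous block

    def __getitem__(self, i):
        # The index is ignored; a for loop just asks for successive
        # items until IndexError is raised.
        try:
            self.line = self.line+1
            return self.lines[self.line]
        except IndexError:
            self.line = 0
            self._get_block()
            return self.lines[0]

    def _get_block(self):
        data = self.file.read(self.blocksize)
        if len(data) == 0:
            raise IndexError
        self.lines = string.split(data, self.sep)
        # the carried-over partial line comes before the new data
        self.lines[0] = self.end + self.lines[0]
        if len(data) == self.blocksize:
            self.end = self.lines[-1]
            del self.lines[-1] # this _should_ be fast
        else:
            self.end = ''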

def test_block(input):
    for l in BlockFile(open(input)):
        pass

def test_normal(input):
    f = open(input)
    while 1:
        l = f.readline()
        if not l:
            break

def measure(function, *args):
    # wall-clock time of a single call to function(*args)
    import time
    t1 = time.time()
    apply(function, args)
    t2 = time.time()
    return t2-t1

if __name__ == '__main__':
    import sys
    print 'block time', measure(test_block, sys.argv[1])
    print 'normal time', measure(test_normal, sys.argv[1])
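
For anyone reading this with a current Python 3, the same block-reading idea can be sketched as a generator. This is only a rough illustration, not a drop-in replacement for the module above: the name iter_block_lines is hypothetical, and the sample text stands in for a real file. The sketch carries the unfinished tail of each block forward, so a line longer than one block or a final line without a trailing newline should come out whole.

```python
import io

def iter_block_lines(fp, blocksize=40 * 1024, sep="\n"):
    """Yield lines (without `sep`) from `fp`, reading `blocksize` characters at a time."""
    tail = ""  # partial line carried over from the previous block
    while True:
        data = fp.read(blocksize)
        if not data:
            break
        pieces = (tail + data).split(sep)
        # The last piece is either an unfinished line or "" (when the
        # block ended exactly on a separator); hold it back for now.
        tail = pieces.pop()
        for line in pieces:
            yield line
    if tail:
        # final line with no trailing separator
        yield tail

# A deliberately tiny block size so that lines straddle block boundaries.
text = "alpha\nbeta\ngamma-is-a-longer-line\ndelta"
lines = list(iter_block_lines(io.StringIO(text), blocksize=4))
print(lines)
```

On current Python, iterating over the file object itself (for line in fp) is already lazy and buffered, so the generator is mainly useful for seeing how the block-and-carry-over bookkeeping works.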