Python too slow for real world
Skip Montanaro
skip at mojam.com
Fri Apr 23 18:41:19 EDT 1999
Arne Mueller wrote:
> However the problem of reading/writing large files line by
> line is the source of slowing down the whole process.
>
> def rw(input, output):
>     while 1:
>         line = input.readline()
>         if not line: break
>         output.write(line)
>
> f = open('very_large_file','r')
> rw(f, stdout)
>
> The file I read in contains 2053927 lines and it takes 382 sec to
> read/write it where perl does it in 15 sec.
I saw a mention of using readlines() with a size hint argument to get the
benefits of large reads without requiring that you read the entire file
into memory at once. Here's a concrete example. I use this idiom
(while loop over readlines() and a nested for loop processing each line)
all the time for processing large files that I don't need to have in
memory all at once.
The input file, /tmp/words2, was generated from /usr/dict/words:
    sed -e 's/\(.*\)/\1 \1 \1 \1 \1/' < /usr/dict/words > /tmp/words
    cat /tmp/words /tmp/words /tmp/words /tmp/words /tmp/words > /tmp/words2
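For anyone without sed handy, the same kind of test file could be built in Python. This is only a sketch: the function name build_test_file and its per_line/copies parameters are made up for illustration, and the usage part runs on a tiny stand-in word list rather than /usr/dict/words.

```python
import os
import tempfile

def build_test_file(words_path, out_path, per_line=5, copies=5):
    """Repeat each source line per_line times, then write the result copies times."""
    with open(words_path, "r") as src:
        # Mirrors the sed step: "word" becomes "word word word word word"
        expanded = [" ".join([line.rstrip("\n")] * per_line) + "\n" for line in src]
    with open(out_path, "w") as out:
        # Mirrors the cat step: concatenate the expanded file with itself
        for _ in range(copies):
            out.writelines(expanded)
    return len(expanded) * copies

# Usage with a tiny stand-in word list (hypothetical data)
workdir = tempfile.mkdtemp()
words_path = os.path.join(workdir, "words")
out_path = os.path.join(workdir, "words2")
with open(words_path, "w") as f:
    f.write("apple\nbanana\n")

line_count = build_test_file(words_path, out_path)
with open(out_path) as f:
    first_line = f.readline()
```

The list of expanded lines is held in memory here, which is fine for a word list of this size but would not suit a much larger source file.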
At 10.2MB and 227k lines, it's not as big as your input file, but it's
still big enough to measure differences. The script below prints the
elapsed times in seconds (on the second of two runs, to make sure the
file is in memory):
    68.9596179724
    7.96663999557
suggesting about an 8.7x speedup between your original function and my
readlines version. It's still not going to be as fast as Perl, but it's
probably close enough that some other bottleneck will pop up now...
    import time

    def rw(input, output):
        # Copy one line at a time.
        while 1:
            line = input.readline()
            if not line: break
            output.write(line)

    f = open('/tmp/words2','r')
    devnull = open('/dev/null','w')
    t = time.time()
    rw(f, devnull)
    print(time.time() - t)

    def rw2(input, output):
        # Copy in batches of whole lines, roughly 100000 bytes per read.
        lines = input.readlines(100000)
        while lines:
            output.writelines(lines)
            lines = input.readlines(100000)

    f = open('/tmp/words2','r')
    t = time.time()
    rw2(f, devnull)
    print(time.time() - t)
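The timing script only copies lines, so it never shows the nested for loop mentioned earlier. A minimal sketch of that idiom for per-line processing might look like the following. The names process_in_chunks and handle_line are hypothetical, and an in-memory io.StringIO stands in for a large file so the example is self-contained.

```python
import io

def process_in_chunks(infile, handle_line, sizehint=100000):
    """Call handle_line on every line, reading roughly sizehint bytes per batch."""
    count = 0
    lines = infile.readlines(sizehint)
    while lines:
        # Inner loop: per-line work on the batch currently in memory
        for line in lines:
            handle_line(line)
            count += 1
        lines = infile.readlines(sizehint)
    return count

# Usage with an in-memory stand-in for a large file (hypothetical data)
sample = io.StringIO("alpha\nbeta\ngamma\n")
seen = []
total = process_in_chunks(sample, seen.append, sizehint=8)
```

Only one batch of lines is held in memory at a time, so the size hint trades memory use against the number of read calls.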
Cheers,
--
Skip Montanaro | Mojam: "Uniting the World of Music"
http://www.mojam.com/
skip at mojam.com | Musi-Cal: http://www.musi-cal.com/
518-372-5583