How to count lines in a text file?

Alex Martelli aleaxit at yahoo.com
Wed Sep 22 15:17:01 EDT 2004


Bengt Richter <bokr at oz.net> wrote:
   ...
> >memory at once.  If you must be able to deal with humongous files, too
> >big to fit in memory at once, try something like:
> >
> >numlines = 0
> >for line in open('text.txt'): numlines += 1
> 
> I don't have 2.4

2.4a3 is freely available for download and everybody's _encouraged_ to
download it and try it out -- come on, don't be the last one to!-)

> but how would that compare with a generator expression like (untested)
> 
>     sum(1 for line in open('text.txt'))
> 
> or, if you _are_ willing to read in the whole file,
> 
>     open('text.txt').read().count('\n')

I'm not on the same machine as when I ran the other timing measurements
(including pyrex &c), but here are the results on this machine:

$ wc /usr/share/dict/words
  234937  234937 2486825 /usr/share/dict/words
$ python2.4 ~/cb/timeit.py "numlines=0
for line in file('/usr/share/dict/words'): numlines+=1"
10 loops, best of 3: 3.08e+05 usec per loop
$ python2.4 ~/cb/timeit.py "file('/usr/share/dict/words').read().count('\n')"
10 loops, best of 3: 2.72e+05 usec per loop
$ python2.4 ~/cb/timeit.py "len(file('/usr/share/dict/words').readlines())"
10 loops, best of 3: 3.25e+05 usec per loop
$ python2.4 ~/cb/timeit.py "sum(1 for line in file('/usr/share/dict/words'))"
10 loops, best of 3: 4.42e+05 usec per loop
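
For anyone who wants to repeat the comparison without the command-line
script, here is a rough sketch using the timeit module directly.  It is not
what produced the numbers above: it assumes a current Python 3 (so open()
replaces file()), and the temporary sample file, the `statements` dict and
the `results` dict are made up for illustration.  Absolute timings will of
course differ by machine and version.

```python
import os
import tempfile
import timeit

# Hypothetical sample file standing in for /usr/share/dict/words.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("word\n" * 50000)
    path = tmp.name

# The four pure-Python statements timed above, with open() replacing file().
statements = {
    "plain loop": "numlines = 0\nfor line in open(path): numlines += 1",
    "read().count": "open(path).read().count('\\n')",
    "len(readlines())": "len(open(path).readlines())",
    "sum(1 for ...)": "sum(1 for line in open(path))",
}

results = {}
for label, stmt in statements.items():
    # 10 loops, best of 3, mirroring the report format of the runs above.
    best = min(timeit.repeat(stmt, globals={"path": path}, number=10, repeat=3))
    results[label] = best / 10 * 1e6  # convert seconds to usec per loop
    print("%-18s %10.0f usec per loop" % (label, results[label]))

os.remove(path)
```

Taking the minimum of the repeats follows the usual timeit practice: the
fastest run is the one least disturbed by whatever else the machine was doing.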

Last but not least...:

$ python2.4 ~/cb/timeit.py -s'import cou'
"cou.cou(file('/usr/share/dict/words'))"
10 loops, best of 3: 2.05e+05 usec per loop

where cou.pyx is the pyrex program I've already shown on the other
subthread.  Using the count.c I've also shown takes 2.03e+05 usec.
(Can't try psyco here, not an intel-like cpu).


Summary: "sum(1 for ...)" is no speed demon; the plain loop is best
among the pure-python approaches for files that can't fit in memory.  If
the file DOES fit in memory, read().count('\n') is faster, but
len(...readlines()) is slower.  Pyrex rocks, essentially removing the
need for C-coded extensions (less than a 1% advantage) -- and so does
psyco, but not if you're using a Mac (quick, somebody gift Armin Rigo
with a Mac before it's too late...!!!).
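
To make the four pure-Python approaches concrete, here is a minimal sketch
that wraps each in a function.  The function names and the tiny demo file
are hypothetical, and the code assumes a current Python 3 rather than 2.4.
One thing it shows that the timings don't: read().count('\n') counts
newline characters, so it comes out one short when the last line has no
trailing newline.

```python
import os
import tempfile

def count_loop(path):
    """Plain loop: holds only one line in memory at a time."""
    numlines = 0
    with open(path) as f:
        for line in f:
            numlines += 1
    return numlines

def count_genexp(path):
    """Generator expression: also streams the file line by line."""
    with open(path) as f:
        return sum(1 for line in f)

def count_read(path):
    """Reads the whole file at once and counts newline characters."""
    with open(path) as f:
        return f.read().count("\n")

def count_readlines(path):
    """Reads the whole file at once into a list of lines."""
    with open(path) as f:
        return len(f.readlines())

# Hypothetical demo file: three lines, the last without a trailing newline.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma")
    demo_path = tmp.name

counts = {fn.__name__: fn(demo_path)
          for fn in (count_loop, count_genexp, count_read, count_readlines)}
print(counts)
# {'count_loop': 3, 'count_genexp': 3, 'count_read': 2, 'count_readlines': 3}

os.remove(demo_path)
```

If the files always end with a newline, as /usr/share/dict/words does
according to the wc output above, all four agree.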


Alex


