[Tutor] Re: What is the best way to count the number of lines in a huge file?
Christopher Smith
csmith@blakeschool.org
Sat, 08 Sep 2001 12:21:38 -0500
I just got chewed out by Ignacio for writing
more=f.read(stuff)
while more:
    process more
    more=f.read(stuff)
and it was suggested that I write
# suggestion 1
more=None
while not more=='':
    process more
    more=f.read(stuff)
I consider myself wounded by a friend :-) How to handle
this construct in the "one right way" has long bothered me, and
it bothered me again this morning as I struggled with what to
send in, Ignacio. I thought of two other approaches:
# suggestion 2
while 1:
    more=f.read(stuff)
    if more=='':
        break
    process more
# suggestion 2b
while 1:
    more=f.read(stuff)
    if more!='':
        process more
    else:
        break
and
# suggestion 3
more='go'
while more!='':
    more=f.read(stuff)
    if more!='':
        process more
I now consider #2 to be the best. In #1 and #3 you set a flag that
must be something not equal to the terminating value. Although in
both cases you can clearly see that the loop will start, there is
a chance of making a mistake in the initialization. In addition, #3
is a bit repugnant in that the same test appears twice (and, given
what got me chewed out in the first place, I think this double test
is prone to error).
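To make the initialization risk concrete, here is a minimal sketch that is not from the original exchange. The names read_with_flag and initial are hypothetical, io.StringIO stands in for a real file, and collecting chunks in a list stands in for "process more". It follows the shape of suggestion 3, so it also shows the repeated test.

```python
import io

def read_with_flag(f, size, initial):
    """Flag-guarded loop in the style of suggestion 3 (hypothetical helper)."""
    chunks = []
    more = initial              # must differ from the terminating value ''
    while more != '':
        more = f.read(size)
        if more != '':          # the same test, written a second time
            chunks.append(more) # stands in for "process more"
    return chunks

# A sensible flag value: the loop starts and reads everything.
good = read_with_flag(io.StringIO("abcdef"), 4, 'go')

# A mistaken flag value equal to the terminator: the loop never starts,
# and nothing is read, with no error raised.
bad = read_with_flag(io.StringIO("abcdef"), 4, '')
```

With the mistaken initial value the function quietly returns an empty list, which is the kind of slip the flag-based forms invite.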
I prefer proposal #2 (and wouldn't mind comments on it). Here's what it
has going for it:
-it's obvious the loop will start
-it's soon obvious what will stop the loop
-the stop condition and the request for more data to process
each occur only once in the loop
-it's better than 2b b/c in 2b the "else" branch is too far
away from the test (if the process code is long)
-it avoids an "else" part, which reduces the amount
of indentation that must be done.
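The points above can be seen in a small runnable sketch of proposal #2. This is an illustration under assumptions, not part of the original suggestion: read_chunks is a hypothetical name, io.StringIO stands in for a real file, and appending to a list stands in for "process more".

```python
import io

def read_chunks(f, size):
    """Collect fixed-size chunks using the 'while 1 ... break' form (suggestion 2)."""
    chunks = []
    while 1:                    # the loop obviously starts
        more = f.read(size)     # the only request for more data
        if more == '':          # the only stop test, right after the read
            break
        chunks.append(more)     # stands in for "process more"
    return chunks

text = "one\ntwo\nthree\n"
chunks = read_chunks(io.StringIO(text), 5)
```

The read and the stop test each appear once, and the processing code sits at a single level of indentation inside the loop.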
So...here is the updated lineCount incorporating this change and the
suggestion that whether or not to count the trailing line is specified
as an input option rather than being returned as count information.
Thanks for the constructive criticism :-)
/c
####
def lineCount(f,sep,countTrailer=1):
    '''Return the number of occurrences of sep in f.
    By default this routine assumes that sep indicates the end of a line
    and that if the last line doesn't end in sep it should be counted anyway.
    To get a strict count of the occurrences of sep, send a 0 for the 3rd
    argument.'''
    #
    # Notes:
    # whatever is passed in as f must have a 'read' method
    # that returns '' when there is no more to read.
    # make chunksize as large as you can on your system
    #
    chunksize=262144
    sepl=len(sep)
    last=''
    count=0
    while 1:
        more=f.read(chunksize)
        if more=='':
            break
        chunk=last+more
        count+=chunk.count(sep)
        last=chunk[-sepl:] #there might be a split sep in here
        if last==sep: #nope, just a whole one that we already counted
            last=''
    if last!='' and countTrailer==1:
        count+=1
    return count
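
With a fixed chunk size of 262144, the split-separator branch is hard to exercise on small inputs. The sketch below is a variant for experimenting, not a replacement for the routine above: the name line_count_chunked and its chunksize argument are assumptions added here, and io.StringIO stands in for a real file. A chunk size of 2 forces the two-character separator "\r\n" to straddle read boundaries.

```python
import io

def line_count_chunked(f, sep, count_trailer=1, chunksize=262144):
    """Same logic as lineCount, with a hypothetical chunksize argument."""
    sepl = len(sep)
    last = ''
    count = 0
    while 1:
        more = f.read(chunksize)
        if more == '':
            break
        chunk = last + more
        count += chunk.count(sep)
        last = chunk[-sepl:]    # may hold the start of a split separator
        if last == sep:         # a whole separator, already counted
            last = ''
    if last != '' and count_trailer == 1:
        count += 1              # unterminated final line
    return count

text = "a\r\nbb\r\nccc"         # two separators, no trailing separator

# Reads of 2 characters split each "\r\n" across two chunks.
with_trailer = line_count_chunked(io.StringIO(text), "\r\n", chunksize=2)
strict = line_count_chunked(io.StringIO(text), "\r\n", 0, chunksize=2)
```

Both results should match what you get by counting separators in the whole string at once, with one more added for the unterminated last line when the trailer is counted.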