[Tutor] Re: What is the best way to count the number of lines in a huge file?

Christopher Smith csmith@blakeschool.org
Sat, 08 Sep 2001 09:36:26 -0500


>On Thu, 6 Sep 2001, dman wrote:
>
>> On Thu, Sep 06, 2001 at 09:07:08AM -0400, Ignacio Vazquez-Abrams wrote:
>> | Fair enough:
>> |
>> | ---
>> | a=None
>> | n=0
>> | while a != '':
>> |   a=file.read(262144)
>> |   n+=a.count(os.linesep)
>> | ---
>>
>> The only problem with this is it only (truly properly) counts the
>> lines of files that were created with the same OS as the one
>> counting.
>>
>> You'd probably want to use a regex to search for "\r\n|\r|\n", but it
>> all depends on the source(s) of the files you want to count.
>>
>> Make your script "good enough", not "perfect according to some misty
>> definition".  :-).
>>
>> -D
>
>The problem with using that RE (and in fact with using the given code on
>Windows) is that the \r\n may straddle a 256k boundary, losing one line
>in the code, and gaining one with the RE.
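
To make the straddling problem in the quoted text concrete, here is a small sketch in present-day Python 3. The names naive_count and regex_count are made up for this illustration, and a chunk size of 3 stands in for the 256k reads so that the split is easy to see. Neither function tries to handle a separator split across two reads, which is the point.

```python
import io
import re

def naive_count(f, sep, chunksize):
    # Count sep chunk by chunk, ignoring separators split across chunks.
    n = 0
    chunk = f.read(chunksize)
    while chunk:
        n += chunk.count(sep)
        chunk = f.read(chunksize)
    return n

def regex_count(f, chunksize):
    # Count CRLF, CR or LF per chunk, again ignoring chunk boundaries.
    pattern = re.compile("\r\n|\r|\n")
    n = 0
    chunk = f.read(chunksize)
    while chunk:
        n += len(pattern.findall(chunk))
        chunk = f.read(chunksize)
    return n

data = "ab\r\ncd\r\n"  # two lines with Windows-style endings

# With chunks of 3 characters the reads are "ab\r", "\ncd", "\r\n",
# so the first "\r\n" is split across two chunks.
lost = naive_count(io.StringIO(data), "\r\n", 3)
gained = regex_count(io.StringIO(data), 3)
```

Here lost comes out one below the true count of 2, because the split "\r\n" is never seen whole, and gained comes out one above it, because the "\r" and the "\n" of the split pair are matched separately.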


Fair enough again...how about this to count the number of occurrences of
separators (or anything else) in a file:

def countInFile(f,sep):
	'''Count the occurrences of sep in f and return a tuple (count,x),
	where x is 1 if the data read from f did not end with sep.  This
	might be useful if you are counting line endings in f and the last
	line of f does not have an explicit sep at the end; in this case the
	calling script should add x to the count.'''
	#
	# Notes:
	#  whatever is passed in as f must have a 'read' method
	#   that returns '' when there is no more to read.
	#  make chunksize as large as you can on your system
	#
	chunksize=262144
	sepl=len(sep)
	last=''
	count=0
	more=f.read(chunksize)
	while more:
		chunk=last+more
		count+=chunk.count(sep)
		last=chunk[-sepl:] #there might be a split sep in here
		if last==sep:      #nope, just a whole one that we already counted
			last=''
		more=f.read(chunksize)
	
	if last!='':       #the data did not end with a whole sep
		x=1
	else:
		x=0
	
	return count,x
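
As a rough check of the function above, here is the same logic restated for Python 3 and run on an in-memory file. The name count_in_file and the chunksize parameter are additions for this sketch only (the parameter just makes it possible to force a separator across a read boundary), and starting last from sep[:0] is an assumption made so that the same code accepts both str and bytes.

```python
import io

def count_in_file(f, sep, chunksize=262144):
    # Python 3 restatement of countInFile; chunksize is a parameter here
    # (an addition for this sketch) so the boundary case can be forced.
    sepl = len(sep)
    empty = sep[:0]            # '' or b'', matching the type of sep
    last = empty
    count = 0
    more = f.read(chunksize)
    while more:
        chunk = last + more
        count += chunk.count(sep)
        last = chunk[-sepl:]   # may hold the first part of a split sep
        if last == sep:        # a whole sep that was already counted
            last = empty
        more = f.read(chunksize)
    x = 1 if last else 0       # 1 when the data did not end with sep
    return count, x

data = "ab\r\ncd\r\nef"        # three lines, the last one unterminated

# Reads of 3 characters split the first "\r\n" across two chunks.
count, x = count_in_file(io.StringIO(data), "\r\n", chunksize=3)
lines = count + x
```

With this input count is 2 and x is 1, so the caller gets 3 lines after adding x. A separator that can overlap itself (say "aa") is not handled carefully by this carry-over scheme; for ordinary line endings that case does not arise.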

/c