number of different lines in a file

Tim Chase python.list at
Thu May 18 18:32:09 EDT 2006

> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.

A few ideas:

1)  the shell way:

bash$ sort | uniq | wc -l

This doesn't strip whitespace. A little sed magic would
strip it off for you:

bash$ sed 'regexp' | sort | uniq | wc -l

where 'regexp' is something like this atrocity


(If your sed supports "\s" and "\S" for "whitespace" and 
"nonwhitespace", it makes the expression a lot less hairy:


and, IMHO, a little easier to read.  There might be a 
nice/concise perl one-liner for this too)

2)  use a python set:

	s = set()
	for line in open(""):
		s.add(line.strip())
	return len(s)

3)  compact #2:

return len(set([line.strip() for line in open("")]))

or, if stripping the lines isn't a concern, it can just be

return len(set(open("")))

The set itself takes care of ensuring that no duplicates
get stored.
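For reference, here is a minimal sketch of #2 wrapped in a function so the "return" has somewhere to live. The name count_distinct_lines and its path/strip parameters are made up for illustration, and it assumes a current Python where open() is the way to read a file:

```python
def count_distinct_lines(path, strip=True):
    """Return the number of distinct lines in the file at `path`."""
    seen = set()
    with open(path) as handle:
        for line in handle:
            # strip() drops leading/trailing whitespace, newline included;
            # with strip=False the raw line (newline and all) is compared
            seen.add(line.strip() if strip else line)
    return len(seen)
```

Iterating over the file object reads one line at a time, so only the set of distinct lines (not the whole file) has to fit in memory.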

Depending on how many results you *expect*, this could 
become cumbersome, as you have to have every unique line in 
memory.  A stream-oriented solution can be kinder on system 
resources, but would require that the input be sorted first.
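
A rough sketch of the stream-oriented idea, assuming the input has already been sorted (e.g. by sort(1)) so that duplicates sit next to each other. The function name count_distinct_sorted is made up for illustration. Note that if whitespace matters, it has to be stripped *before* sorting, since stripping afterwards can leave equal lines non-adjacent:

```python
def count_distinct_sorted(lines):
    """Count distinct lines in an iterable that is already sorted."""
    count = 0
    previous = None
    for line in lines:
        # In sorted input, equal lines are adjacent, so comparing against
        # the previous line is enough; nothing else is kept in memory.
        if line != previous:
            count += 1
            previous = line
    return count
```

Passing sys.stdin as the iterable would let this sit at the end of a shell pipeline after sort, holding only one line at a time.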

