number of different lines in a file

Tim Chase python.list at tim.thechases.com
Thu May 18 18:32:09 EDT 2006


> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.

A few ideas:

1)  the shell way:

bash$ sort file.in | uniq | wc -l
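
(If your sort supports the standard -u flag, "sort -u 
file.in | wc -l" should do the same with one fewer process.)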

Neither of those strips whitespace, though...a little sed 
magic would strip it off for you:

bash$ sed 'regexp' file.in | sort | uniq | wc -l

where 'regexp' is something like this atrocity:

's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*$/\1/'

(If your sed supports "\s" and "\S" for "whitespace" and 
"nonwhitespace", it makes the expression a lot less hairy:

's/^\s*\(\(\s*\S\S*\)*\)\s*$/\1/'

and, IMHO, a little easier to read.  There might be a 
nice/concise perl one-liner for this too)

2)  use a python set:

	def count_distinct(fname):
		s = set()
		for line in open(fname):
			# strip() so lines differing only in
			# whitespace collapse to one entry
			s.add(line.strip())
		return len(s)
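
which you might then call as

	print(count_distinct("file.in"))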

3)  compact #2, squeezing the function body into one line:

return len(set([line.strip() for line in open("file.in")]))

or, if stripping the lines isn't a concern, it can just be

return len(set(open("file.in")))

The set takes care of ensuring that no duplicates get 
stored, so its length is exactly the number of distinct lines.
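
A toy session in the interpreter (three made-up lines, two 
of them equal once stripped) shows the idea:

	>>> s = set()
	>>> for line in ["foo\n", "  foo\n", "bar\n"]:
	...     s.add(line.strip())
	...
	>>> len(s)
	2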

Depending on how many distinct lines you *expect*, this 
could become cumbersome, as you have to hold every unique 
line in memory...at a million 100-character lines, that's 
on the order of 100 megabytes of raw text in the worst 
case, plus Python's per-string overhead.  A stream-oriented 
solution can be kinder on system resources, but would 
require that the input be sorted first.
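
A minimal sketch of that stream-oriented approach, assuming 
the input was sorted beforehand (say, "sort file.in > 
file.sorted"; the names here are just illustrative).  Note 
that any whitespace stripping has to happen before the 
sort, or strip-equal lines may not end up adjacent:

	def count_distinct_sorted(fname):
		count = 0
		prev = None
		for line in open(fname):
			# sorted input puts duplicates on adjacent
			# lines, so remembering just the previous
			# line is enough
			if line != prev:
				count += 1
				prev = line
		return count

	print(count_distinct_sorted("file.sorted"))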

-tkc
