number of different lines in a file
Tim Chase
python.list at tim.thechases.com
Thu May 18 18:32:09 EDT 2006
> I have a million-line text file with 100 characters per line,
> and simply need to determine how many of the lines are distinct.
A few ideas:
1) the shell way:
bash$ sort file.in | uniq | wc -l
This doesn't strip whitespace, though...a little sed magic
will strip it off for you:
bash$ sed 'regexp' file.in | sort | uniq | wc -l
where 'regexp' is something like this atrocity:
's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*$/\1/'
(If your sed supports "\s" and "\S" for "whitespace" and
"nonwhitespace", it makes the expression a lot less hairy:
's/^\s*\(\(\s*\S\S*\)*\)\s*$/\1/'
and, IMHO, a little easier to read. There might be a
nice/concise perl one-liner for this too)
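(Likewise, if your sort supports the -u flag, as GNU sort
does, the uniq step can be folded into the sort itself:

bash$ sed 'regexp' file.in | sort -u | wc -l

one less process in the pipeline.)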
2) use a python set:
s = set()
for line in open("file.in"):
    s.add(line.strip())
return len(s)
3) compact #2:
return len(set([line.strip() for line in file("file.in")]))
or, if stripping the lines isn't a concern, it can just be
return len(set(file("file.in")))
The set itself takes care of ensuring that no duplicates get
entered.
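A quick interactive check of that:

>>> len(set(["a", "b", "a"]))
2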
Depending on how many distinct lines you *expect*, this could
become cumbersome, as you have to hold every unique line in
memory. A stream-oriented solution is kinder on system
resources, but requires that the input be sorted first.
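For example, here's a minimal sketch of that stream-oriented
approach, assuming the input has already been stripped (say,
with the sed step from #1) and sorted into "file.sorted" (a
placeholder name). Sorted input puts duplicates on adjacent
lines, so you only ever need to remember the previous line:

count = 0
prev = None
for line in open("file.sorted"):
    if line != prev:    # duplicates are adjacent in sorted input
        count += 1
        prev = line
print count

This holds just one line in memory at a time; the sort does
the heavy lifting, and sort implementations generally spill
to temporary files rather than keeping everything in RAM.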
-tkc