List of Numbers
John Machin
sjmachin at lexicon.net
Sat Apr 5 18:39:40 EST 2003
Simon Faulkner <news at titanic.co.uk> wrote in message news:<1lau8vcctfn4g2pi9k5df3km7qgg5r72sv at 4ax.com>...
> I have a list of about 5000 numbers in a text file - up to 14 digits
> each.
>
> I need to check for duplicates.
>
> What would people suggest as a good method?
>
> Simon
OTTOMH, untested, caveat lector, YMMV, etc etc:
Methods based on sorting:
1. On Unix or equivalent:
sort <thelist.txt | uniq -d
[uniq -d prints one copy of each line that occurs more than once; use uniq -c instead if you want a count in front of every line]
2. suck the data into a spreadsheet program, sort the column (A), in
cell B1 put a formula =if(A1=A2,"***","") [Excel notation], propagate
that down column B
3. seeing you did ask this on c.l.py:
alist = file("thelist.txt").readlines()
alist.sort()
for k in xrange(len(alist)-1):
    if alist[k+1] == alist[k]:
        print "Dup", alist[k+1],
Method based on hashing:
seen = {}
for line in file("thelist.txt"):
    if line in seen:
        print "Dup", line,
    seen[line] = 1
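For anyone reading this with Python 3, where file() and xrange are gone and print is a function, the hashing method might be sketched roughly as below. The name find_dups is made up for illustration, and the comparison is still on the exact text of each line.

```python
def find_dups(lines):
    """Return every line that repeats an earlier line, in order of appearance."""
    seen = set()
    dups = []
    for line in lines:
        if line in seen:
            dups.append(line)
        seen.add(line)
    return dups

# Possible use on the file from the question:
#     with open("thelist.txt") as f:
#         for line in find_dups(f):
#             print("Dup", line, end="")
```

Taking any iterable of lines rather than a file name makes it easy to try on a short list first.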
WARNING: you don't say whether your numbers are real or integer. If
they are real, then only method 2 has any chance of pointing out that
"123", "123.", "1.23E2" and "123.00" are duplicates [if indeed you
regard them as duplicates]. The other methods are based on exact
textual comparison [including, in the Python methods, the "\n"!!]
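To make that concrete, here is a small Python 3 sketch that hashes on the parsed value instead of the raw text. The names key and find_numeric_dups are invented for illustration, and it assumes every line parses with float() and that exact equality of the parsed values is what you mean by "duplicate".

```python
def key(text):
    # float() ignores surrounding whitespace, including the trailing "\n"
    return float(text)

def find_numeric_dups(lines):
    """Return (first spelling, later spelling) for each repeated value."""
    seen = {}   # parsed value -> first spelling seen
    dups = []
    for line in lines:
        k = key(line)
        if k in seen:
            dups.append((seen[k], line.strip()))
        else:
            seen[k] = line.strip()
    return dups
```

Integers of up to 14 digits are well below 2**53, so they survive the trip through float() exactly; if the numbers are known to be integers, int() would do as the key instead.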
So you might want to do a variant of method 3:
tolerance = 0.00001 # suit yourself
alist = []
for line in file("thelist.txt"):
    anum = float(line.strip())
    alist.append(anum)
alist.sort()
for k in xrange(len(alist)-1):
    if abs(alist[k+1] - alist[k]) < tolerance:
        print "Dups", alist[k], alist[k+1]
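A Python 3 rendering of the same idea, written as a function so it can be tried on a short list first. near_dups is a hypothetical name; the sketch assumes the values have already been parsed to floats and, like the code above, compares only neighbours in sorted order.

```python
def near_dups(values, tolerance=0.00001):
    """Return adjacent pairs (in sorted order) closer together than tolerance."""
    ordered = sorted(values)
    pairs = []
    # zip pairs each element with its successor in the sorted list
    for a, b in zip(ordered, ordered[1:]):
        if abs(b - a) < tolerance:
            pairs.append((a, b))
    return pairs
```

Checking neighbours is enough to flag every cluster: if two values lie within the tolerance of each other, so does everything sorted between them. It does not list every close pair, though. Three near-equal values come out as two pairs, not three.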
HTH,
John