List of Numbers

Sat Apr 5 18:39:40 EST 2003

Simon Faulkner <news at titanic.co.uk> wrote in message news:<1lau8vcctfn4g2pi9k5df3km7qgg5r72sv at 4ax.com>...
> I have a list of about 5000 numbers in a text file - up to 14 digits
> each.
> 
> I need to check for duplicates.
> 
> What would people suggest as a good method?
> 
> Simon

OTTOMH, untested, caveat lector, YMMV, etc etc:

Methods based on sorting:

1. On Unix or equivalent:
sort <thelist.txt | uniq -d
[uniq -d prints one copy of each line that occurs more than once]

2. suck the data into a spreadsheet program, sort the column (A), in
cell B1 put a formula =if(A1=A2,"***","") [Excel notation], propagate
that down column B

3. seeing you did ask this on c.l.py:

alist = file("thelist.txt").readlines()
alist.sort()
for k in xrange(len(alist)-1):
   if alist[k+1] == alist[k]:
      print "Dup", alist[k+1],
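
For readers on Python 3, where file(), xrange and the print statement no longer exist, the same sort-and-compare-neighbours idea might be sketched as below. The name find_dups_sorted is hypothetical, and stripping whitespace from each line is an assumption that goes beyond the snippet above; it is there so that a final line without a newline still matches.

```python
def find_dups_sorted(lines):
    """Return each value that occurs more than once, in sorted order."""
    # strip() is an assumption: it removes the "\n" and stray blanks,
    # and blank lines are skipped altogether.
    values = sorted(line.strip() for line in lines if line.strip())
    dups = []
    for prev, cur in zip(values, values[1:]):
        # After sorting, equal values sit next to each other.
        if cur == prev and (not dups or dups[-1] != cur):
            dups.append(cur)  # report each duplicated value once
    return dups
```

It could be called as find_dups_sorted(open("thelist.txt")), since a file object yields its lines one at a time.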

Method based on hashing:

seen = {}
for line in file("thelist.txt"):
   if line in seen:
      print "Dup", line,
   seen[line] = 1
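
A Python 3 sketch of the hashing approach, for comparison. It swaps the hand-rolled dict for collections.Counter (a dict subclass in the standard library) so that it can also report how often each value occurs. The name find_dups_hashed is hypothetical, and stripping whitespace and ignoring blank lines are assumptions not made by the snippet above.

```python
from collections import Counter


def find_dups_hashed(lines):
    """Return {value: count} for every value seen more than once."""
    # Counter hashes each stripped line, so no sorting is needed.
    counts = Counter(line.strip() for line in lines)
    counts.pop("", None)  # ignore blank lines (an assumption)
    return {value: n for value, n in counts.items() if n > 1}
```

Unlike the sorting methods, this makes a single pass over the data; for 5000 lines either approach is effectively instant.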

WARNING: you don't say whether your numbers are real or integer. If
they are real, then only method 2 has any chance of pointing out that
"123", "123.", "1.23E2" and "123.00" are duplicates [if indeed you
regard them as duplicates]. The other methods are based on exact
textual comparison [including, in the Python methods, the "\n"!! -- so
a last line with no trailing newline won't match an earlier copy]
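
To make the warning concrete, here is a small Python 3 sketch using the four spellings mentioned above. The variable names are made up for illustration.

```python
forms = ["123", "123.", "1.23E2", "123.00"]

# Compared as text, all four spellings are different.
distinct_as_text = set(forms)

# Compared as numbers, they collapse to a single value.
distinct_as_numbers = {float(s) for s in forms}

print(len(distinct_as_text), len(distinct_as_numbers))  # prints: 4 1
```

If the numbers turn out to be plain integers, int(line) is exact at any size. Floats also represent integers exactly up to 2**53 (about 9.0e15), so 14-digit integers survive a trip through float() unchanged.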

So you might want to do a variant of method 3:

tolerance = 0.00001 # suit yourself
alist = []
for line in file("thelist.txt"):
   anum = float(line.strip())
   alist.append(anum)
alist.sort()
for k in xrange(len(alist)-1):
   if abs(alist[k+1] - alist[k]) < tolerance:
      print "Dups", alist[k], alist[k+1]
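
The same variant as a Python 3 sketch, returning the near-duplicate pairs instead of printing them. The name find_near_dups is hypothetical, and skipping blank lines is an assumption; float() itself already ignores surrounding whitespace, including the "\n".

```python
def find_near_dups(lines, tolerance=0.00001):
    """Return (a, b) pairs of neighbouring values closer than tolerance."""
    # Sort numerically so near-equal values become neighbours.
    values = sorted(float(line) for line in lines if line.strip())
    return [(a, b) for a, b in zip(values, values[1:])
            if abs(b - a) < tolerance]
```

One thing to keep in mind when picking the tolerance: for 14-digit magnitudes, adjacent doubles are already about 0.002 or more apart, so a tolerance of 0.00001 behaves like an exact comparison at that scale.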

HTH,
John