[Pythonmac-SIG] Fast way to get a list of unique values?
Larry Meyn
lmeyn@mail.arc.nasa.gov
Mon, 9 Sep 2002 14:14:01 -0700
--Apple-Mail-2--388249761
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
charset=US-ASCII;
format=flowed
Here's a function from the Python Cookbook:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560
def unique(s):
"""Return a list of the elements in s, but without duplicates.
For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],
unique("abcabc") some permutation of ["a", "b", "c"], and
unique(([1, 2], [2, 3], [1, 2])) some permutation of
[[2, 3], [1, 2]].
For best speed, all sequence elements should be hashable. Then
unique() will usually work in linear time.
If not possible, the sequence elements should enjoy a total
ordering, and if list(s).sort() doesn't raise TypeError it's
assumed that they do enjoy a total ordering. Then unique() will
usually work in O(N*log2(N)) time.
If that's not possible either, the sequence elements must support
equality-testing. Then unique() will usually work in quadratic
time.
"""
n = len(s)
if n == 0:
return []
# Try using a dict first, as that's the fastest and will usually
# work. If it doesn't work, it will usually fail quickly, so it
# usually doesn't cost much to *try* it. It requires that all the
# sequence elements be hashable, and support equality comparison.
u = {}
try:
for x in s:
u[x] = 1
except TypeError:
del u # move on to the next method
else:
return u.keys()
# We can't hash all the elements. Second fastest is to sort,
# which brings the equal elements together; then duplicates are
# easy to weed out in a single pass.
# NOTE: Python's list.sort() was designed to be efficient in the
# presence of many duplicate elements. This isn't true of all
# sort functions in all languages or libraries, so this approach
# is more effective in Python than it may be elsewhere.
try:
t = list(s)
t.sort()
except TypeError:
del t # move on to the next method
else:
assert n > 0
last = t[0]
lasti = i = 1
while i < n:
if t[i] != last:
t[lasti] = last = t[i]
lasti += 1
i += 1
return t[:lasti]
# Brute force is all that's left.
u = []
for x in s:
if x not in u:
u.append(x)
return u
On Monday, September 9, 2002, at 01:48 PM, Robb Brown wrote:
>
> I have a large array (Numeric) of integers. I'd like to get a list of
> all the unique values in that array. Naturally I'd like to avoid a
> for loop. Is there an easy way to do this in Python?
>
> Thanks,
>
> Robb
> --
> ______________________________
> Robb Brown
> Seaman Family MR Research Centre
> Calgary, Alberta, Canada
>
> _______________________________________________
> Pythonmac-SIG maillist - Pythonmac-SIG@python.org
> http://mail.python.org/mailman/listinfo/pythonmac-sig
>
>
Larry Meyn
Aerospace Operations Modeling Office
M/S 210-10 Phone: (650) 604-5038
NASA Ames Research Center Fax: (650) 604-0222
Moffett Field, CA 94035-1000 E-mail: lmeyn@mail.arc.nasa.gov
E-Fax: (425) 944-5526 sent via e-mail
--Apple-Mail-2--388249761
Content-Transfer-Encoding: 7bit
Content-Type: text/enriched;
charset=US-ASCII
Here's a function from the Python Cookbook:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560
def unique(s):
"""Return a list of the elements in s, but without duplicates.
For example, unique([1,2,3,1,2,3]) is some permutation of [1,2,3],
unique("abcabc") some permutation of ["a", "b", "c"], and
unique(([1, 2], [2, 3], [1, 2])) some permutation of
[[2, 3], [1, 2]].
For best speed, all sequence elements should be hashable. Then
unique() will usually work in linear time.
If not possible, the sequence elements should enjoy a total
ordering, and if list(s).sort() doesn't raise TypeError it's
assumed that they do enjoy a total ordering. Then unique() will
usually work in O(N*log2(N)) time.
If that's not possible either, the sequence elements must support
equality-testing. Then unique() will usually work in quadratic
time.
"""
n = len(s)
if n == 0:
return []
# Try using a dict first, as that's the fastest and will usually
# work. If it doesn't work, it will usually fail quickly, so it
# usually doesn't cost much to *try* it. It requires that all the
# sequence elements be hashable, and support equality comparison.
u = {}
try:
for x in s:
u[x] = 1
except TypeError:
del u # move on to the next method
else:
return u.keys()
# We can't hash all the elements. Second fastest is to sort,
# which brings the equal elements together; then duplicates are
# easy to weed out in a single pass.
# NOTE: Python's list.sort() was designed to be efficient in the
# presence of many duplicate elements. This isn't true of all
# sort functions in all languages or libraries, so this approach
# is more effective in Python than it may be elsewhere.
try:
t = list(s)
t.sort()
except TypeError:
del t # move on to the next method
else:
assert n > 0
last = t[0]
lasti = i = 1
while i << n:
if t[i] != last:
t[lasti] = last = t[i]
lasti += 1
i += 1
return t[:lasti]
# Brute force is all that's left.
u = []
for x in s:
if x not in u:
u.append(x)
return u
On Monday, September 9, 2002, at 01:48 PM, Robb Brown wrote:
<excerpt>
I have a large array (Numeric) of integers. I'd like to get a list of
all the unique values in that array. Naturally I'd like to avoid a
for loop. Is there an easy way to do this in Python?
Thanks,
Robb
--
______________________________
Robb Brown
Seaman Family MR Research Centre
Calgary, Alberta, Canada
_______________________________________________
Pythonmac-SIG maillist - Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
</excerpt>
<fontfamily><param>Courier</param>Larry Meyn
Aerospace Operations Modeling Office
M/S 210-10 Phone: (650) 604-5038
NASA Ames Research Center Fax: (650) 604-0222
Moffett Field, CA 94035-1000 E-mail: lmeyn@mail.arc.nasa.gov
E-Fax: (425) 944-5526 sent via e-mail
</fontfamily>
--Apple-Mail-2--388249761--