Efficiently removing duplicate rows from a 2-dimensional Numeric array

Alex Mont t-alexm at windows.microsoft.com
Fri Jul 20 01:35:12 CEST 2007

I have a 2-dimensional Numeric array with the shape (N,2) and I want to
remove all duplicate rows from the array. For example, if I start out with






I want to end up with





(Order of the rows doesn't matter, although order of the two elements in
each row does.)


The problem is that I can't find any way of doing this that is efficient
with large data sets (in the data set I am using, N > 1,000,000).

The normal method of removing duplicates by putting the elements into a
dictionary and then reading off the keys doesn't work directly because
the keys - rows of Python arrays - aren't hashable.
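
To make the hashability point concrete, here is a minimal sketch in plain
Python. It uses nested lists as a stand-in for the array rows (an assumption;
no Numeric calls are involved), and the names is_hashable, dedupe_rows and
the sample data are made up for illustration. A mutable row cannot be a
dictionary key, but the same row converted to a tuple can.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def dedupe_rows(rows):
    """Drop duplicate rows by using tuple(row) as a dictionary key."""
    seen = {}
    for row in rows:
        # tuple(row) is hashable as long as its elements are hashable.
        seen[tuple(row)] = None
    return list(seen)


# Hypothetical sample data: (1, 2) appears twice, and (2, 1) is a
# different row because element order within a row matters.
sample = [[1, 2], [3, 4], [1, 2], [2, 1]]
unique = dedupe_rows(sample)
```

This still loops once per row in Python, so it only shows why the tuple
conversion is needed; it is not a faster method.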

The best I have been able to do so far is:


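from Numeric import array

def remove_duplicates(x):
    # Use each row, unpacked into a hashable tuple, as a dictionary key.
    d = {}
    for (a, b) in x:
        d[(a, b)] = (a, b)
    # Build the result from the dictionary, not from the input array.
    return array(d.values())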


According to the profiler the loop takes about 7 seconds and the call to
array() 10 seconds with N=1,700,000.


Is there a faster way to do this using Numeric?


-Alex Mont
