<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv=Content-Type content="text/html; charset=us-ascii">


<meta name=Generator content="Microsoft Word 12 (filtered medium)">


<style>


<!--


 /* Font Definitions */


 @font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


 /* Style Definitions */


 p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0in;


        margin-bottom:.0001pt;


        font-size:11.0pt;


        font-family:"Calibri","sans-serif";}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


a:visited, span.MsoHyperlinkFollowed


        {mso-style-priority:99;


        color:purple;


        text-decoration:underline;}


span.EmailStyle17


        {mso-style-type:personal-compose;


        font-family:"Calibri","sans-serif";


        color:windowtext;}


.MsoChpDefault


        {mso-style-type:export-only;}


@page Section1


        {size:8.5in 11.0in;


        margin:1.0in 1.0in 1.0in 1.0in;}


div.Section1


        {page:Section1;}


-->


</style>




</head>


<body lang=EN-US link=blue vlink=purple>


<div class=Section1>


<p class=MsoNormal>I have a 2-dimensional Numeric array with the shape (N,2)

and I want to remove all duplicate rows from the array. For example, if I start

out with:<o:p></o:p></p>


<p class=MsoNormal>[[1,2],<o:p></o:p></p>


<p class=MsoNormal>[1,3],<o:p></o:p></p>


<p class=MsoNormal>[1,2],<o:p></o:p></p>


<p class=MsoNormal>[2,3]]<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>I want to end up with<o:p></o:p></p>


<p class=MsoNormal>[[1,2],<o:p></o:p></p>


<p class=MsoNormal>[1,3],<o:p></o:p></p>


<p class=MsoNormal>[2,3]].<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>(Order of the rows doesn’t matter, although order of


the two elements in each row does.)<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>The problem is that I can’t find any way of doing this


that is efficient with large data sets (in the data set I am using, N > 1000000).<o:p></o:p></p>


<p class=MsoNormal>The normal method of removing duplicates by putting the


elements into a dictionary and then reading off the keys doesn’t work


directly because the keys – rows of Numeric arrays – aren’t


hashable.<o:p></o:p></p>
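<p class=MsoNormal>A minimal sketch of the hashability point, using plain Python lists as stand-ins for the array rows (an assumption for illustration; the helper name is_hashable is hypothetical). It only shows that a row converted to a tuple can act as a dictionary key while the row itself cannot:<o:p></o:p></p>

<pre>
```python
# Plain Python lists stand in for array rows here (an assumption for illustration).
rows = [[1, 2], [1, 3], [1, 2], [2, 3]]


def is_hashable(obj):
    """Return True if obj can be used as a dictionary key (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


row_ok = is_hashable(rows[0])           # a list row cannot be a key
tuple_ok = is_hashable(tuple(rows[0]))  # the same row as a tuple can

# Using tuples as keys collapses the duplicate rows.
unique = {}
for row in rows:
    unique[tuple(row)] = None

# Row order is not significant, so sort only to get a stable result.
unique_rows = sorted(unique.keys())
```
</pre>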


<p class=MsoNormal>The best I have been able to do so far is:<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


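<p class=MsoNormal>from Numeric import array<o:p></o:p></p>

<p class=MsoNormal>def remove_duplicates(x):<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;# Tuples are hashable, so each row can serve as a dictionary key.<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;d = {}<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;for (a,b) in x:<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;d[(a,b)] = (a,b)<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;# Build the result from the dictionary, not from the input array.<o:p></o:p></p>

<p class=MsoNormal>&nbsp;&nbsp;&nbsp;&nbsp;return array(d.values())<o:p></o:p></p>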


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>According to the profiler, the loop takes about 7 seconds and

the call to array() takes 10 seconds with N=1,700,000.<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>Is there a faster way to do this using Numeric?<o:p></o:p></p>


<p class=MsoNormal><o:p> </o:p></p>


<p class=MsoNormal>-Alex Mont<o:p></o:p></p>


</div>


</body>


</html>