[SciPy-User] Removing duplicate cols/rows

eat e.antero.tammi at gmail.com
Mon Dec 19 20:32:42 EST 2011


Hi,

On Tue, Dec 20, 2011 at 2:58 AM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Mon, Dec 19, 2011 at 2:58 PM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
> >
> >
> > On Mon, Dec 19, 2011 at 1:49 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>
> >> On Mon, Dec 19, 2011 at 2:41 PM, eat <e.antero.tammi at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > On Mon, Dec 19, 2011 at 11:59 AM, Sergi Pons Freixes
> >> > <sponsfreixes at gmail.com> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> I'm using a 2D array to store pairs of longitudes+latitudes. At
> >> >> one point, I have to merge two of those 2D arrays, and then remove any
> >> >> duplicate entries. I've been searching for a function similar to
> >> >> numpy.unique, but I've had no luck. Any implementation I've been
> >> >> thinking of looks very "unoptimized". Is there any existing
> >> >> solution, so I do not reinvent the wheel?
> >> >>
> >> >> To make it clear, I'm looking for:
> >> >> >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
> >> >> >>> unique_rows(a)
> >> >> array([[1, 1], [2, 3],[5, 4]])
> >> >
> >> > A dot product with a random vector may do the trick, like:
> >> > In []: a
> >> > Out[]:
> >> > array([[1, 1],
> >> >        [2, 3],
> >> >        [1, 1],
> >> >        [5, 4],
> >> >        [2, 3]])
> >> > In []: unique_index = np.unique(a.dot(np.random.rand(2)), return_index=True)[1]
> >> > In []: a[unique_index]
> >> > Out[]:
> >> > array([[1, 1],
> >> >        [2, 3],
> >> >        [5, 4]])
> >> >
> >> > (and for cols use just transpose of a)
> >> >
> >> >
> >> > My 2 cents,
> >> > eat
> >> >>
> >> >>
> >> >> BTW, I wanted to use just a list of tuples for it, but the lists were
> >> >> so big that they consumed my 4Gb RAM + 4Gb swap (numpy arrays are more
> >> >> memory efficient).
> >> >>
> >> >> Regards,
> >> >> Sergi
> >> >> _______________________________________________
> >> >> SciPy-User mailing list
> >> >> SciPy-User at scipy.org
> >> >> http://mail.scipy.org/mailman/listinfo/scipy-user
> >> >
> >> >
> >> >
> >>
> >> I implemented an efficient function for this in pandas:
> >>
> >> In [1]: a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
> >>
> >> In [2]: df = DataFrame(a)
> >>
> >> In [3]: df
> >> Out[3]:
> >>   0  1
> >> 0  1  1
> >> 1  2  3
> >> 2  1  1
> >> 3  5  4
> >> 4  2  3
> >>
> >> In [4]: df.drop_duplicates()
> >> Out[4]:
> >>   0  1
> >> 0  1  1
> >> 1  2  3
> >> 3  5  4
> >>
> >> you can get just the ndarray back by df.drop_duplicates().values
> >>
> >> - Wes
> >
> >
> >
> > Or...
> >
> > In [44]: x
> > Out[44]:
> > array([[3, 3],
> >        [3, 2],
> >        [2, 1],
> >        [3, 3],
> >        [1, 2],
> >        [3, 1],
> >        [1, 3],
> >        [1, 1],
> >        [2, 3],
> >        [3, 2],
> >        [1, 1],
> >        [3, 3],
> >        [1, 1],
> >        [3, 2],
> >        [3, 2]])
> >
> > In [45]: u = unique(x.view(dtype=dtype([('a',x.dtype),('b',x.dtype)]))).view(x.dtype).reshape(-1,2)
> >
> > In [46]: u
> > Out[46]:
> > array([[1, 1],
> >        [1, 2],
> >        [1, 3],
> >        [2, 1],
> >        [2, 3],
> >        [3, 1],
> >        [3, 2],
> >        [3, 3]])
> >
> >
> > The 'one-liner' above converts x to a 1D structured array with two fields,
> > then applies numpy.unique to the 1D array, and then converts that result
> > back to a 2D array.
> >
> > Warren
> >
> >
>
Hi,

> That is cool. I found it interesting that np.unique is really slow on
> record arrays (the DataFrame method, dict-based under the hood, is
> about 5x faster). Is it doing tuple comparison?
>
np.unique seems to be quite slow indeed. Also, the number of columns seems to
need to be hardcoded.
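
The field list does not actually have to be typed by hand; it can be built
from the array's shape. A minimal sketch, assuming a 2D array with a single
plain dtype (unique_rows_sorted is a hypothetical helper name, not a NumPy
function):

```python
import numpy as np

def unique_rows_sorted(a):
    """Return the unique rows of a 2D array, in sorted order."""
    # The structured view needs each row to be contiguous in memory.
    a = np.ascontiguousarray(a)
    ncols = a.shape[1]
    # One field per column, generated instead of hardcoded.
    row_dtype = np.dtype([("f%d" % i, a.dtype) for i in range(ncols)])
    # Each row becomes a single structured item in a 1D array.
    rows = a.view(row_dtype).ravel()
    # np.unique works on the 1D array; convert the result back to 2D.
    return np.unique(rows).view(a.dtype).reshape(-1, ncols)

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
print(unique_rows_sorted(a))  # rows come back sorted: [[1, 1], [2, 3], [5, 4]]
```

This is the same structured-array idea as the one-liner quoted above, only
with the fields generated from a.shape[1]. Newer NumPy releases (1.13 and
later) also accept np.unique(a, axis=0), which removes duplicate rows
without the manual view.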

A slightly off-topic issue is that it doesn't even preserve the order of
'first occurrences' of the duplicate rows. Does your dict-based
implementation respect this requirement?
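
On the NumPy side, the first-occurrence order can probably be recovered by
asking np.unique for return_index and sorting those indices before indexing
the original array. A rough sketch under the same assumptions as the
structured-array approach (unique_rows_stable is a hypothetical name, used
here only for illustration):

```python
import numpy as np

def unique_rows_stable(a):
    """Return the unique rows of a 2D array in order of first occurrence."""
    a = np.ascontiguousarray(a)
    ncols = a.shape[1]
    # View every row as one structured item so np.unique compares whole rows.
    row_dtype = np.dtype([("f%d" % i, a.dtype) for i in range(ncols)])
    rows = a.view(row_dtype).ravel()
    # return_index gives the index of the first occurrence of each unique row.
    _, first_idx = np.unique(rows, return_index=True)
    # Sorting the indices restores the original order of appearance.
    return a[np.sort(first_idx)]

x = np.array([[3, 3], [3, 2], [2, 1], [3, 3], [1, 2], [3, 2]])
print(unique_rows_stable(x))  # [[3, 3], [3, 2], [2, 1], [1, 2]]
```

The extra np.sort only touches the index array, so it should add little on
top of the cost of np.unique itself.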


Regards,
eat
