pandas dataframe, find duplicates and add suffix

Pavol Lisy pavol.lisy at gmail.com
Thu Mar 30 08:23:03 EDT 2017


On 3/28/17, zljubisic at gmail.com <zljubisic at gmail.com> wrote:
> In dataframe
>
> import pandas as pd
>
> data = {'model': ['first', 'first', 'second', 'second', 'second', 'third',
> 'third'],
>         'dtime': ['2017-01-01_112233', '2017-01-01_112234',
> '2017-01-01_112234', '2017-01-01_112234', '2017-01-01_112234',
> '2017-01-01_112235', '2017-01-01_112235'],
>         }
> df = pd.DataFrame(data, index = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg',
> 'e.jpg', 'f.jpg', 'g.jpg'], columns=['model', 'dtime'])
>
> print(df.head(10))
>
>         model              dtime
> a.jpg   first  2017-01-01_112233
> b.jpg   first  2017-01-01_112234
> c.jpg  second  2017-01-01_112234
> d.jpg  second  2017-01-01_112234
> e.jpg  second  2017-01-01_112234
> f.jpg   third  2017-01-01_112235
> g.jpg   third  2017-01-01_112235
>
> within model, there are duplicate dtime values.
> For example, rows d and e are duplicates of the c row.
> Row g is duplicate of the f row.
>
> For each duplicate (within model) I would like to add suffix (starting from
> 1) to the dtime value. Something like this:
>
>         model              dtime
> a.jpg   first  2017-01-01_112233
> b.jpg   first  2017-01-01_112234
> c.jpg  second  2017-01-01_112234
> d.jpg  second  2017-01-01_112234-1
> e.jpg  second  2017-01-01_112234-2
> f.jpg   third  2017-01-01_112235
> g.jpg   third  2017-01-01_112235-1
>
> How to do that?
> --
> https://mail.python.org/mailman/listinfo/python-list
>

I am not expert, just played a little...

This one could work:

    gb = df.groupby([df.model, df.dtime])
    df.dtime = df.dtime + gb.cumcount().apply(lambda a:str(-a) if a else '')

this one is probably more readable:
    df.dtime = df.dtime + [str(-a) if a else '' for a in gb.cumcount()]

I don't know which one is better in memory consumption and/or speed.

This small dataframe gave me:

%timeit -r 5 df.dtime + gb.cumcount().apply(lambda a:str(-a) if a else '')
1000 loops, best of 5: 387 µs per loop

%timeit -r 5 df.dtime + [str(-a) if a else '' for a in gb.cumcount()]
1000 loops, best of 5: 324 µs per loop

PL.


More information about the Python-list mailing list