[Tutor] Lists of duplicates

Thu Mar 9 03:55:56 EST 2017

Alan Gauld via Tutor wrote:

> On 08/03/17 19:56, Sri Kavi wrote:
> 
>> It’s about making a function that returns a list of lists, with each list
>> being all of the elements that are the same as another element in the
>> original list.
> 
> This is one of those problems where there is probably no
> definitive "correct" answer just different compromises.
> 
> My first reaction was to sort the initial list then iterate
> over it creating a new sublist for each change of value.
> But, it only works for homogenous lists. For mixed types
> I'd probably opt for a dictionary:
> 
> def f(lst):
>     res = {}
>     for item in lst:
> res.setdefault(item,[]).append(item)
>     return list(res.values())
> 
> But I've no idea how that performs compared to your methods.

Performance is probably comparable to the Counter() solution.

Digression:

There is one important difference that I see as an advantage of Alan's 
approach -- it keeps the actual values (below I use a defaultdict instead of 
the setdefault() method):

>>> from collections import defaultdict
>>> def group(items):
...     groups = defaultdict(list)
...     for item in items:
...         groups[item].append(item)
...     return groups
... 
>>> group([1, 2, 3, True, 2.0, 3.0])
defaultdict(<class 'list'>, {1: [1, True], 2: [2, 2.0], 3: [3, 3.0]})

Let's see if we can get output that is a bit more readable:

>>> def show(dd):
...     for k, v in sorted(dd.items()): print(k, "-->", v)
... 
>>> show(group([1, 2, 3, True, 2.0, 3.0]))
1 --> [1, True]
2 --> [2, 2.0]
3 --> [3, 3.0]

Now it is easy to go one step further and adapt the code to group data by an 
arbitrary property of item:

>>> def groupby(items, key=None):
...     if key is None:
...         pairs = ((v, v) for v in items)
...     else:
...         pairs = ((key(v), v) for v in items)
...     groups = defaultdict(list)
...     for k, v in pairs:
...         groups[k].append(v)
...     return groups
... 
>>> show(groupby(["a", "bb", "BB", "CC", "ccc", "bb"]))
BB --> ['BB']
CC --> ['CC']
a --> ['a']
bb --> ['bb', 'bb']
ccc --> ['ccc']
>>> show(groupby(["a", "bb", "BB", "CC", "ccc", "bb"], key=str.lower))
a --> ['a']
bb --> ['bb', 'BB', 'bb']
cc --> ['CC']
ccc --> ['ccc']
>>> show(groupby(["a", "bb", "BB", "CC", "ccc", "bb"], key=len))
1 --> ['a']
2 --> ['bb', 'BB', 'CC', 'bb']
3 --> ['ccc']
>>> show(groupby(["a", "bb", "BB", "CC", "ccc", "bb"], key=lambda s: 
s[:1].lower()))
a --> ['a']
b --> ['bb', 'BB', 'bb']
c --> ['CC', 'ccc']

Efficient, flexible -- and you may even sometimes use it for real-world 
problems.

PS: There is a very similar function, itertools.groupby(), in the standard 
library. In return for the restriction that only adjacent values can be in 
the same group it works lazily and uses very little memory.