[Tutor] Lists of duplicates
Steven D'Aprano
steve at pearwood.info
Thu Mar 9 08:54:22 EST 2017
On Thu, Mar 09, 2017 at 01:26:26AM +0530, Sri Kavi wrote:
> I wrote and submitted the following function:
>
> def sublist_duplicates(lst):
>     sublists = []
>     for item in set(lst):
>         sublists.append([item] * lst.count(item))
>     return sublists
>
> lst_of_duplicates = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']
>
> print sublist_duplicates(lst_of_duplicates)
>
> A participant told me that that'll grind to a near-halt if there are many
> different elements, for example range(1_000_000): with 1 million unique
> values, it checks the whole list (length 1 million) for the count, a
> million times, a total of a trillion operations…
Correct. lst.count(item) has to walk the entire list, counting elements.
It does that in C code, so it is fast, but still, if there are a lot of
elements to walk, it takes some time. Then you do it again, and again,
and again...
For a list with N items, your sublist_duplicates function walks the
entire list once per distinct item -- up to N separate walks -- for
roughly N*N element comparisons in total. When N is large, N**2 is
enormous.
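The quadratic cost can be avoided by counting every element in a single
walk of the list. A minimal sketch using a plain dict (this is
essentially what collections.Counter does for you):

```python
def sublist_duplicates(lst):
    # One pass over the list: O(N) total, instead of calling
    # lst.count(item) once per distinct item (O(N) each).
    counts = {}
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    # Build one sublist per distinct item from its count.
    return [[item] * n for item, n in counts.items()]

print(sublist_duplicates([1, 2, 1, 's', 2, 1]))
# [[1, 1, 1], [2, 2], ['s']]  (dicts preserve insertion order in 3.7+)
```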
> I thought he was right. I wrote another version of the function, this time
> using collections.Counter.
>
> from collections import Counter
>
> def sublist_duplicates(lst):
>     counts = Counter(lst)
>     sublists = [[k] * counts[k] for k in counts]
>     return sublists
>
> _lst = list(range(1000000))
[...]
> I found that the first version works better with a small list,
Yes, but for a small list the total time is likely to be so small that
it doesn't really matter.
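You can see this for yourself with timeit. A rough sketch (the function
names with_count and with_counter are my own labels for the two versions
in this thread; exact timings will vary by machine):

```python
from collections import Counter
from timeit import timeit

def with_count(lst):
    # First version: lst.count() once per distinct item, O(N**2).
    return [[item] * lst.count(item) for item in set(lst)]

def with_counter(lst):
    # Second version: a single counting pass, O(N).
    counts = Counter(lst)
    return [[k] * counts[k] for k in counts]

small = [1, 2, 10, 3, 4, 1, 's', 2, 3, 1, 4, 's']
big = list(range(5000))  # all unique: worst case for list.count

for fn in (with_count, with_counter):
    t_small = timeit(lambda: fn(small), number=10000)
    t_big = timeit(lambda: fn(big), number=10)
    print(f"{fn.__name__}: small={t_small:.3f}s  big={t_big:.3f}s")
```

On the small list the constant overhead of building a Counter can make
the first version a little faster; on the big list the N**2 behaviour
dominates and the Counter version wins decisively.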
> but the
> second version outperforms the first one when it’s given a considerably big
> list of data.
>
>
> I’m asking for your comments on these two functions. I’m not even sure if
> this is how it should be coded.
I agree with your analysis.
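For completeness, one more linear-ish approach worth knowing: sort the
list (O(N log N)) and group equal neighbours with itertools.groupby. A
sketch -- note it only works when all elements are mutually comparable,
so it would fail on your mixed int/str example under Python 3:

```python
from itertools import groupby

def sublist_duplicates(lst):
    # sorted() puts equal items next to each other; groupby then
    # collects each run of equal items into one sublist.
    return [list(group) for _key, group in groupby(sorted(lst))]

print(sublist_duplicates([3, 1, 2, 1, 3, 1]))
# [[1, 1, 1], [2], [3, 3]]
```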
--
Steve