[Tutor] Re: Unique Items in Lists

Kent Johnson kent37 at tds.net
Thu Jan 27 22:08:59 CET 2005


Brian van den Broek wrote:
> incorporating some of Wolfram's and Kent's (hope I've missed no one) 
> suggestions:
> 
> <code>
> def dups_in_list_report(a_list):
>     '''Prints a duplication report for a list.'''
> 
>     items_dict = {}
> 
>     for i in a_list:
>         items_dict[i] = items_dict.get(i, 0) + 1
> 
>     for k, v in sorted(items_dict.iteritems()):   # cf below
>         if v > 1:
>             print '%s occurred %s times' %(k, v)
> </code>
> 
> And, I can't but agree that this is much better! Thanks folks.
> 
> In trying to improve the code, I first had:
> 
>     for key in sorted(items_dict.keys()):
>         if items_dict[key] > 1:
>             print '%s occurred %s times' %(key, items_dict[key])
> 
> in place of the for loop over .iteritems(). Am I right in thinking that 
> the advantage of Kent's suggestion of .iteritems() is that it eliminates 
> some of the dict lookups? Other advantages?

iteritems() doesn't create an intermediate list so that should be a slight time and memory savings. 
And my guess is that it is faster because it saves the lookups but if you care you should test. 
Mostly I prefer iteritems() because it seems cleaner and clearer to me.

> 
> Finally, in the first instance, I was aiming for the OP's stated end. To 
> make this more general and reusable, I think I'd do:
> 
> <code>
> def get_list_dup_dict(a_list, threshold=1):
>     '''Returns a dict of items in list that occur threshold many times
> 
>     threshold defaults to 1. The dict returned has items occurring at least
>     threshold many times as keys, and number of occurrences as values.
>     '''
> 
>     items_dict, dup_dict = {}, {}   # Question below
> 
>     for i in a_list:
>         items_dict[i] = items_dict.get(i, 0) + 1
> 
>     for k, v in items_dict.iteritems():
>         if v >= threshold:
>             dup_dict[k] = v    #Question below
> 
>     return dup_dict
> 
> def print_list_dup_report(a_list, threshold=1):
>     '''Prints report of items in a_list occurring at least threshold 
> many times
> 
>     threshold defaults to 1. get_list_dup_dict(a_list, threshold=0) is 
> called.
>     returning a dict of items in list that occur at least threshold many 
> times
>     as keys and their number of repetitions as values.
> 
>     This dict is looped over to print a sorted and formatted duplication
>     report.
>     '''
> 
>     dup_dict = get_list_dup_dict(a_list, threshold)
>     for k, v in sorted(dup_dict.iteritems()):
>         print '%s occurred %s times' %(k, v)
> </code>
> 
> 
> My question (from comment in new code):
> 
> Since I've split the task into two functions, one to return a 
> duplication dictionary, the other to print a report based on it, I think 
> the distinct dup_dict is needed. (I do want the get_list_dup_dict 
> function to return a dict for possible use in other contexts.)

You could have a function that gets all of the counts, and then filter on threshold when you print.
> 
> The alternative would be to iterate over a .copy() of the items_dict and 
> delete items not meeting the threshold from items_dict, returning the 
> pruned items_dict at the end. But, dup_dict is guaranteed to be smaller, 
> save for original lists with no duplications and threshold set to 1. So, 
> the distinct dup_dict way seems better for memory.
> 
> Am I overlooking yet another dict technique that would help, here? Any 
> other improvements?

In Python 2.4 I think you could do
dup_dict = dict((k, v) for k, v in items_dict.iteritems() if v >= threshold)
which I think is likely to be quite fast (but again if you care you should test).

Kent

> 
> Thanks and best to all,
> 
> Brian vdB
> 
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
> 



More information about the Tutor mailing list