[Tutor] Re: Unique Items in Lists
Kent Johnson
kent37 at tds.net
Thu Jan 27 22:08:59 CET 2005
Brian van den Broek wrote:
> incorporating some of Wolfram's and Kent's (hope I've missed no one)
> suggestions:
>
> <code>
> def dups_in_list_report(a_list):
> '''Prints a duplication report for a list.'''
>
> items_dict = {}
>
> for i in a_list:
> items_dict[i] = items_dict.get(i, 0) + 1
>
> for k, v in sorted(items_dict.iteritems()): # cf below
> if v > 1:
> print '%s occurred %s times' %(k, v)
> </code>
>
> And, I can't but agree that this is much better! Thanks folks.
>
> In trying to improve the code, I first had:
>
> for key in sorted(items_dict.keys()):
> if items_dict[key] > 1:
> print '%s occurred %s times' %(key, items_dict[key])
>
> in place of the for loop over .iteritems(). Am I right in thinking that
> the advantage of Kent's suggestion of .iteritems() is that it eliminates
> some of the dict lookups? Other advantages?
iteritems() doesn't create an intermediate list so that should be a slight time and memory savings.
And my guess is that it is faster because it saves the lookups but if you care you should test.
Mostly I prefer iteritems() because it seems cleaner and clearer to me.
>
> Finally, in the first instance, I was aiming for the OP's stated end. To
> make this more general and reusable, I think I'd do:
>
> <code>
> def get_list_dup_dict(a_list, threshold=1):
> '''Returns a dict of items in list that occur threshold many times
>
> threshold defaults to 1. The dict returned has items occurring at least
> threshold many times as keys, and number of occurrences as values.
> '''
>
> items_dict, dup_dict = {}, {} # Question below
>
> for i in a_list:
> items_dict[i] = items_dict.get(i, 0) + 1
>
> for k, v in items_dict.iteritems():
> if v >= threshold:
> dup_dict[k] = v #Question below
>
> return dup_dict
>
> def print_list_dup_report(a_list, threshold=1):
> '''Prints report of items in a_list occurring at least threshold
> many times
>
> threshold defaults to 1. get_list_dup_dict(a_list, threshold=0) is
> called.
> returning a dict of items in list that occur at least threshold many
> times
> as keys and their number of repetitions as values.
>
> This dict is looped over to print a sorted and formatted duplication
> report.
> '''
>
> dup_dict = get_list_dup_dict(a_list, threshold)
> for k, v in sorted(dup_dict.iteritems()):
> print '%s occurred %s times' %(k, v)
> </code>
>
>
> My question (from comment in new code):
>
> Since I've split the task into two functions, one to return a
> duplication dictionary, the other to print a report based on it, I think
> the distinct dup_dict is needed. (I do want the get_list_dup_dict
> function to return a dict for possible use in other contexts.)
You could have a function that gets all of the counts, and then filter on threshold when you print.
>
> The alternative would be to iterate over a .copy() of the items_dict and
> delete items not meeting the threshold from items_dict, returning the
> pruned items_dict at the end. But, dup_dict is guaranteed to be smaller,
> save for original lists with no duplications and threshold set to 1. So,
> the distinct dup_dict way seems better for memory.
>
> Am I overlooking yet another dict technique that would help, here? Any
> other improvements?
In Python 2.4 I think you could do
dup_dict = dict((k, v) for k, v in items_dict.iteritems() if v >= threshold)
which I think is likely to be quite fast (but again if you care you should test).
Kent
>
> Thanks and best to all,
>
> Brian vdB
>
> _______________________________________________
> Tutor maillist - Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>
More information about the Tutor
mailing list