How to iterate through two dicts efficiently

Tim Chase python.list at tim.thechases.com
Tue Jul 19 13:32:36 CEST 2011


On 07/19/2011 04:36 AM, J wrote:
> Someone in a different forum suggested that I use 'binary
> search' to iterate through the dictionaries

I'm not sure what they were smoking...a binary search is useful 
for finding a thing in a sorted list.  It looks like your data is 
not sorted (strike #1) and it looks like you're gathering 
statistics instead of trying to find a particular value (strike #2).

> I am looking to improve the performance of the following piece of Python code:-
>
> for cc in StatusContainer:
> 	for srv in StatusContainer[cc]:
> 		for id in StatusContainer[cc][srv]['RECV']:

I'd rewrite this to use tuple unpacking and the .iteritems() 
method of dicts (which might not have any performance 
improvements, but improves readability:

   for cc, services in StatusContainer.iteritems():
     for srv, data in services.iteritems():
       for id in data['RECV']:  # I'd also avoid shadowing "id"
         ...

> 			if id in StageContainer['13']:
> 				StatusContainer[cc][srv].setdefault('ACK', []).append(id)
> 			elif id in StageContainer['14']:
> 				StatusContainer[cc][srv].setdefault('NACK', []).append(id)

Rather than look these up each time, I'd hoist them out of the 
loops that don't change the values and give them semantic 
meaning.  You also seem to be using defaultdict() everywhere 
else, so I'd set it up so that StatusContainer[cc][srv] is a 
defaultdict(list) as well to clean up that logic:

   acks = StageContainer['13']
   nacks = StageContainer['14']
   my_504s = StageContainer['504']
   my_505s = StageContainer['505']

   for cc, services in ...
     for ...
       container = StatusContainer[cc][srv]
       for ...
         if id in acks:
           container["ACK"].append(id)
         elif id in nacks:
           container["NACK"].append(id)
         else:
           if id in my_504s or id in my_505s:
             container["RETRY"].append(id)

If your codes can only be 13, 14, 504 or 505, I'd even skip the 
test at the end:

   else:
     container["RETRY"].append(id)

but that would take knowing your data, something I can't confirm 
is safe without the full data-set.

That said, all those are minor improvements that will likely only 
minor speed-ups in your code.


> My two dictionaries look like this:- >
> StatusContainer:-
>
> defaultdict(<type 'set'>, {'FR': {'VDMS': {'RECV': ...

I'm not sure why this is showing "<type 'set'>" when the 
second-level data is clearly a dict, not a set.  Same here:

> StageContainer:-
>
> defaultdict(<type 'set'>, {'13': ['17897931-a61e-11e0-a764-00238bce423b', 'd0fcb681-a61d-11e0-e693-00238bbdc9eb'

Additionally, I notice the things you're searching through in 
your StageContainer are *lists*.  Yet for memberships testing, 
you're using "in".  This is likely where your slowdown is coming. 
  With the above changes to add semantic meaning to your 
13/14/504/505s, just convert the results to a set:

   acks = set(StageContainer['13'])
   nacks = set(StageContainer['14'])
   retries = set(StageContent['504']) + set(StageContent['505'])

...

   else:
     if id in retries:  # instead of the "id in my504 or id in my505"

By changing this data-type from a list to a set, you change from 
O(N) to O(1) for lookup time.  For the non-CS geeks, that means 
that doing an "in" on a list is related to the length of the data 
in the list, while doing an "in" on a set is a constant time 
regardless of the list-length.  Assuming your data-dumps are 
accurate and that these are lists, this will make a world of 
difference in the speed of your run.

-tkc






More information about the Python-list mailing list