How to iterate through two dicts efficiently

Dave Angel davea at ieee.org
Tue Jul 19 07:39:39 EDT 2011


On 01/-10/-28163 02:59 PM, J wrote:
> Hello,
>
> I am looking to improve the performance of the following piece of Python code:-
>
> for cc in StatusContainer:
> 	for srv in StatusContainer[cc]:
> 		for id in StatusContainer[cc][srv]['RECV']:
> 			if id in StageContainer['13']:
> 				StatusContainer[cc][srv].setdefault('ACK', []).append(id)
> 			elif id in StageContainer['14']:
> 				StatusContainer[cc][srv].setdefault('NACK', []).append(id)
> 			else:
> 				if id in StageContainer['504'] or id in StageContainer['505']:
> 					StatusContainer[cc][srv].setdefault('RETRY', []).append(id)
>
> I am new to programming and I decided that I would use python as my first programming language.  I have written couple of (very basic but useful) scripts and now I am looking to optimise the performance of some of these scripts so I can make use of them.
>
> The code above is an extract from a bigger script which I am using to count how many message IDs have been received, sent or rejected per service per country.  The code is iterating through two dictionaries (a status dictionary which is a set of sub-dictionaries and a stages dictionary which again is a set of sub-dictionaries), if an entry (in this case an id) is found in the status dictionary, I am performing a search for that id in the stages dictionary.  Depending where in the stages dictionary the id is found (which stage), then I am creating a new sub-dictionary inside the status dictionary and I append the id.
>
> At the end, all I do is count how many IDs each sub-dictionary is holding (e.g: len(StatusContainer[cc][srv]['RECV'])).
>
> My two dictionaries look like this:-
>
> StatusContainer:-
>
> defaultdict(<type 'set'>, {'FR': {'VDMS': {'RECV': ['cad1c651-a61e-11e0-9f43-00238bbdc9eb', 'df25a1d1-a61e-11e0-9f43-00238bbdc9eb', '1a9452c0-a61f-11e0-9f43-00238bbdc9eb', '1a9b3090-a61f-11e0-9f43-00238bbdc9eb']}}, 'PT': {'VDMS': {'RECV': ['2410e980-a61f-11e0-cffc-00238bbdc9e7', '24146bf0-a61f-11e0-cffc-00238bbdc9e7', '29c5f0f0-a61f-11e0-cffc-00238bbdc9e7', '2ebe5f20-a61f-11e0-cffc-00238bbdc9e7']}}, 'CL': {'VDMS': {'RECV': ['379cfda0-a61e-11e0-9bd6-00238bbdc9e7', 'ff60f990-a61e-11e0-9bd6-00238bbdc9e7', '1a7e80d0-a61f-11e0-9bd6-00238bbdc9e7', '7ab64f00-a61f-11e0-9bd6-00238bbdc9e7']}}})
>
> StageContainer:-
>
> defaultdict(<type 'set'>, {'13': ['17897931-a61e-11e0-a764-00238bce423b', 'd0fcb681-a61d-11e0-e693-00238bbdc9eb', '178819a1-a61e-11e0-a764-00238bce423b', '17161df1-a61e-11e0-b558-00238bbdc9e7', '15eb5991-a61e-11e0-8aeb-00238bbdc9b7', '1872eed0-a61e-11e0-de66-00238bce424c', 'b3e470b1-a61d-11e0-b558-00238bbdc9e7'], '14': ['1a98dc11-a61e-11e0-a9e7-00238bce423b', 'da485231-a61d-11e0-ff8f-00238bce423b', '1c245e11-a61e-11e0-8aeb-00238bbdc9b7', '1eab5711-a61e-11e0-da5a-00238bce423b', '8db533c1-a61d-11e0-9393-00238bbdc9b7', 'e0e066a1-a61d-11e0-b558-00238bbdc9e7']})
>
> I ran this piece of code through the python profiler and I see that the code is taking on average around 1300 CPU secs to execute, depending on the size of the dicts.  The problem is that both dictionaries will always hold hundreds of thousands of entries, so I need to re-write this code to be more efficient or I need to find a different way to achieve my goal.
>
> Someone in a different forum suggested that I use 'binary search' to iterate through the dictionaries but like I mentioned before, I am still very new to programming and 'binary search' sounds a bit to complicated to me.  I was wondering if someone here can have a look at my code above and maybe re-write it using 'binary search' or any other way you may see fit and re-post the code here with some comments on what each section is doing?
>
> Your help would be very much appreciated.
>
> Regards,
>
> Junior
>
It would have been better if the code you submit actually did anything.  
None of the values in your StatusContainer show up in the 
StageContainer, so the inner loop in your code never stores anything.

You mention that 1300 seconds is taken for this loop.  But what fraction 
of the total is that?  Is this really the bottleneck?

Assuming it is, the most important single thing i'd be doing is to 
reorganize StageContainer.  You have the default type being set, but you 
actually store lists as values instead.  The value of item '13' is a 
list of values, when you probably meant it to be a set.

You could fix that for testing with something like:


StageContainer["13"] = set(StageContainer["13"])
StageContainer["14"] = set(StageContainer["14"])
StageContainer["504"] = set(StageContainer["504"])
StageContainer["505"] = set(StageContainer["505"])

though of course it'd be better to change the logic which builds those sets.

The second thing I'd do is to change the loops to iterate over the items 
of the list, instead of the keys.  For example, change

		for id in StatusContainer[cc][srv]['RECV']:



to something like:
                  for  key, val in StatusContainer[cc][srv].items()
                         id = val['RECV']

then below you'd have lines like:

                       val.setdefault('ACK', []).append(id)

Obviously you'd pick a better name than key and val.

This last change should help some, but not nearly as much as the first 
one.  You could continue to the outer loops, but it probably won't make 
much difference.

The code listed above is untested, so it probably has typos, or maybe 
even gross errors.

DaveA





More information about the Python-list mailing list