[Baypiggies] Notes on defaultdict

Glen Jarvis glen at glenjarvis.com
Fri Aug 27 08:16:22 CEST 2010

Here are notes with code that I used in my Newbie Nugget presentation today.
This will give you time to review the code line by line (good suggestion
Wesley) and get a feel for what is happening.

As a reminder:

The input is:
input = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

And, we want an output dictionary so that something like this is reported:

output['red'] = [1]
output['blue'] = [2, 4]
output['yellow'] = [1, 3]

There are lots of ways to do this, and I invite other suggestions/input.

If we were to start with these few lines, we'll see some limited success:

Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> input = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red',
>>> output = {}
>>> output['yellow'] = [1]
>>> output['yellow'].append(3)
>>> print output
{'yellow': [1, 3]}

But, how do we sort this out so that we always initialize the value to be a
list value If we forget, we'll see something like this

[continuing from above]
>>> print output
{'yellow': [1, 3]}
>>> output['blue'].append(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blue'

We can't append something to the stuff that is associated with the 'blue'
key, because nothing exists yet for the blue key. We need to initialize it
before we can append to it.

So, what we really want is to have some type of default behavior given to
us. For example,

Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> input = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red',
>>> output = {}
>>> for color, value in input:
...     if output.has_key(color):
...         output[color].append(value)
...     else:
...         output[color] = [value]
>>> print output
{'blue': [2, 4], 'red': [1], 'yellow': [1, 3]}

One could argue that this is just fine -- this is very clear, and it works.
And, frankly, there's merit to that argument.

However, if we did this many times with many different types of
dictionaries, and we couldn't really make a generic routine that we could
use to handle this, we'll have this same pattern over and over:

if something.has_key(…):
    something[…] = <normal behavior>
    something[…] = <some initializing behavior>

What if there was another design pattern to do this? What if you could
somehow set a 'default' behavior so that, if the key didn't exist, you
automatically got, in our example, a list so you could append to it. That
is, we'd have code that looked like this:


This does fit our brain, because we really just want to append values as
we're going and not have to think of the special case of (oh, if the key
exists do this; or if it doesn't, do that; etc.)

The collections.defaultdict object does the above. However, before we talk
about it -- there will be some in the audience that will say, "Wait a
minute…  This is what setdefault does." And, they'd be right.

So, let's take a break from this newbie nugget to give you another newbie
nugget -- the setdefult (unless you already know about it).

The set default method can be used on a dictionary to set the default for a
key. And, it can be done multiple times without causing a problem. So, for
my output dictionary, if I set the default to a list (for the key yellow,
for example), then whenever I access the output dictionary through the
yellow key, I get a dictionary automatically. And, thus, we can tack an
'append' on to that list that's given to us by default. Here's an example:

Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> input = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red',
>>> output = {}
>>> for color, value in input:
...     output.setdefault(color, []).append(value)
>>> print output
{'blue': [2, 4], 'red': [1], 'yellow': [1, 3]}

For each key in the dictionary, the default value is a list and the value is
appended to it. Although, it does seem we're having to set the default for
every single key -- not for any key in the dictionary.

But, that's a good transition to the collections.defaultdict because that's
exactly what it does.

>>> from collections import defaultdict
>>> input = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red',
>>> output = defaultdict(list)
>>> for color, value in input:
...     output[color].append(value)
>>> print output
defaultdict(<type 'list'>, {'blue': [2, 4], 'red': [1], 'yellow': [1, 3]})

Don't be scared about this extra output -- just notice that the colors have
the right lists. If you wanted to print these out by keys, you'll still get
the results that you want:

>>> print output['blue']
[2, 4]

What a default dictionary type is, it's simply a subclassed version of the
dictionary that does a few special things.

You give it a basic 'factory function' -- in our case a list -- and that's
what's initialized for a key the first time it is accessed. We don't have to
worry if the key was previously initialized, etc..

Some will argue that this is less explicit -- and I would probably have been
one of them when I first started preparing this newbie nugget. But, after
doing over this a few times to explain it, I've finally started to come on
board to the idea. We're explicitly stating the default type when we
initialize the defaultdict variable.

I especially like it that the default dict gives a default type for any key
-- not for a specific key as setdefault seems to do (correct me if I
misspeak here).

But, more than anything, it's *much faster* than doing it the other ways..
But, I want you to prove that to yourself and to me… I have created a 1 GB
file with colors and numbers. I'll give you this file so that you can read
it in and create the dictionary of your choice, with the method of your

You can time how long it takes for dictionary creation, with the list
appending, and everything.

Remember that your dictionary will be in memory and will be about a gig of
ram also. If your machine doesn't have those resources, read only half the
file. In fact, it would be interesting to do this for a random number of
records in the file, multiple times, to get a feel for how this grows for
each data type.

Come on, it'll be good practice if you've never done this before… Lets
compare notes when you're finished.

Whatever you can do or imagine, begin it;
boldness has beauty, magic, and power in it.

-- Goethe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/baypiggies/attachments/20100826/46f370cf/attachment-0001.html>

More information about the Baypiggies mailing list