[Python-Dev] Proposal: defaultdict

Guido van Rossum guido at python.org
Fri Feb 17 20:09:47 CET 2006


On 2/16/06, Guido van Rossum <guido at python.org> wrote:
> Over lunch with Alex Martelli, he proposed that a subclass of dict
> with this behavior (but implemented in C) would be a good addition to
> the language. It looks like it wouldn't be hard to implement. It could
> be a builtin named defaultdict. The first, required, argument to the
> constructor should be the default value. Remaining arguments (even
> keyword args) are passed unchanged to the dict constructor.

Thanks for all the constructive feedback. Here are some responses and
a new proposal.

- Yes, I'd like to kill setdefault() in 3.0 if not sooner.

- It would indeed be nice if this was an optional feature of the
standard dict type.

- I'm ignoring the request for other features (ordering, key
transforms). If you want one of these, write a PEP!

- Many, many people suggested to use a factory function instead of a
default value. This is indeed a much better idea (although slightly
more cumbersome for the simplest cases).

- Some people seem to think that a subclass constructor signature must
match the base class constructor signature. That's not so. The
subclass constructor must just be careful to call the base class
constructor with the correct arguments. Think of the subclass
constructor as a factory function.

- There's a fundamental difference between associating the default
value with the dict object, and associating it with the call. So
proposals to invent a better name/signature for setdefault() don't
compete. (As to one specific such proposal, adding an optional bool as
the 3rd argument to get(), I believe I've explained enough times in
the past that flag-like arguments that always get a constant passed in
at the call site are a bad idea and should usually be refactored into
two separate methods.)

- The inconsistency introduced by __getitem__() returning a value for
keys while get(), __contains__(), and keys() etc. don't show it,
cannot be resolved usefully. You'll just have to live with it.
Modifying get() to do the same thing as __getitem__() doesn't seem
useful -- it just takes away a potentially useful operation.

So here's a new proposal.

Let's add a generic missing-key handling method to the dict class, as
well as a default_factory slot initialized to None. The implementation
is like this (but in C):

def on_missing(self, key):
  if self.default_factory is not None:
    value = self.default_factory()
    self[key] = value
    return value
  raise KeyError(key)

When __getitem__() (and *only* __getitem__()) finds that the requested
key is not present in the dict, it calls self.on_missing(key) and
returns whatever it returns -- or raises whatever it raises.
__getitem__() doesn't need to raise KeyError any more, that's done by
on_missing().

The on_missing() method can be overridden to implement any semantics
you want when the key isn't found: return a value without inserting
it, insert a value without copying it, only do it for certain key
types/values, make the default incorporate the key, etc.

But the default implementation is designed so that we can write

d = {}
d.default_factory = list

to create a dict that inserts a new list whenever a key is not found
in __getitem__(), which is most useful in the original use case:
implementing a multiset so that one can write

d[key].append(value)

to add a new key/value to the multiset without having to handle the
case separately where the key isn't in the dict yet. This also works
for sets instead of lists:

d = {}
d.default_factory = set
...
d[key].add(value)

I went through several iterations to obtain this design; my first
version of on_missing() would just raise KeyError(key), requiring you
to always provide a subclass; this is more minimalistic but less
useful and would probably raise the bar for using the feature to some
extent.

To saev you attempts to simplify this, here are some near-misses I
considered that didn't quite work out:

- def on_missing(self, key):
    if self.default_factory is not None:
      return self.default_factory()
    raise KeyError(key)

This would require the multiset example to subclass, since
default_factory doesn't see the key so it can't insert it.

- def on_missing(self, key):
    if self.default_factory is not None:
      return self.default_factory(key)
    raise KeyError(key)

This appears to fix that problem, but now you can't write
"d.default_value = list" since (a) list(key) doesn't return an empty
list, and (b) it also doesn't insert the key into the dict; attempting
to assign a callback function to default_factory that solves these
issues fail because the callback doesn't have access to the dict
instance (unless there's only one).

- Do away with on_missing() and just include its body at the end of
__getitem__(), to be invoked when the key isn't found.

This is less general in case you want different default semantics
(e.g. not inserting the default, or making the default a function of
the key) -- you'd have to override __getitem__() for that, which means
you'd be paying overhead even for keys that *are* present.

I'll try to cook up an implementation on SF after I've dug myself out
of the day's email barrage.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


More information about the Python-Dev mailing list