[Python-Dev] defaultdict and on_missing()

Wed Feb 22 16:49:48 CET 2006

On 2/22/06, Raymond Hettinger <raymond.hettinger at verizon.net> wrote:
> I'm concerned that the on_missing() part of the proposal is gratuitous.  The
> main use cases for defaultdict have a simple factory that supplies a zero,
> empty list, or empty set.  The on_missing() hook is only there to support
> the rarer case of needing a key to compute a default value.  The hook is not
> needed for the main use cases.

The on_missing() hook is there to take the action of inserting the
default value into the dict. For this it needs the key.

It seems attractive to collapse default_factory and on_missing into a
single attribute (my first attempt did this, and I was halfway posting
about it before I realized the mistake). But on_missing() really needs
the key, and at the same time you don't want to lose the convenience
of being able to specify set, list, int etc. as default factories, so
default_factory() must be called without the key.

If you don't have on_missing, then the functionality of inserting the
value produced by default_factory would have to be in-lined in
__getitem__, which means the machinery put in place can't be reused
for other use cases -- several people have claimed to have a use case
for returning a value *without* inserting it into the dict.
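
A minimal pure-Python sketch of the split described above, assuming the hook behaves as discussed in this thread: __getitem__ calls on_missing(key) when the key is absent, and default_factory is called without the key. The class names HookedDict, DefaultDict and KeyEcho are hypothetical; in the proposal the hook would live in dict itself rather than in a Python-level base class.

```python
class HookedDict(dict):
    """Stand-in for a dict whose __getitem__ calls on_missing() (hypothetical name)."""

    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        return self.on_missing(key)

    def on_missing(self, key):
        # Base behaviour: a missing key is still an error.
        raise KeyError(key)


class DefaultDict(HookedDict):
    """Inserts default_factory() under the missing key, then returns it."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def on_missing(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        # The factory is called without the key, so list, set, int all work.
        self[key] = value = self.default_factory()
        return value


class KeyEcho(HookedDict):
    """Reuses the same hook to return a value *without* inserting it."""

    def on_missing(self, key):
        return key


groups = DefaultDict(list)
groups["a"].append(1)      # "a" is inserted by on_missing()

echo = KeyEcho()
echo["x"]                  # returns "x"; the dict stays empty
```

The second subclass is the point of keeping on_missing() separate: it overrides only the hook and never touches the lookup path.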

> As it stands, we're adding a method to regular dicts that cannot be usefully
> called directly.  Essentially, it is a framework method meant to be
> overridden in a subclass.  So, it only makes sense in the context of
> subclassing.  In the meantime, we've added an oddball method to the main
> dict API, arguably the most important object API in Python.

Which to me actually means it's a *good* place to put the hook
functionality, since it allows for maximum reuse.

> To use the hook, you write something like this:
>
>     class D(dict):
>         def on_missing(self, key):
>              return somefunc(key)

Or, more likely,

    class D(dict):
        def on_missing(self, key):
            self[key] = value = somefunc()
            return value

> However, we can already do something like that without the hook:
>
>     class D(dict):
>         def __getitem__(self, key):
>             try:
>                 return dict.__getitem__(self, key)
>             except KeyError:
>                 self[key] = value = somefunc(key)
>                 return value
>
> The latter form is already possible, doesn't require modifying a basic API,
> and is arguably clearer about when it is called and what it does (the former
> doesn't explicitly show that the returned value gets saved in the
> dictionary).

This is exactly what Google's internal DefaultDict does. But it is
also its downfall, because now *all* __getitem__ calls are weighed
down by going through Python code; in a particular case that came up
at Google I had to recommend against using it for performance reasons.
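
A rough way to see that cost, sketched in modern Python 3 with timeit. The class name PyGetitemDict, the list default and the loop sizes are assumptions chosen for illustration, not Google's DefaultDict; absolute timings depend on the machine and interpreter, so no numbers are claimed here.

```python
import timeit


class PyGetitemDict(dict):
    """dict subclass whose every lookup goes through Python code (hypothetical name)."""

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            self[key] = value = []
            return value


plain = dict.fromkeys(range(1000), 0)
wrapped = PyGetitemDict(plain)


def lookups(d):
    # Every key is present, so the default path is never taken;
    # any slowdown comes purely from the Python-level __getitem__.
    for k in range(1000):
        d[k]


t_plain = timeit.timeit(lambda: lookups(plain), number=200)
t_wrapped = timeit.timeit(lambda: lookups(wrapped), number=200)
print(f"plain dict: {t_plain:.4f}s  Python __getitem__: {t_wrapped:.4f}s")
```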

> Since we can already do the latter form, we can get some insight into
> whether the need has ever actually arisen in real code.  I scanned the usual
> sources (my own code, the standard library, and my most commonly used
> third-party libraries) and found no instances of code like that.   The
> closest approximation was safe_substitute() in string.Template where missing
> keys returned themselves as a default value.  Other than that, I conclude
> that there isn't sufficient need to warrant adding a funky method to the API
> for regular dicts.

In this case I don't believe that the absence of real-life examples
says much (and BTW Google's DefaultDict *is* such a real life example;
it is used in other code). There is not much incentive for subclassing
dict and overriding __getitem__ if the alternative is that in a few
places you have to write two lines of code instead of one:

    if key not in d: d[key] = set()    # this line would be unneeded
    d[key].add(value)

> I wondered why the safe_substitute() example was unique.  I think the answer
> is that we normally handle default computations through simple in-line code
> ("if k in d: do1() else do2()" or a try/except pair).  Overriding
> on_missing() then is really only useful when you need to create a type that
> can be passed to a client function that was expecting a regular dictionary.
> So it does come-up but not much.

I think the pattern hasn't been commonly known; people have been
struggling with setdefault() all these years.
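
For comparison, here is the same grouping task written with setdefault() and with the explicit membership check. The sample data and variable names are made up for illustration.

```python
pairs = [("a", 1), ("b", 2), ("a", 3)]

# setdefault(): one line, but a fresh empty set is built on every call,
# even when the key is already present and the set is thrown away.
by_setdefault = {}
for key, value in pairs:
    by_setdefault.setdefault(key, set()).add(value)

# Explicit check: two lines, and a set is created only when needed.
by_check = {}
for key, value in pairs:
    if key not in by_check:
        by_check[key] = set()
    by_check[key].add(value)

assert by_setdefault == by_check == {"a": {1, 3}, "b": {2}}
```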

> Aside:  Why on_missing() is an oddball among dict methods.  When teaching
> dicts to beginners, all the methods are easily explainable except this one.

You don't seriously teach beginners all dict methods, do you?
setdefault(), update(), copy() are all advanced material, and so are
iteritems(), itervalues() and iterkeys() (*especially* the last since
it's redundant through "for i in d:").

> You don't call this method directly, you only use it when subclassing, you
> have to override it to do anything useful, it hooks KeyError but only when
> raised by __getitem__ and not other methods, etc.

The only other methods that raise KeyError are __delitem__, pop() and
popitem(). I don't see how these could use the same hook as
__getitem__ if the only real known use case for the latter is a hook
that inserts the value -- these methods all *delete* an item, so they
would need a different hook anyway (two different hooks, really, since
__delitem__ doesn't need a value). And I can't even think of a
theoretical use case for hooking these, let alone a real one.

> I'm concerned that
> even having this method in regular dict API will create confusion about
> when to use dict.get(), when to use dict.setdefault(), when to catch a
> KeyError, or when to LBYL.  Adding this one extra choice makes the choice
> more difficult.

Well, obviously if you're not subclassing you can't use on_missing(),
so it doesn't really add much to the available choices, *unless* you
subclass, which is a choice you're likely to make in a different phase
of the design, and not lightly.

> My recommendation:  Dump the on_missing() hook.  That leaves the dict API
> unmolested and allows a more straight-forward implementation/explanation of
> collections.default_dict or whatever it ends-up being named.  The result is
> delightfully simple and easy to understand/explain.

I disagree. on_missing() is exactly the right refactoring. If we
removed on_missing() from dict, we'd have to override __getitem__ in
defaultdict (regardless of whether we give defaultdict an on_missing()
hook or in-line it). But the base class __getitem__ is a careful piece
of work! The override in defaultdict basically has two choices: invoke
dict.__getitem__ and catch the KeyError exception, or copy all the
code. (Using PyDict_GetItem would be even more wrong since it
suppresses exceptions in the hash and comparison phase of the lookup.)
Copying all the code is fraught with maintenance problems. Calling
dict.__getitem__ has the problem that it *could* raise KeyError for
reasons that have nothing to do (directly) with a missing item -- a
broken hash or comparison could also raise this, and in that case it
would be a mistake to call on_missing().
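
The catch-KeyError problem can be reproduced in a few lines. This is a sketch with hypothetical names (FlakyKey, CatchingDict): the key's __hash__ has a bug that happens to surface as KeyError, and the override cannot tell that apart from a genuinely missing item. The hook here only records the call rather than inserting, so that the misfire is visible.

```python
class FlakyKey:
    """Key whose __hash__ is buggy and raises KeyError (hypothetical)."""

    _table = {}

    def __hash__(self):
        return self._table["missing"]  # bug: this lookup raises KeyError


class CatchingDict(dict):
    """Override that calls dict.__getitem__ and catches KeyError."""

    def __init__(self):
        super().__init__()
        self.hook_calls = []

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            # Cannot distinguish "key absent" from "hash blew up".
            return self.on_missing(key)

    def on_missing(self, key):
        self.hook_calls.append(key)
        return None


d = CatchingDict()
d[FlakyKey()]                    # the hash bug is silently swallowed
assert len(d.hook_calls) == 1    # on_missing() ran although no lookup happened
```

A hook called from inside the base class lookup does not have this problem, because the lookup code itself knows whether it finished and found nothing or failed part-way.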

IMO pretty much the only reason for keeping the changes contained
within the collections module would be code modularity; but the above
argument about code reuse deconstructs that argument.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)