[Python-Dev] Proposal: defaultdict

Sun Feb 19 08:42:56 CET 2006

"Raymond Hettinger" <raymond.hettinger at verizon.net> wrote:
> [Martin v. Löwis]
> > This kind of invariant doesn't take into account
> > that there might be a default value.
> 
> Precisely.  Therefore, a defaultdict subclass violates the Liskov Substitution 
> Principle.

class defaultdict(dict):
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)
    def on_missing(self, key):
        if not hasattr(self, 'default') or not callable(self.default):
            raise KeyError(key)
        r = self[key] = self.default()
        return r

In my opinion, the above implementation as a subclass "does the right
thing" with regard to __delitem__, __contains__, get, pop, popitem,
__len__, has_key, and anything else I can think of.  Does it violate the
Liskov Substitution Principle?  Yes, but only if user code relies on
dd[key] raising a KeyError for a missing key.  This is easily remedied
by removing the default once it is no longer needed, at which point you
get your Liskov substitutability back.
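
To make "removing the default" concrete, here is a minimal sketch of the
same subclass ported to Python 3 syntax.  The name DefaultDict and the
use of getattr() are assumptions made for illustration; they are not
part of the proposal itself.

```python
class DefaultDict(dict):
    # Hypothetical Python 3 port of the subclass shown above.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

    def on_missing(self, key):
        # No default set (or not callable): behave like a plain dict.
        default = getattr(self, "default", None)
        if not callable(default):
            raise KeyError(key)
        value = self[key] = default()
        return value


dd = DefaultDict()
dd.default = list
dd["a"].append(1)      # missing key: the default is created and stored

del dd.default         # remove the default once accumulation is done
try:
    dd["b"]
    raised = False
except KeyError:
    raised = True      # plain-dict behaviour is back
```

Once the attribute is deleted, a missing key raises KeyError again and
nothing is stored for it.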

> Of course, the __del__ followed __contains__ sequence is not the only invariant 
> that is thrown-off.  There are plenty of examples.  Here's one that is 
> absolutely basic to the method's contract:
> 
>     k, v = dd.popitem()
>     assert k not in dd
> 
> Any code that was expecting a dictionary and uses popitem() as a means of 
> looping over and consuming entries will fail.

>>> a = defaultdict()
>>> a.default = list
>>> a['hello']
[]
>>> k, v = a.popitem()
>>> assert k not in a
>>> 

Seems to work for the above implementation.

> No one should kid themselves that a default dictionary is a drop-in substitute. 
> Much of the dict's API has an ambiguous meaning when applied to defaultdicts.

Actually, if one is careful, the dict's API is completely unchanged,
except for direct access to the object via b = a[i].

>>> del a['hello']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'hello'
>>> 'hello' in a
False
>>> a.get('hello')
>>> a.pop('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'pop(): dictionary is empty'
>>> a.popitem()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'popitem(): dictionary is empty'
>>> len(a)
0
>>> a.has_key('hello')
False

> If all keys are in-theory predefined, what is the meaning of len(dd)?

It depends on the sequence of actions.  Play around with the above
defaultdict implementation.  From what I understood of Guido's original
post, this is essentially what he was proposing, only implemented in C.
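
As a rough illustration of "it depends on the sequence of actions", the
sketch below records len() after each kind of operation.  It reuses a
Python 3 port of the subclass under the assumed name DefaultDict; only
subscript access is expected to create an entry.

```python
class DefaultDict(dict):
    # Hypothetical Python 3 port of the subclass discussed in this thread.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

    def on_missing(self, key):
        default = getattr(self, "default", None)
        if not callable(default):
            raise KeyError(key)
        value = self[key] = default()
        return value


dd = DefaultDict()
dd.default = int

lengths = [len(dd)]        # nothing stored yet
"x" in dd                  # a membership test does not create an entry
dd.get("x")                # neither does get()
lengths.append(len(dd))
dd["x"]                    # subscript access stores the default
lengths.append(len(dd))
dd.pop("x")                # pop() removes it again
lengths.append(len(dd))
```

So len(dd) counts only the keys that were actually stored, whether by
assignment or by a defaulting lookup.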

> Should dd.items() include any entries where the value is equal to the default or 
> should the collection never store those?

The former: it should store any value that was stored via 'dd[k]=v',
and any default value created via access by 'v=dd[k]'.

> If the former, then how do you access 
> the entries without looping over the whole contents?

Presumably one is looking for a single kind of default (empty list, 0,
etc.) because one wanted to accumulate into them, similar to one of the
following...

    for item, value in input:
        try:
            d[item] += value
            #or d[item].append(value)
        except KeyError:
            d[item] = value
            #or d[item] = [value]

which becomes

    for item, value in input:
        dd[item] += value
        #or dd[item].append(value)
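
A small worked example of that accumulation, assuming a Python 3 port
of the subclass (the name DefaultDict and the sample pairs are made up
for illustration):

```python
class DefaultDict(dict):
    # Hypothetical Python 3 port of the subclass discussed in this thread.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

    def on_missing(self, key):
        default = getattr(self, "default", None)
        if not callable(default):
            raise KeyError(key)
        value = self[key] = default()
        return value


pairs = [("a", 1), ("b", 2), ("a", 3)]   # illustrative input

totals = DefaultDict()
totals.default = int                     # missing keys start at 0
groups = DefaultDict()
groups.default = list                    # missing keys start as []

for item, value in pairs:
    totals[item] += value
    groups[item].append(value)

# Iteration afterwards works exactly as with a regular dictionary.
summary = sorted(totals.items())
```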

Once accumulation has occurred, iteration over them via .iteritems(),
.items(), .popitem(), etc., would progress exactly the same way as with
a regular dictionary.  If the code which is using the accumulated data
does things like...

    for key in wanted_keys:
        try:
            value = dd[key]
        except KeyError:
            continue
        #do something nontrivial with value

rather than...

    for key in wanted_keys:
        if key not in dd:
            continue
        value = dd[key]
        #do something nontrivial with value

Then the user has at least three options to make it 'work right':
1. User can switch to testing membership with 'in' rather than relying
on a KeyError.
2. User can remember to remove the default.
3. User can create a copy of the default dictionary via dict(dd) and
pass it into the code which relies on the non-defaulting dictionary.
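
The three options can be sketched as follows.  This assumes a Python 3
port of the subclass; DefaultDict and lookup_all are hypothetical names,
with lookup_all standing in for code that relies on KeyError.

```python
class DefaultDict(dict):
    # Hypothetical Python 3 port of the subclass discussed in this thread.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

    def on_missing(self, key):
        default = getattr(self, "default", None)
        if not callable(default):
            raise KeyError(key)
        value = self[key] = default()
        return value


def lookup_all(mapping, keys):
    # Stand-in for consumer code that relies on KeyError for missing keys.
    found = []
    for key in keys:
        try:
            mapping[key]
        except KeyError:
            continue
        found.append(key)
    return found


dd = DefaultDict()
dd.default = list
dd["seen"].append(1)
wanted = ["seen", "unseen"]

# Option 1: test membership with 'in'; no default is ever created.
found_1 = [key for key in wanted if key in dd]

# Option 3: hand the consumer a plain-dict copy.
plain = dict(dd)
found_3 = lookup_all(plain, wanted)

# Option 2: remove the default, then pass the object itself.
del dd.default
found_2 = lookup_all(dd, wanted)
```

In all three cases only the key that was really stored is reported, and
no entry is created for "unseen".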

> If the latter, then do you 
> worry that "dd[k]=v" does not imply "(k,v) in dd.items()"?

I personally wouldn't want the latter.

My post probably hasn't convinced you, but much of the confusion, I
believe, stems from Martin's original position that 'k in dd' should
always return True if there is a default.  One can argue that way, but
then you end up on the circular train of thought that gets you to "you
can't do anything useful if that is the case, .popitem() doesn't work,
len() is undefined, ...".  Keep it simple, keep it sane.

 - Josiah