A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):

    if key not in d:
        d[key] = []
    d[key].append(value)

An alternative way to spell this uses setdefault(), but it's not very readable:

    d.setdefault(key, []).append(value)

and it also suffers from creating an unnecessary list instance. (Timings were inconclusive; the approaches are within 5-10% of each other in speed.)

My conclusion is that setdefault() is a failure -- it was a well-intentioned construct, but doesn't actually create more readable code.

Google has an internal data type called a DefaultDict which gets passed a default value upon construction. Its __getitem__ method, instead of raising KeyError, inserts a shallow copy (!) of the given default value into the dict when the value is not found. So the above code, after

    d = DefaultDict([])

can be written as simply

    d[key].append(value)

Note that of all the possible semantics for __getitem__ that could have produced similar results (e.g. not inserting the default in the underlying dict, or not copying the default value), the chosen semantics are the only ones that make this example work.

Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.

Some more design subtleties:

- "key in d" still returns False if the key isn't there
- "d.get(key)" still returns None if the key isn't there
- "d.default" should be a read-only attribute giving the default value

Feedback?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
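A minimal Python sketch of the semantics described above (the class name and copy behaviour follow the description in the post; this is an illustration, not Google's actual code):

```python
import copy

class DefaultDict(dict):
    # On a missing key, insert a *shallow copy* of the default into
    # the dict, then return it -- the only semantics that make
    # d[key].append(value) work as intended.
    def __init__(self, default, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self.default = default

    def __getitem__(self, key):
        if key not in self:                      # "key in d" is still False before this
            self[key] = copy.copy(self.default)  # shallow copy (!)
        return dict.__getitem__(self, key)

d = DefaultDict([])
d['key'].append(1)
d['key'].append(2)
```

Note that .get() and "in" are untouched, matching the design subtleties listed above, and that without the copy every missing key would end up sharing the one default list.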
On Thu, Feb 16, 2006 at 01:11:49PM -0800, Guido van Rossum wrote:
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Should a dict subclass really change the constructor/initializer signature in an incompatible way?

--
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
On Thu, Feb 16, 2006 at 01:11:49PM -0800, Guido van Rossum wrote:
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Should a dict subclass really change the constructor/initializer signature in an incompatible way?
Dict is a particularly difficult type to subclass anyway, given that it can take an arbitrary number of arbitrarily-named keyword arguments (among many other argument styles). The proposed behavior is exactly how Icon tables behaved, and it was indeed useful in that language. Guido is right about setdefault being a busted flush.

If there's no way to resolve the signature issue (which there may not be, given that

    dict({'one': 2, 'two': 3})
    dict({'one': 2, 'two': 3}.items())
    dict({'one': 2, 'two': 3}.iteritems())
    dict(zip(('one', 'two'), (2, 3)))
    dict([['two', 3], ['one', 2]])
    dict(one=2, two=3)
    dict([(['one', 'two'][i-2], i) for i in (2, 3)])

are all valid calls to the type) then a factory function would be a very acceptable substitute, no? (The function could make use of a subclass - there's surely no necessity to provide the default as an initializer argument: it could be provided as an argument to a method present only in the subclass.)

wishing-i-could-have-lunch-with-alex-ly y'rs - steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Guido> Over lunch with Alex Martelli, he proposed that a subclass of
Guido> dict with this behavior (but implemented in C) would be a good
Guido> addition to the language.

Instead, why not define setdefault() the way it should have been done in the first place? When you create a dict it has the current behavior. If you then call its setdefault() method that becomes the default value for missing keys.

    d = {'a': 1}
    d['b']          # raises KeyError
    d.get('c')      # evaluates to None
    d.setdefault(42)
    d['b']          # evaluates to 42
    d.get('c')      # evaluates to 42

For symmetry, setdefault() should probably be undoable: deldefault(), removedefault(), nodefault(), default_free(), whatever.

The only question in my mind is whether or not getting a non-existent value under the influence of a given default value should stick that value in the dictionary or not.

down-with-more-builtins-ly, y'rs,

Skip
skip@pobox.com wrote:
Guido> Over lunch with Alex Martelli, he proposed that a subclass of Guido> dict with this behavior (but implemented in C) would be a good Guido> addition to the language.
Instead, why not define setdefault() the way it should have been done in the first place? When you create a dict it has the current behavior. If you then call its setdefault() method that becomes the default value for missing keys.
That puts it off until 3.0.
From what I read I think defaultdict won't become a builtin anyway.
Georg
Quoting skip@pobox.com:
Guido> Over lunch with Alex Martelli, he proposed that a subclass of Guido> dict with this behavior (but implemented in C) would be a good Guido> addition to the language.
Instead, why not define setdefault() the way it should have been done in the first place? When you create a dict it has the current behavior. If you then call its setdefault() method that becomes the default value for missing keys.
    d = {'a': 1}
    d['b']          # raises KeyError
    d.get('c')      # evaluates to None
    d.setdefault(42)
    d['b']          # evaluates to 42
    d.get('c')      # evaluates to 42
For symmetry, setdefault() should probably be undoable: deldefault(), removedefault(), nodefault(), default_free(), whatever.
Well, first not to break the current interface, and second because I think it reads better, I would prefer:

    d = {'a': 1}
    d['b']          # raises KeyError
    d.get('c')      # evaluates to None
    d.default = 42
    d['b']          # evaluates to 42
    d.get('c')      # evaluates to 42

And to undo the default, you can simply do:

    del d.default

And of course, you can get the current value: d.default

But then, as proposed many times, I would rather see a function call. Like:

    d.default = lambda key: 42

The argument of the function is the current key. It would allow things like:

    d.default = time_consuming_operation

where time_consuming_operation gets a single argument.
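The assignable-attribute idea above can be sketched as a small subclass (the class name is hypothetical, and this sketch only hooks __getitem__, whereas the proposal would make .get() consult the default too):

```python
class DefaultAttrDict(dict):
    # An assignable (and del-able) .default attribute, consulted on
    # lookup misses.  A callable default is invoked with the key.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            if not hasattr(self, 'default'):
                raise                        # no default installed: normal KeyError
            if callable(self.default):
                return self.default(key)     # per-key default
            return self.default

d = DefaultAttrDict(a=1)
d.default = 42
x = d['b']                        # 42, and nothing is stored
d.default = lambda key: key.upper()
y = d['spam']                     # 'SPAM'
del d.default                     # ordinary dict behaviour again
```

Because nothing is ever inserted, "key in d" and len(d) behave exactly as for a plain dict.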
The only question in my mind is whether or not getting a non-existent value under the influence of a given default value should stick that value in the dictionary or not.
down-with-more-builtins-ly, y'rs,
Skip

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/pierre.barbier%40cirad.fr
Quoting skip@pobox.com:
The only question in my mind is whether or not getting a non-existent value under the influence of a given default value should stick that value in the dictionary or not.
It seems to me that there are at least two types of default dicts, which have opposite answers to that question.

One is a 'universal dict' that maps every key to something -- the default if nothing else. That should not have the default ever explicitly entered. Udict.keys() should only give the keys *not* mapped to the universal value.

Another is the accumulator dict. The default value is the identity (0, [], or whatever) for the type of accumulation. An adict must have the identity added, even though that null will usually be immediately incremented by +=1 or .append(ob) or whatever.

Guido's last proposal was for the default default_dict to cater to the second type (and others needing the same behavior) while catering to the first by making the default fill-in method over-rideable.

If we go with, for instance, wrappers in the collections module instead of modification of dict, then perhaps there should be at least two wrappers included, with each of these two behaviors.

Terry Jan Reedy
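The two variations distinguished above might be sketched as separate wrappers (both class names are hypothetical):

```python
class AccumulatorDict(dict):
    # The 'accumulator dict': a miss *inserts* the identity (0, [], ...)
    # so that += or .append can run on it.
    def __init__(self, identity_factory):
        dict.__init__(self)
        self.identity_factory = identity_factory

    def __getitem__(self, key):
        if key not in self:
            self[key] = self.identity_factory()   # the null IS stored
        return dict.__getitem__(self, key)

class UniversalDict(dict):
    # The 'universal dict': every key maps to something, but the
    # default is never stored and never shows up in .keys().
    def __init__(self, universal_value):
        dict.__init__(self)
        self.universal_value = universal_value

    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        return self.universal_value               # nothing inserted

a = AccumulatorDict(int)
a['x'] += 1                  # inserts 0, then stores 1
u = UniversalDict(0)
z = u['anything']            # 0, and u stays empty
```

The opposite answers to the "should the default be stored?" question fall straight out of the two __getitem__ bodies.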
[Terry Reedy]
One is a 'universal dict' that maps every key to something -- the default if nothing else. That should not have the default ever explicitly entered. Udict.keys() should only give the keys *not* mapped to the universal value.
Would you consider it a mapping invariant that "k in dd" implies "k in dd.keys()"? Is the notion of __contains__ at odds with notion of universality? Raymond
"Raymond Hettinger" <raymond.hettinger@verizon.net> wrote in message news:005301c63523$b751b2b0$b83efea9@RaymondLaptop1...
[Terry Reedy]
One is a 'universal dict' that maps every key to something -- the default if nothing else. That should not have the default ever explicitly entered. Udict.keys() should only give the keys *not* mapped to the universal value.
Would you consider it a mapping invariant that "k in dd" implies "k in dd.keys()"?
Is the notion of __contains__ at odds with notion of universality?
No and not sure. I'll leave it to Martin v. Löwis to explain/defend his particular notion of a udict. My main point is that there are multiple legitimate variations of the notion of a default dict, so that there is no 'one right way' to design one. I notice that Michael Urman and Ian Bicking said much the same today. Of course, having said that different variations are useful in different situations, I would nowise claim that all variations can serve as drop-in replacements for regular dicts everywhere they are now used. I think that a new default-dict feature should cater to such variations. Beyond that, I don't know whether it is better to modify dict (with blank hooks) or add a new subclassable default-dict base type. Terry Jan Reedy
On Sat, 2006-02-18 at 12:53 +0100, Pierre Barbier de Reuille wrote:
Guido> Over lunch with Alex Martelli, he proposed that a subclass of Guido> dict with this behavior (but implemented in C) would be a good Guido> addition to the language.
I agree that .setdefault() is a well-intentioned failure, although I'm much less concerned about any potential performance impact than the fact that it's completely unreadable. And while I like the basic idea, I also agree that deriving from dict is problematic, both because the constructor signature is tough to forward, and because dict is such a fundamental type that APIs that return dicts may have to be changed to allow passing in a factory type. I'd rather like to see what Pierre proposes, with a few minor differences.
Well, first not to break the current interface, and second because I think it reads better, I would prefer:
    d = {'a': 1}
    d['b']          # raises KeyError
    d.get('c')      # evaluates to None
    d.default = 42
    d['b']          # evaluates to 42
    d.get('c')      # evaluates to 42
So far so good.
And to undo the default, you can simply do :
del d.default
Although this I'm not crazy about. If you let .default be a callable, you could also write this as

    def keyerror():
        raise KeyError
    d.default = keyerror

or possibly just this as a shortcut:

    d.default = KeyError
The only question in my mind is whether or not getting a non-existent value under the influence of a given default value should stick that value in the dictionary or not.
Agreed. I'm not sure whether .get(onearg) should return None or .default. I /think/ I want the latter, but I'd have to play with some real code to know for sure. -Barry
Guido van Rossum wrote:
The first, required, argument to the constructor should be the default value.
I'd like to suggest that this argument be a function for creating default values, rather than an actual default value. This would avoid any confusion over exactly how the default value is copied. (Shallow or deep? How deep?) In an earlier discussion it was pointed out that this would be no less convenient for many common use cases, e.g. in your example, d = defaultdict(list) Also I'm not sure about the name "defaultdict". When I created a class like this recently, I called it an "autodict" (i.e. a dict that automatically extends itself with new entries). And perhaps the value should be called an "initial value" rather than a default value, to more strongly suggest that it becomes a permanent part of the dict. Greg
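The factory-argument variant suggested above can be sketched in a few lines (hypothetical class name; this is essentially the design that later shipped as collections.defaultdict in Python 2.5):

```python
class FactoryDict(dict):
    # The constructor takes a callable that builds initial values,
    # side-stepping the shallow-vs-deep copy question entirely:
    # each miss gets a fresh value, no copying involved.
    def __init__(self, factory, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self.factory = factory

    def __getitem__(self, key):
        if key not in self:
            self[key] = self.factory()   # fresh value per miss
        return dict.__getitem__(self, key)

d = FactoryDict(list)
d['a'].append(1)
d['b'].append(2)
```

For the common use cases (list, dict, int, set) this is exactly as convenient as passing a default value, and the "initial value" becomes a permanent part of the dict, as suggested.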
Guido van Rossum wrote:
Feedback?
I would like this to be part of the standard dictionary type, rather than being a subtype. d.setdefault([]) (one argument) should install a default value, and d.cleardefault() should remove that setting; d.default should be read-only. Alternatively, d.default could be assignable and del-able. Also, I think has_key/in should return True if there is a default. Regards, Martin
Martin v. Löwis wrote:
Guido van Rossum wrote:
Feedback?
I would like this to be part of the standard dictionary type, rather than being a subtype.
d.setdefault([]) (one argument) should install a default value, and d.cleardefault() should remove that setting; d.default should be read-only. Alternatively, d.default could be assignable and del-able.
Also, I think has_key/in should return True if there is a default.
And exactly what use would it then be?

Michael Foord
Regards,
Martin
Fuzzyman wrote:
Also, I think has_key/in should return True if there is a default.
And exactly what use would it then be ?
Code that checks

    if d.has_key(k):
        print d[k]

would work correctly. IOW, you could use a dictionary with a default key just as if it were a normal dictionary - which is a useful property, IMO.

Regards,
Martin
>> Also, I think has_key/in should return True if there is a default.

Fredrik> and keys should return all possible key values!

I think keys() and in should reflect reality. Only when you do something like

    x = d['nonexistent']

or

    x = d.get('nonexistent')

should the default value come into play.

Skip
On Fri, 17 Feb 2006 08:09:23 +0100, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Guido van Rossum wrote:
Feedback?
I would like this to be part of the standard dictionary type, rather than being a subtype.
d.setdefault([]) (one argument) should install a default value, and d.cleardefault() should remove that setting; d.default should be read-only. Alternatively, d.default could be assignable and del-able.

I like the latter, but

    d.default_factory = callable  # or None

Also, I think has_key/in should return True if there is a default.

That seems iffy. ISTM potential should not define actual status.
Regards, Bengt Richter
Martin v. Löwis wrote:
Guido van Rossum wrote:
Feedback?
I would like this to be part of the standard dictionary type, rather than being a subtype.
d.setdefault([]) (one argument) should install a default value, and d.cleardefault() should remove that setting; d.default should be read-only. Alternatively, d.default could be assignable and del-able.
The issue with setting the default this way is that a copy would have to be created if the behavior was to differ from the sometimes-confusing default argument behavior for functions.
Also, I think has_key/in should return True if there is a default.
It certainly seems desirable to see True where d[some_key] doesn't raise an exception, but one could argue either way. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC www.holdenweb.com PyCon TX 2006 www.python.org/pycon/
Also, I think has_key/in should return True if there is a default.
It certainly seems desirable to see True where d[some_key] doesn't raise an exception, but one could argue either way.
Some things can be agreed by everyone:

* if __contains__ always returns True, then it is a useless feature (since scripts containing a line such as "if k in dd" can always eliminate that line without affecting the algorithm).

* if defaultdicts are supposed to be drop-in dict substitutes, then having __contains__ always return True will violate basic dict invariants:

      del d[some_key]
      assert some_key not in d

Raymond
Raymond Hettinger wrote:

> * if __contains__ always returns True, then it is a useless feature (since
> scripts containing a line such as "if k in dd" can always eliminate that line
> without affecting the algorithm).

If you mean "if __contains__ always returns True for a default dict, then it is a useless feature", I disagree. The code using "if k in dd" cannot be eliminated if you don't know that you have a default dict.

> * if defaultdicts are supposed to be drop-in dict substitutes, then having
> __contains__ always return True will violate basic dict invariants:
>     del d[some_key]
>     assert some_key not in d

If you have a default value, you cannot ultimately del a key. This sequence is *not* a basic mapping invariant. If it was, then it would be also an invariant that, after del d[some_key], d[some_key] will raise a KeyError. This kind of invariant doesn't take into account that there might be a default value.

Regards,
Martin
[Martin v. Löwis]
If you have a default value, you cannot ultimately del a key. This sequence is *not* a basic mapping invariant.
You believe that key deletion is not basic to mappings?
This kind of invariant doesn't take into account that there might be a default value.
Precisely. Therefore, a defaultdict subclass violates the Liskov Substitution Principle.

Of course, the __del__ followed by __contains__ sequence is not the only invariant that is thrown off. There are plenty of examples. Here's one that is absolutely basic to the method's contract:

    k, v = dd.popitem()
    assert k not in dd

Any code that was expecting a dictionary and uses popitem() as a means of looping over and consuming entries will fail.

No one should kid themselves that a default dictionary is a drop-in substitute. Much of the dict's API has an ambiguous meaning when applied to defaultdicts. If all keys are in-theory predefined, what is the meaning of len(dd)? Should dd.items() include any entries where the value is equal to the default or should the collection never store those? If the former, then how do you access the entries without looping over the whole contents? If the latter, then do you worry that "dd[v]=k" does not imply "(k,v) in dd.items()"?

Raymond
Raymond Hettinger wrote:
If you have a default value, you cannot ultimately del a key. This sequence is *not* a basic mapping invariant.
You believe that key deletion is not basic to mappings?
No, not in the sense that the key will go away through deletion. I view a mapping as a modifiable partial function. There is some initial key/value association (in a classic mapping, it is initially empty), and then there are modifications. Key deletion means to reset the key to the initial association.
Of course, the __del__ followed __contains__ sequence is not the only invariant that is thrown-off. There are plenty of examples. Here's one that is absolutely basic to the method's contract:
    k, v = dd.popitem()
    assert k not in dd
Any code that was expecting a dictionary and uses popitem() as a means of looping over and consuming entries will fail.
Well, code that loops over a dictionary using popitem typically terminates when the dictionary becomes false (or its length becomes zero). That code wouldn't be affected by the behaviour of "in".
No one should kid themselves that a default dictionary is a drop-in substitute. Much of the dict's API has an ambiguous meaning when applied to defaultdicts.
Right. But it is only ambiguous until specified. Of course, in the face of ambiguity, refuse the temptation to guess.
If all keys are in-theory predefined, what is the meaning of len(dd)?
Taking my definition from the beginning of the message, it is the number of keys that have been modified from the initial mapping.
Should dd.items() include any entries where the value is equal to the default or should the collection never store those?
It should include all modified items, and none of the unmodified ones. Explicitly assigning the default value still makes the entry modified; you need to del it to set it back to "unmodified".
If the former, then how do you access the entries without looping over the whole contents?
Not sure I understand the question. You use d[k] to access an entry. Regards, Martin
"Raymond Hettinger" <raymond.hettinger@verizon.net> wrote:
[Martin v. Löwis]
This kind of invariant doesn't take into account that there might be a default value.
Precisely. Therefore, a defaultdict subclass violates the Liskov Substitution Principle.
    class defaultdict(dict):
        def __getitem__(self, key):
            try:
                return dict.__getitem__(self, key)
            except KeyError:
                return self.on_missing(key)

        def on_missing(self, key):
            if not hasattr(self, 'default') or not callable(self.default):
                raise KeyError, key
            r = self[key] = self.default()
            return r

In my opinion, the above implementation as a subclass "does the right thing" in regards to __del__, __contains__, get, pop, popitem, __len__, has_key, and anything else I can think of. Does it violate the Liskov Substitution Principle? Yes, but only if user code relies on dd[key] raising a KeyError on a lack of a key. This can be easily remedied by removing the default when it is unneeded, at which point, you get your Liskov Substitution.
Of course, the __del__ followed __contains__ sequence is not the only invariant that is thrown-off. There are plenty of examples. Here's one that is absolutely basic to the method's contract:
    k, v = dd.popitem()
    assert k not in dd
Any code that was expecting a dictionary and uses popitem() as a means of looping over and consuming entries will fail.
    >>> a = defaultdict()
    >>> a.default = list
    >>> a['hello']
    []
    >>> k, v = a.popitem()
    >>> assert k not in a
Seems to work for the above implementation.
No one should kid themselves that a default dictionary is a drop-in substitute. Much of the dict's API has an ambiguous meaning when applied to defaultdicts.
Actually, if one is careful, the dict's API is completely unchanged, except for direct access to the object via b = a[i].
    >>> del a['hello']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: 'hello'
    >>> 'hello' in a
    False
    >>> a.get('hello')
    >>> a.pop('hello')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: 'pop(): dictionary is empty'
    >>> a.popitem()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: 'popitem(): dictionary is empty'
    >>> len(a)
    0
    >>> a.has_key('hello')
    False
If all keys are in-theory predefined, what is the meaning of len(dd)?
It depends on the sequence of actions. Play around with the above defaultdict implementation. From what I understood of Guido's original post, this is essentially what he was proposing, only implemented in C.
Should dd.items() include any entries where the value is equal to the default or should the collection never store those?
Yes, it should store any value which was stored via 'dd[k]=v', or any default value created via access by 'v=dd[k]' .
If the former, then how do you access the entries without looping over the whole contents?
Presumably one is looking for a single kind of default (empty list, 0, etc.) because one wanted to accumulate into them, similar to one of the following...

    for item, value in input:
        try:
            d[item] += value        # or d[item].append(value)
        except KeyError:
            d[item] = value         # or d[item] = [value]

which becomes

    for item in input:
        dd[item] += 1               # or dd[item].append(value)

Once accumulation has occurred, iteration over them via .iteritems(), .items(), .popitem(), etc., would progress exactly the same way as with a regular dictionary.

If the code which is using the accumulated data does things like...

    for key in wanted_keys:
        try:
            value = dd[key]
        except KeyError:
            continue
        # do something nontrivial with value

rather than...

    for key in wanted_keys:
        if key not in dd:
            continue
        value = dd[key]
        # do something nontrivial with value

then the user has at least three options to make it 'work right':

1. User can change to using 'in' to iterate rather than relying on a KeyError.
2. User could remember to remove the default.
3. User can create a copy of the default dictionary via dict(dd) and pass it into the code which relies on the non-defaulting dictionary.
If the latter, then do you worry that "dd[v]=k" does not imply "(k,v) in dd.items()"?
I personally wouldn't want the latter. My post probably hasn't convinced you, but much of the confusion, I believe, is based on Martin's original belief that 'k in dd' should always return true if there is a default. One can argue that way, but then you end up on the circular train of thought that gets you to "you can't do anything useful if that is the case, .popitem() doesn't work, len() is undefined, ...". Keep it simple, keep it sane. - Josiah
On 2/19/06, Josiah Carlson <jcarlson@uci.edu> wrote:
My post probably hasn't convinced you, but much of the confusion, I believe, is based on Martin's original belief that 'k in dd' should always return true if there is a default. One can argue that way, but then you end up on the circular train of thought that gets you to "you can't do anything useful if that is the case, .popitem() doesn't work, len() is undefined, ...". Keep it simple, keep it sane.
A default factory implementation fundamentally modifies the behavior of the mapping. There is no single answer to the question "what is the right behavior for contains, len, popitem" as that depends on what the code that consumes the mapping is written like, what it is attempting to do, and what you are attempting to override it to do. Or, simply, on why you are providing a default value. Resisting the temptation to guess the why and just leaving the methods as is seems the best choice; overriding __contains__ to return true is much easier than reversing that behavior would be.

An example when it could theoretically be used, if not particularly useful: the gettext.install() function was just updated to take a names parameter which controls which gettext accessor functions it adds to the builtin namespace. Its implementation looks for "method in names" to decide. Passing a default-true dict would allow the future behavior to bind all checked names, but only if __contains__ returns True. Even though it would make a poor base implementation, and these effects aren't a good candidate for it, the code style that could best leverage such a __contains__ exists.

Michael
--
Michael Urman http://www.tortall.net/mu/blog
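The always-true __contains__ style described above fits in a few lines (hypothetical class; the gettext names are only for illustration):

```python
class ContainsAllDict(dict):
    # Answers True for every membership test, so code written as
    # "if name in names: bind(name)" takes every branch.
    def __contains__(self, key):
        return True
    has_key = __contains__      # the 2.x spelling used in this thread

names = ContainsAllDict()
```

As noted, this overriding is easy to do on top of an ordinary dict, which is an argument for leaving the base methods alone.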
Michael Urman wrote:
On 2/19/06, Josiah Carlson <jcarlson@uci.edu> wrote:
My post probably hasn't convinced you, but much of the confusion, I believe, is based on Martin's original belief that 'k in dd' should always return true if there is a default. One can argue that way, but then you end up on the circular train of thought that gets you to "you can't do anything useful if that is the case, .popitem() doesn't work, len() is undefined, ...". Keep it simple, keep it sane.
A default factory implementation fundamentally modifies the behavior of the mapping. There is no single answer to the question "what is the right behavior for contains, len, popitem" as that depends on what the code that consumes the mapping is written like, what it is attempting to do, and what you are attempting to override it to do. Or, simply, on why you are providing a default value. Resisting the temptation to guess the why and just leaving the methods as is seems the best choice; overriding __contains__ to return true is much easier than reversing that behavior would be.
I agree that there is simply no universally correct answer for the various uses of default_factory. I think ambiguity on points like this is a sign that something is overly general.

In many of the concrete cases it is fairly clear how these methods should work. In the most obvious case (default_factory=list) what seems to me to be the correct implementation is one that no one is proposing, that is, "x in d" means "d.get(x)". But that uses the fact that the return value of default_factory() is a false value, which we cannot assume in general. And it affects .keys() -- which I would propose overriding for multidict (so it only returns keys with non-empty lists for values), but I don't see how it could be made correct for default_factory.

I just don't see why we should cram all these potential features into dict by using a vague feature like default_factory. Why can't we just add a half-dozen new types of collections (to the module of the same name)? Each one will get its own page of documentation, a name, a proper __repr__, and a well-defined meaning for all of these methods that it shares with dict only insofar as it makes sense to share.

Note that even if we use defaultdict or autodict or something besides changing dict itself, we still won't get a good __contains__, a good repr, or any of the other features that specific collection implementations will give us.

Isn't there anyone else who sees the various dict-like objects being passed around as recipes, and thinks that maybe that's a sign they should go in the stdlib? The best of those recipes aren't all-encompassing, they just do one kind of container well.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
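One of the dedicated containers suggested above might look like this minimal multidict sketch (the class name and the __contains__/keys() policy are illustrative, following the "x in d means d.get(x)" reading):

```python
class MultiDict(dict):
    # A dict-of-lists container done as its own type: "x in m" means
    # "m has at least one value for x", and keys() follows suit by
    # skipping keys whose list is empty.
    def add(self, key, value):
        self.setdefault(key, []).append(value)

    def __contains__(self, key):
        return bool(dict.get(self, key))

    def keys(self):
        return [k for k in dict.keys(self) if dict.get(self, k)]

m = MultiDict()
m.add('k', 1)
m.add('k', 2)
m['empty'] = []
```

Because the type is specific, the ambiguous questions (what does membership mean? what should keys() return?) each get one well-defined answer.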
"Michael Urman" <murman@gmail.com> wrote:
On 2/19/06, Josiah Carlson <jcarlson@uci.edu> wrote:
My post probably hasn't convinced you, but much of the confusion, I believe, is based on Martin's original belief that 'k in dd' should always return true if there is a default. One can argue that way, but then you end up on the circular train of thought that gets you to "you can't do anything useful if that is the case, .popitem() doesn't work, len() is undefined, ...". Keep it simple, keep it sane.
A default factory implementation fundamentally modifies the behavior of the mapping. There is no single answer to the question "what is the right behavior for contains, len, popitem" as that depends on what the code that consumes the mapping is written like, what it is attempting to do, and what you are attempting to override it to do. Or, simply, on why you are providing a default value. Resisting the temptation to guess the why and just leaving the methods as is seems the best choice; overriding __contains__ to return true is much easier than reversing that behavior would be.
I agree, there is nothing perfect. But at least in all of my use-cases, and the majority of the ones I've seen 'in the wild', my previous post provided an implementation that worked precisely like desired, and precisely like a regular dictionary, except when accessing a non-existent key via:

    value = dd[key]

__contains__, etc., all work exactly like they do with a non-defaulting dictionary. Iteration via popitem(), pop(key), items(), iteritems(), __iter__, etc., all work the way you would expect. The only nit is code which iterates like:

    for key in keys:
        try:
            value = dd[key]
        except KeyError:
            continue

(where 'keys' has nothing to do with dd.keys(), it is merely a listing of keys which are desired at this particular point)

However, the following works like it always did:

    for key in keys:
        if key not in dd:
            continue
        value = dd[key]
An example when it could theoretically be used, if not particularly useful. The gettext.install() function was just updated to take a names parameter which controls which gettext accessor functions it adds to the builtin namespace. Its implementation looks for "method in names" to decide. Passing a default-true dict would allow the future behavior to be bind all checked names, but only if __contains__ returns True.
Even though an always-true __contains__ would make a poor base implementation, and this example isn't a great candidate for it, code written in that style, which such a __contains__ could leverage, does exist.
Indeed, there are cases where an always-true __contains__ is useful, and the pure-Python implementation I previously posted can be easily modified to offer such a feature. However, because there are also use cases for the not-always-true __contains__, picking either as the "one true way" seems a bit unnecessary. Presumably, if one goes into the collections module, the other will too. Actually, they could share all of their code except for a simple flag which determines whether __contains__ is always true. With minor work, that 'flag', or really the single bit it would require, may even be embeddable into the type object. Arguably, there should be a handful of these defaulting dictionary-like objects, and for each variant, it should be documented what their use-cases are, and any gotchas that will inevitably come up. - Josiah
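A sketch of the single-flag idea above (the names are illustrative): the same class can report containment either honestly or always-true, sharing all other code.

```python
class FlaggedDefaultDict(dict):
    """Sketch: one flag selects honest vs. always-true __contains__."""

    def __init__(self, factory, contains_all=False):
        dict.__init__(self)
        self.factory = factory
        self.contains_all = contains_all

    def __getitem__(self, key):
        # Use dict's containment explicitly, since ours may lie.
        if not dict.__contains__(self, key):
            dict.__setitem__(self, key, self.factory())
        return dict.__getitem__(self, key)

    def __contains__(self, key):
        return self.contains_all or dict.__contains__(self, key)

honest = FlaggedDefaultDict(list)
always = FlaggedDefaultDict(list, contains_all=True)
print('x' in honest)  # False
print('x' in always)  # True, even though nothing was inserted
```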
On Sun, Feb 19, 2006, Josiah Carlson wrote:
I agree, there is nothing perfect. But at least in all of my use-cases, and the majority of the ones I've seen 'in the wild', my previous post provided an implementation that worked precisely as desired, and precisely like a regular dictionary, except when accessing a non-existent key via value = dd[key]. __contains__, etc., all work exactly like they do with a non-defaulting dictionary. Iteration via popitem(), pop(key), items(), iteritems(), __iter__, etc., all work the way you would expect them to.
This is the telling point, IMO. My company makes heavy use of a "default dict" (actually, it's a "default class" because using constants as the lookup keys is mostly what we do and the convenience of foo.bar is compelling over foo['bar']). Anyway, our semantics are as Josiah outlines, and I can't see much use case for the alternatives. Those of you arguing something different: do you have a real use case (that you've implemented in real code)? -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
On 2/20/06, Aahz <aahz@pythoncraft.com> wrote:
On Sun, Feb 19, 2006, Josiah Carlson wrote:
I agree, there is nothing perfect. But at least in all of my use-cases, and the majority of the ones I've seen 'in the wild', my previous post provided an implementation that worked precisely as desired, and precisely like a regular dictionary, except when accessing a non-existent key via value = dd[key]. __contains__, etc., all work exactly like they do with a non-defaulting dictionary. Iteration via popitem(), pop(key), items(), iteritems(), __iter__, etc., all work the way you would expect them to.
This is the telling point, IMO. My company makes heavy use of a "default dict" (actually, it's a "default class" because using constants as the lookup keys is mostly what we do and the convenience of foo.bar is compelling over foo['bar']). Anyway, our semantics are as Josiah outlines, and I can't see much use case for the alternatives.
Can you say, for the record (since nobody else seems to care), if d.getorset(key, func) would work in your use cases?
Those of you arguing something different: do you have a real use case (that you've implemented in real code)?
(again, for the record) getorset provides the minimum needed functionality in a clean and intuitive way. Why go for a complicated solution when you simply don't need it? -- Adam Olsen, aka Rhamphoryncus
On 2/20/06, Josiah Carlson <jcarlson@uci.edu> wrote:
"Adam Olsen" <rhamph@gmail.com> wrote:
Can you say, for the record (since nobody else seems to care), if d.getorset(key, func) would work in your use cases?
It doesn't work for the multiset/accumulation case:
dd[key] += 1
This is actually a fairly powerful argument for a subclass that redefines __getitem__ in favor of a new dict method. (Not to mention that it's much easier to pick a name for the subclass than for the method. :-) See the new thread I started. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Mon, Feb 20, 2006, Adam Olsen wrote:
On 2/20/06, Aahz <aahz@pythoncraft.com> wrote:
On Sun, Feb 19, 2006, Josiah Carlson wrote:
I agree, there is nothing perfect. But at least in all of my use-cases, and the majority of the ones I've seen 'in the wild', my previous post provided an implementation that worked precisely as desired, and precisely like a regular dictionary, except when accessing a non-existent key via value = dd[key]. __contains__, etc., all work exactly like they do with a non-defaulting dictionary. Iteration via popitem(), pop(key), items(), iteritems(), __iter__, etc., all work the way you would expect them to.
This is the telling point, IMO. My company makes heavy use of a "default dict" (actually, it's a "default class" because using constants as the lookup keys is mostly what we do and the convenience of foo.bar is compelling over foo['bar']). Anyway, our semantics are as Josiah outlines, and I can't see much use case for the alternatives.
Can you say, for the record (since nobody else seems to care), if d.getorset(key, func) would work in your use cases?
Because I haven't been reading this thread all that closely, you'll have to remind me what this means.
Those of you arguing something different: do you have a real use case (that you've implemented in real code)?
(again, for the record) getorset provides the minimum needed functionality in a clean and intuitive way. Why go for a complicated solution when you simply don't need it?
Ditto above. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
On Feb 20, 2006, at 12:38 PM, Aahz wrote: ...
Can you say, for the record (since nobody else seems to care), if d.getorset(key, func) would work in your use cases?
Because I haven't been reading this thread all that closely, you'll have to remind me what this means.
Roughly the same (save for method/function difference) as:

    def getorset(d, key, func):
        if key not in d:
            d[key] = func()
        return d[key]

Alex
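With Alex's helper above, the original multimap idiom collapses to one line per append; a quick demonstration:

```python
def getorset(d, key, func):
    """If key is absent, insert func() first; either way return d[key]."""
    if key not in d:
        d[key] = func()
    return d[key]

d = {}
getorset(d, 'k', list).append(1)
getorset(d, 'k', list).append(2)  # second call finds the existing list
print(d)  # {'k': [1, 2]}
```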
On Mon, Feb 20, 2006, Alex Martelli wrote:
On Feb 20, 2006, at 12:38 PM, Aahz wrote: ...
Can you say, for the record (since nobody else seems to care), if d.getorset(key, func) would work in your use cases?
Because I haven't been reading this thread all that closely, you'll have to remind me what this means.
Roughly the same (save for method/function difference) as:
    def getorset(d, key, func):
        if key not in d:
            d[key] = func()
        return d[key]
That has the problem of looking clumsy, and doubly so for our use case where it's an attribute-based dict. Our style relies on the clean look of code like this:

    if order.street:
        ...

Even as a dict, that doesn't look horrible:

    if order['street']:
        ...

OTOH, this starts looking ugly:

    if order.get('street'):
        ...

And this is just plain bad:

    if getattr(order, 'street'):
        ...

Note that because we have to deal with *both* the possibility that the attribute/key may not be there *and* that it might be blank -- but both are semantically equivalent for our application -- there's no other clean coding style. Now, I realize this is different from the "primary use case" for needing mutable values, but any proposed default dict solution that doesn't cleanly support my use case is less interesting to me. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
On Thu, 16 Feb 2006 13:11:49 -0800, Guido van Rossum <guido@python.org> wrote:
A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):
    if key not in d:
        d[key] = []
    d[key].append(value)
An alternative way to spell this uses setdefault(), but it's not very readable:
d.setdefault(key, []).append(value)
and it also suffers from creating an unnecessary list instance. (Timings were inconclusive; the approaches are within 5-10% of each other in speed.)
My conclusion is that setdefault() is a failure -- it was a well-intentioned construct, but doesn't actually create more readable code.
Google has an internal data type called a DefaultDict which gets passed a default value upon construction. Its __getitem__ method, instead of raising KeyError, inserts a shallow copy (!) of the given default value into the dict when the value is not found. So the above code, after
d = DefaultDict([])
can be written as simply
d[key].append(value)
Wouldn't it be more generally powerful to pass a type or factory function to use to instantiate a default object when a missing key is encountered? E.g.

    d = DefaultDict(list)

then

    d[key].append(value)

but then you can also do

    d = DefaultDict(dict)
    d[key].update(a=123)

or

    class Foo(object):
        pass

    d = DefaultDict(Foo)
    d[key].phone = '415-555-1212'

etc. No worries about generalizing shallow copying either ;-)
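A minimal sketch of the factory-based variant Bengt describes (this is illustrative, not Google's internal class): the constructor takes any callable, and __getitem__ calls it to build and insert a fresh value per missing key.

```python
class DefaultDict(dict):
    """Sketch: default values built by a factory callable, not copied."""

    def __init__(self, factory, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self.factory = factory

    def __getitem__(self, key):
        if key not in self:
            self[key] = self.factory()  # fresh object each time
        return dict.__getitem__(self, key)

d = DefaultDict(list)
d['k'].append(1)

e = DefaultDict(dict)
e['k'].update(a=123)

class Foo(object):
    pass

f = DefaultDict(Foo)
f['k'].phone = '415-555-1212'
print(d['k'], e['k'], f['k'].phone)  # [1] {'a': 123} 415-555-1212
```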
Note that of all the possible semantics for __getitem__ that could have produced similar results (e.g. not inserting the default in the underlying dict, or not copying the default value), the chosen semantics are the only ones that makes this example work.
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Some more design subtleties:
- "key in d" still returns False if the key isn't there
- "d.get(key)" still returns None if the key isn't there
- "d.default" should be a read-only attribute giving the default value
Feedback?
See above. Regards, Bengt Richter
Guido van Rossum wrote:
d = DefaultDict([])
can be written as simply
d[key].append(value)
Feedback?
Probably a good idea, has been proposed multiple times on clpy. One good thing would be to be able to specify either a default value or a factory function. While at it, other interesting dict subclasses could be:

- sorteddict, practically reinvented by every larger project
- keytransformdict, such as d = keytransformdict(str.lower)

Georg
Guido van Rossum wrote:
d = DefaultDict([])
can be written as simply
d[key].append(value)
Feedback?
Ok, setdefault is a horrible name. Would it make sense to come up with a better name? Georg Brandl wrote:
Probably a good idea, has been proposed multiple times on clpy. One good thing would be to be able to specify either a default value or a factory function.
While at it, other interesting dict subclasses could be: * sorteddict, practically reinvented by every larger project
You mean ordereddict, not sorteddict, I hope.
* keytransformdict, such as d = keytransformdict(str.lower).
Not sure what you mean by that. What *I* would like is probably more ambitious: I want a dict that allows case-insensitive lookup of string keys, plus ideally I want to use it as class or instance dictionary. Use case: COM wrappers. Thomas
Thomas Heller wrote:
Probably a good idea, has been proposed multiple times on clpy. One good thing would be to be able to specify either a default value or a factory function.
While at it, other interesting dict subclasses could be: * sorteddict, practically reinvented by every larger project
You mean ordereddict, not sorteddict, I hope.
Well, yes.
* keytransformdict, such as d = keytransformdict(str.lower).
Not sure what you mean by that.
What *I* would like is probably more ambitious: I want a dict that allows case-insensitive lookup of string keys
This is exactly what this would do. All keys are transformed to lowercase when setting and looking up.
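A sketch of the keytransformdict Georg describes (hypothetical class, not an existing stdlib type): every key passes through the transform on the way in and on lookup, so str.lower gives case-insensitive access.

```python
class keytransformdict(dict):
    """Sketch: normalize every key with a transform function."""

    def __init__(self, transform):
        dict.__init__(self)
        self._transform = transform

    def __setitem__(self, key, value):
        dict.__setitem__(self, self._transform(key), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, self._transform(key))

    def __contains__(self, key):
        return dict.__contains__(self, self._transform(key))

d = keytransformdict(str.lower)
d['Content-Type'] = 'text/html'
print(d['content-type'])    # 'text/html'
print('CONTENT-TYPE' in d)  # True
```

A fuller version would also route get(), pop(), __delitem__(), and update() through the transform; the sketch covers only the operations shown.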
plus ideally I want to use it as class or instance dictionary. Use case: COM wrappers.
regards, Georg
At 10:10 AM 02/17/2006 +0100, Georg Brandl wrote:
Guido van Rossum wrote:
d = DefaultDict([])
can be written as simply
d[key].append(value)
Feedback?
Probably a good idea, has been proposed multiple times on clpy. One good thing would be to be able to specify either a default value or a factory function.
+1 on factory function, e.g. "DefaultDict(list)". A default value isn't very useful, because for immutable defaults, setdefault() works well enough. If what you want is a copy of some starting object, you can always do something like DefaultDict({1:2,3:4}.copy).
Phillip J. Eby wrote:
At 10:10 AM 02/17/2006 +0100, Georg Brandl wrote:
Guido van Rossum wrote:
d = DefaultDict([])
can be written as simply
d[key].append(value) Feedback? Probably a good idea, has been proposed multiple times on clpy. One good thing would be to be able to specify either a default value or a factory function.
+1 on factory function, e.g. "DefaultDict(list)". A default value isn't very useful, because for immutable defaults, setdefault() works well enough. If what you want is a copy of some starting object, you can always do something like DefaultDict({1:2,3:4}.copy).
+1 here, too (for permitting a factory function only). This doesn't really limit usage, as you can still supply DefaultDict(partial(copy, x)) or DefaultDict(partial(deepcopy, x)), or (heaven forbid) a lambda expression. . . As others have mentioned, the basic types are all easy, since the typename can be used directly. +1 on supplying that factory function to the constructor, too (the default value is a fundamental part of the defaultdict). That is, I'd prefer:

    d = defaultdict(func)
    # The defaultdict is fully defined, but not yet populated
    d.update(init_values)

over:

    d = defaultdict(init_values)
    # The defaultdict is partially populated, but not yet fully defined!
    d.default(func)

That is, something that is the same as the normal dict except for:

    def __init__(self, default):
        self.default = default

    def __getitem__(self, key):
        return self.get(key, self.default())

Considering some of Raymond's questions in light of the above:
* implications of a __getitem__ succeeding while get(value, x) returns x (possibly different from the overall default) * implications of a __getitem__ succeeding while __contains__ would fail
These behaviours seem reasonable for a default dictionary - "containment" is based on whether or not the key actually exists in the dictionary as it currently stands, and the default is really a "default default" that can be overridden using 'get'.
* whether to add this to the collections module (I would say yes) * whether to allow default functions as well as default values (so you could instantiate a new default list)
My preference is for factory functions only, to eliminate ambiguity.

    # bag like behavior
    dd = collections.default_dict(int)
    for elem in collection:
        dd[elem] += 1

    # setdefault-like behavior
    dd = collections.default_dict(list)
    for page_number, page in enumerate(book):
        for word in page.split():
            dd[word].append(word)

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Guido van Rossum wrote:
A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):
    if key not in d:
        d[key] = []
    d[key].append(value)
An alternative way to spell this uses setdefault(), but it's not very readable:
d.setdefault(key, []).append(value)
and it also suffers from creating an unnecessary list instance. (Timings were inconclusive; the approaches are within 5-10% of each other in speed.)
My conclusion is that setdefault() is a failure -- it was a well-intentioned construct, but doesn't actually create more readable code.
Google has an internal data type called a DefaultDict which gets passed a default value upon construction. Its __getitem__ method, instead of raising KeyError, inserts a shallow copy (!) of the given default value into the dict when the value is not found. So the above code, after
d = DefaultDict([])
can be written as simply
d[key].append(value)
Using a shallow copy of the default seems a bit too magical to me. How would this be done? Via copy.copy? And passing [] to the constructor of dict has a different meaning already. Fetching the default via a static/class method would solve both problems:

    class default_dict(dict):
        def __getitem__(self, key):
            if key in self:
                return dict.__getitem__(self, key)
            else:
                default = self.getdefault()
                self[key] = default
                return default

    class multi_map(default_dict):
        @staticmethod
        def getdefault():
            return []

    class counting_dict(default_dict):
        @staticmethod
        def getdefault():
            return 0
[...]
Bye, Walter Dörwald
Guido van Rossum wrote:
A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):
    if key not in d:
        d[key] = []
    d[key].append(value)
/.../
Feedback?
+1. check it in, already (as collections.defaultdict, perhaps?) alternatively, you could specialize even further: collections.multimap, which deals with list values only (that shallow copy thing feels a bit questionable, but all alternatives feel slightly overgeneralized...) </F>
On 2/16/06, Guido van Rossum <guido@python.org> wrote:
A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):
    if key not in d:
        d[key] = []
    d[key].append(value)
An alternative way to spell this uses setdefault(), but it's not very readable:
d.setdefault(key, []).append(value)
I'd like to see it done passing a factory function (and with a better name): d.getorset(key, list).append(value) The name is slightly odd but it is effective. Plus it avoids creating a new class when a slight tweak to an existing one will do.
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
-1 (at least until you can explain why that's better than .getorset()) -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
-1 (at least until you can explain why that's better than .getorset())
Because the "default default" is a fundamental characteristic of the default dictionary (meaning it works with normal indexing syntax), whereas "getorset" makes it a characteristic of the method call. Besides, if there are going to be any method changes on normal dicts, I'd rather see a boolean third argument "set" to the get method. That is (for a normal dict):

    def get(self, key, *args):
        set = False
        no_default = False
        if len(args) == 2:
            default, set = args
        elif args:
            default, = args
        else:
            no_default = True
        if key in self:
            return self[key]
        if no_default:
            raise KeyError(repr(key))
        if set:
            self[key] = default
        return default

Using Guido's original example:

    d.get(key, [], True).append(value)

I don't really think this is a replacement for defaultdict, though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Fredrik Lundh wrote:
Nick Coghlan wrote:
Using Guido's original example:
d.get(key, [], True).append(value)
hmm. are you sure you didn't just reinvent setdefault ?
I'm reasonably sure I copied it on purpose, only with a name that isn't 100% misleading as to what it does ;) I think collections.defaultdict is a better approach, though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
d.get(key, [], True).append(value)
hmm. are you sure you didn't just reinvent setdefault ?
I'm reasonably sure I copied it on purpose, only with a name that isn't 100% misleading as to what it does ;)
Heh. From the original Usenet posting that suggested the capability that was added in the form of "setdefault":

| I suggest a minor change: another optional argument to
| "get" so that
|
|     dict.get(item, default, flag)
|
| is equivalent to
|
|     if dict.has_key(item):
|         VALUE IS dict[item]
|     else:
|         if flag: dict[item] = default   <-- This is all that's new
|         VALUE IS default
|
| but presumably more efficient.

The response was a chorus of people saying "Not a bad idea, but that flag sucks. It needs a separate method." :-) -- g
On Thu, Feb 16, 2006 at 01:11:49PM -0800, Guido van Rossum wrote: [snip]
Google has an internal data type called a DefaultDict which gets passed a default value upon construction. Its __getitem__ method, instead of raising KeyError, inserts a shallow copy (!) of the given default value into the dict when the value is not found. So the above code, after
d = DefaultDict([])
can be written as simply
d[key].append(value)
Note that of all the possible semantics for __getitem__ that could have produced similar results (e.g. not inserting the default in the underlying dict, or not copying the default value), the chosen semantics are the only ones that makes this example work.
Having __getitem__ insert the returned default value allows it to work with a larger variety of classes. My own ForgivingDict does not do this and works fine for ints and lists but not much else.

    fd = ForgivingDict(list)
    fd[key] += [val]  # extends the list and does a __setitem__

The += operator isn't useful for dicts. How can you make a defaultdict with a defaultdict as the default? My head asploded when I tried it with the constructor arg. It does seem possible with the 'd.default = func' syntax:

    # empty defaultdict constructor
    d = defaultdict()
    d.default = d

    tree = defaultdict()
    tree.default = d.copy

-jackdied
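Jack's nesting question has a clean answer once the default is an arbitrary factory: let the factory build another defaulting dict. A sketch using the collections.defaultdict that eventually shipped (so this illustrates the idea with modern Python, not something available in the thread's timeframe):

```python
from collections import defaultdict

def tree():
    # The factory returns another defaultdict with the same factory,
    # so missing keys autovivify to any depth.
    return defaultdict(tree)

t = tree()
t['a']['b']['c'] = 1  # intermediate dicts spring into existence
print(t['a']['b']['c'])  # 1
```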
+1 It's about time! - C On 2/16/06, Guido van Rossum <guido@python.org> wrote:
A bunch of Googlers were discussing the best way of doing the following (a common idiom when maintaining a dict of lists of values relating to a key, sometimes called a multimap):
    if key not in d:
        d[key] = []
    d[key].append(value)
An alternative way to spell this uses setdefault(), but it's not very readable:
d.setdefault(key, []).append(value)
and it also suffers from creating an unnecessary list instance. (Timings were inconclusive; the approaches are within 5-10% of each other in speed.)
My conclusion is that setdefault() is a failure -- it was a well-intentioned construct, but doesn't actually create more readable code.
Google has an internal data type called a DefaultDict which gets passed a default value upon construction. Its __getitem__ method, instead of raising KeyError, inserts a shallow copy (!) of the given default value into the dict when the value is not found. So the above code, after
d = DefaultDict([])
can be written as simply
d[key].append(value)
Note that of all the possible semantics for __getitem__ that could have produced similar results (e.g. not inserting the default in the underlying dict, or not copying the default value), the chosen semantics are the only ones that makes this example work.
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Some more design subtleties:
- "key in d" still returns False if the key isn't there
- "d.get(key)" still returns None if the key isn't there
- "d.default" should be a read-only attribute giving the default value
Feedback?
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
-- "A programmer learning programming from Perl is like a chemistry student learning the definition of 'exothermic' with dynamite." - evilpenguin
On 2/16/06, Guido van Rossum <guido@python.org> wrote:
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Thanks for all the constructive feedback. Here are some responses and a new proposal.

- Yes, I'd like to kill setdefault() in 3.0 if not sooner.

- It would indeed be nice if this was an optional feature of the standard dict type.

- I'm ignoring the request for other features (ordering, key transforms). If you want one of these, write a PEP!

- Many, many people suggested to use a factory function instead of a default value. This is indeed a much better idea (although slightly more cumbersome for the simplest cases).

- Some people seem to think that a subclass constructor signature must match the base class constructor signature. That's not so. The subclass constructor must just be careful to call the base class constructor with the correct arguments. Think of the subclass constructor as a factory function.

- There's a fundamental difference between associating the default value with the dict object, and associating it with the call. So proposals to invent a better name/signature for setdefault() don't compete. (As to one specific such proposal, adding an optional bool as the 3rd argument to get(), I believe I've explained enough times in the past that flag-like arguments that always get a constant passed in at the call site are a bad idea and should usually be refactored into two separate methods.)

- The inconsistency introduced by __getitem__() returning a value for keys while get(), __contains__(), and keys() etc. don't show it, cannot be resolved usefully. You'll just have to live with it. Modifying get() to do the same thing as __getitem__() doesn't seem useful -- it just takes away a potentially useful operation.

So here's a new proposal. Let's add a generic missing-key handling method to the dict class, as well as a default_factory slot initialized to None.
The implementation is like this (but in C):

    def on_missing(self, key):
        if self.default_factory is not None:
            value = self.default_factory()
            self[key] = value
            return value
        raise KeyError(key)

When __getitem__() (and *only* __getitem__()) finds that the requested key is not present in the dict, it calls self.on_missing(key) and returns whatever it returns -- or raises whatever it raises. __getitem__() doesn't need to raise KeyError any more, that's done by on_missing(). The on_missing() method can be overridden to implement any semantics you want when the key isn't found: return a value without inserting it, insert a value without copying it, only do it for certain key types/values, make the default incorporate the key, etc. But the default implementation is designed so that we can write

    d = {}
    d.default_factory = list

to create a dict that inserts a new list whenever a key is not found in __getitem__(), which is most useful in the original use case: implementing a multiset so that one can write

    d[key].append(value)

to add a new key/value to the multiset without having to handle the case separately where the key isn't in the dict yet. This also works for sets instead of lists:

    d = {}
    d.default_factory = set
    ...
    d[key].add(value)

I went through several iterations to obtain this design; my first version of on_missing() would just raise KeyError(key), requiring you to always provide a subclass; this is more minimalistic but less useful and would probably raise the bar for using the feature to some extent. To save you attempts to simplify this, here are some near-misses I considered that didn't quite work out:

    def on_missing(self, key):
        if self.default_factory is not None:
            return self.default_factory()
        raise KeyError(key)

This would require the multiset example to subclass, since default_factory doesn't see the key so it can't insert it.
    def on_missing(self, key):
        if self.default_factory is not None:
            return self.default_factory(key)
        raise KeyError(key)

This appears to fix that problem, but now you can't write "d.default_factory = list" since (a) list(key) doesn't return an empty list, and (b) it also doesn't insert the key into the dict; attempting to assign a callback function to default_factory that solves these issues fails because the callback doesn't have access to the dict instance (unless there's only one).

- Do away with on_missing() and just include its body at the end of __getitem__(), to be invoked when the key isn't found. This is less general in case you want different default semantics (e.g. not inserting the default, or making the default a function of the key) -- you'd have to override __getitem__() for that, which means you'd be paying overhead even for keys that *are* present.

I'll try to cook up an implementation on SF after I've dug myself out of the day's email barrage. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
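A pure-Python sketch of the proposal above, for readers who want to try it out (the class name is made up; in the dict that eventually shipped, this hook became __missing__ and collections.defaultdict takes the factory as a constructor argument):

```python
class OnMissingDict(dict):
    """Sketch of Guido's proposal: __getitem__ defers to on_missing()."""

    default_factory = None  # slot initialized to None

    def on_missing(self, key):
        if self.default_factory is not None:
            value = self.default_factory()
            self[key] = value  # insert so d[key].append(v) mutates the dict
            return value
        raise KeyError(key)

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

d = OnMissingDict()
d.default_factory = list
d['k'].append(1)
print(d)  # {'k': [1]}

s = OnMissingDict()
s.default_factory = set
s['k'].add(2)
print(s['k'])  # {2}
```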
On Friday 17 February 2006 14:09, Guido van Rossum wrote:
So here's a new proposal.
I like the version you came up with. It has sufficient functionality to make it easy to use, and enough flexibility to be useful in more specialized cases. I'm quite certain it would handle all the cases I've actually dealt with where I wanted a variation of a mapping with default values. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org>
Guido van Rossum wrote:
So here's a new proposal.
Let's add a generic missing-key handling method to the dict class, as well as a default_factory slot initialized to None. The implementation is like this (but in C):
    def on_missing(self, key):
        if self.default_factory is not None:
            value = self.default_factory()
            self[key] = value
            return value
        raise KeyError(key)
When __getitem__() (and *only* __getitem__()) finds that the requested key is not present in the dict, it calls self.on_missing(key) and returns whatever it returns -- or raises whatever it raises. __getitem__() doesn't need to raise KeyError any more, that's done by on_missing().
Will this also work when PyDict_GetItem() does not find the key? Thomas
Guido van Rossum wrote:
    d = {}
    d.default_factory = set
    ...
    d[key].add(value)
Another option would be:

    d = {}
    d.default_factory = set
    d.get_default(key).add(value)

Unlike .setdefault, this would use a factory associated with the dictionary, and no default value would get passed in. Unlike the proposal, this would not override __getitem__ (not overriding __getitem__ is really the only difference with the proposal). It would be clear reading the code that you were not implicitly asserting that "key in d" was true. "get_default" isn't the best name, but another name isn't jumping out at me at the moment. Of course, it is not a Pythonic argument to say that an existing method should be overridden, or functionality made nameless, simply because we can't think of a name (looking to anonymous functions of course ;) -- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
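A sketch of Ian's alternative (hypothetical class, and get_default is his placeholder name): __getitem__ is left alone, and the defaulting behavior lives behind an explicitly named method.

```python
class FactoryDict(dict):
    """Sketch: defaulting only via an explicit method, never via d[key]."""

    def __init__(self, default_factory=None):
        dict.__init__(self)
        self.default_factory = default_factory

    def get_default(self, key):
        if key not in self:
            self[key] = self.default_factory()
        return self[key]

d = FactoryDict(set)
d.get_default('k').add(1)
print(d['k'])     # {1}
print('x' in d)   # False: plain lookup and containment stay honest
```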
On 2/17/06, Ian Bicking <ianb@colorstudy.com> wrote:
Guido van Rossum wrote:
d = {}
d.default_factory = set
...
d[key].add(value)
Another option would be:
d = {}
d.default_factory = set
d.get_default(key).add(value)
Unlike .setdefault, this would use a factory associated with the dictionary, and no default value would get passed in. Unlike the proposal, this would not override __getitem__ (not overriding __getitem__ is really the only difference with the proposal). It would be clear reading the code that you were not implicitly asserting that "key in d" was true.
"get_default" isn't the best name, but another name isn't jumping out at me at the moment. Of course, it is not a Pythonic argument to say that an existing method should be overridden, or functionality made nameless simply because we can't think of a name (looking to anonymous functions of course ;)
I'm torn. While trying to implement this I came across some ugliness in PyDict_GetItem() -- it would make sense if this also called on_missing(), but it must return a value without incrementing its refcount, and isn't supposed to raise exceptions -- so what to do if on_missing() returns a value that's not inserted in the dict? If the __getattr__()-like operation that supplies and inserts a dynamic default was a separate method, we wouldn't have this problem. OTOH most reviewers here seem to appreciate on_missing() as a way to do various other ways of altering a dict's __getitem__() behavior behind a caller's back -- perhaps it could even be (ab)used to implement case-insensitive lookup. I'm not going to do a point-by-point to your longer post (I don't have the time); let's (again) agree to disagree and I'll sleep on it. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
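The case-insensitive (ab)use mentioned here can be sketched as follows, assuming the on_missing() hook from the proposal existed on dict (the __getitem__ override below stands in for the C-level dispatch, and the class name is made up for illustration):

```python
class CaseInsensitiveDict(dict):
    """Sketch of (ab)using on_missing() for case-insensitive lookup."""
    def on_missing(self, key):
        # On a miss, retry with the lower-cased key before giving up.
        if isinstance(key, str):
            folded = key.lower()
            if folded != key and dict.__contains__(self, folded):
                return dict.__getitem__(self, folded)
        raise KeyError(key)

    def __getitem__(self, key):  # stand-in for the proposed C-level hook
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

d = CaseInsensitiveDict(spam=1)
assert d['SPAM'] == 1
```

Unlike the default_factory case, this variant never mutates the dict, which sidesteps the PyDict_GetItem() refcount problem described above.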
Guido van Rossum wrote:
I'm torn. While trying to implement this I came across some ugliness in PyDict_GetItem() -- it would make sense if this also called on_missing(), but it must return a value without incrementing its refcount, and isn't supposed to raise exceptions -- so what to do if on_missing() returns a value that's not inserted in the dict?
I think there should be a guideline to use PyObject_GetItem/PyMapping_GetItemString "normally", i.e. in all cases where you would write d[k] in Python code. It should be considered a bug if PyDict_GetItem is used in a place that "should" invoke defaulting; IOW, the function should be reserved to really low-level cases (e.g. if it is known that the dict doesn't have any defaulting, e.g. the string interned dictionary). There should be a policy whether name-lookup invokes defaulting (i.e. __dict__ access); I think it should. This would cause __getattr__ to have no effect if the object's dictionary has a default factory (unless that raises a KeyError). Regards, Martin
Guido van Rossum wrote:
On 2/17/06, Ian Bicking <ianb@colorstudy.com> wrote:
Guido van Rossum wrote:
d = {}
d.default_factory = set
...
d[key].add(value)
Another option would be:
d = {}
d.default_factory = set
d.get_default(key).add(value)
Unlike .setdefault, this would use a factory associated with the dictionary, and no default value would get passed in. Unlike the proposal, this would not override __getitem__ (not overriding __getitem__ is really the only difference with the proposal). It would be clear reading the code that you were not implicitly asserting that "key in d" was true.
"get_default" isn't the best name, but another name isn't jumping out at me at the moment. Of course, it is not a Pythonic argument to say that an existing method should be overridden, or functionality made nameless simply because we can't think of a name (looking to anonymous functions of course ;)
I'm torn. While trying to implement this I came across some ugliness in PyDict_GetItem() -- it would make sense if this also called on_missing(), but it must return a value without incrementing its refcount, and isn't supposed to raise exceptions -- so what to do if on_missing() returns a value that's not inserted in the dict?
If the __getattr__()-like operation that supplies and inserts a dynamic default was a separate method, we wouldn't have this problem.
OTOH most reviewers here seem to appreciate on_missing() as a way to do various other ways of altering a dict's __getitem__() behavior behind a caller's back -- perhaps it could even be (ab)used to implement case-insensitive lookup.
I don't like the fact that on_missing()/default_factory can change the behaviour of __getitem__, which up to now has been something simple and understandable. Why don't we put the on_missing()/default_factory functionality into get() instead?

d.get(key, default) does what it did before.
d.get(key) invokes on_missing() (and dict would have default_factory == type(None))

Bye, Walter Dörwald
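Walter's variant can be sketched as a subclass too (the subclass packaging and the sentinel are assumptions of this sketch): the two-argument get() is unchanged, the one-argument form consults on_missing(), and a plain dict's factory of type(None) means d.get(key) still yields None.

```python
class GetDefaultsDict(dict):
    """Hypothetical sketch of Walter's get()-based proposal."""
    default_factory = type(None)  # so d.get(key) still yields None
    _MISSING = object()

    def on_missing(self, key):
        value = self.default_factory()
        self[key] = value
        return value

    def get(self, key, default=_MISSING):
        if default is not GetDefaultsDict._MISSING:
            return dict.get(self, key, default)  # two-arg form unchanged
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

d = GetDefaultsDict()
d.default_factory = list
d.get('k').append(1)       # defaulting happens via get(), not d[key]
assert d['k'] == [1]
```

One wrinkle faithful to the on_missing() semantics: even the type(None) factory inserts the key, so a bare d.get(key) on a missing key leaves a None entry behind.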
On Sat, 18 Feb 2006 10:44:15 +0100 (CET), Walter Dörwald <walter@livinglogic.de> wrote:
Guido van Rossum wrote:
On 2/17/06, Ian Bicking <ianb@colorstudy.com> wrote:
Guido van Rossum wrote:
d = {}
d.default_factory = set
...
d[key].add(value)
Another option would be:
d = {}
d.default_factory = set
d.get_default(key).add(value)
Unlike .setdefault, this would use a factory associated with the dictionary, and no default value would get passed in. Unlike the proposal, this would not override __getitem__ (not overriding __getitem__ is really the only difference with the proposal). It would be clear reading the code that you were not implicitly asserting that "key in d" was true.
"get_default" isn't the best name, but another name isn't jumping out at me at the moment. Of course, it is not a Pythonic argument to say that an existing method should be overridden, or functionality made nameless simply because we can't think of a name (looking to anonymous functions of course ;)
I'm torn. While trying to implement this I came across some ugliness in PyDict_GetItem() -- it would make sense if this also called on_missing(), but it must return a value without incrementing its refcount, and isn't supposed to raise exceptions -- so what to do if on_missing() returns a value that's not inserted in the dict?
If the __getattr__()-like operation that supplies and inserts a dynamic default was a separate method, we wouldn't have this problem.
OTOH most reviewers here seem to appreciate on_missing() as a way to do various other ways of altering a dict's __getitem__() behavior behind a caller's back -- perhaps it could even be (ab)used to implement case-insensitive lookup.
I don't like the fact that on_missing()/default_factory can change the behaviour of __getitem__, which up to now has been something simple and understandable. Why don't we put the on_missing()/default_factory functionality into get() instead?
d.get(key, default) does what it did before.
d.get(key) invokes on_missing() (and dict would have default_factory == type(None))
OTOH, I forgot why it was desirable in the first place to overload d[k] with defaulting logic. E.g., why wouldn't d.defaulting[k] be ok to write when you want the d.default_factory action? on_missing feels more like a tracing hook though, so maybe it could always act either way if defined. Also, for those wanting to avoid lambda:42 as factory, would a callable test cost a lot? Of course then the default_factory name might require revision. Regards, Bengt Richter
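Bengt's d.defaulting[k] idea can be sketched as a small view object (the names _DefaultingView and defaulting are made up for illustration): plain d[k] stays strict, and only lookups routed through the view trigger the factory.

```python
class _DefaultingView(object):
    """Hypothetical sketch of a separate defaulting view over a dict."""
    def __init__(self, d, factory):
        self._d = d
        self._factory = factory

    def __getitem__(self, key):
        try:
            return self._d[key]
        except KeyError:
            # Insert and return the default only via the view.
            value = self._d[key] = self._factory()
            return value

d = {}
defaulting = _DefaultingView(d, set)
defaulting['k'].add(1)
assert d == {'k': {1}}
assert 'other' not in d  # plain lookups and membership stay strict
```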
"Guido van Rossum" <guido@python.org> writes:
I'm torn. While trying to implement this I came across some ugliness in PyDict_GetItem() -- it would make sense if this also called on_missing(), but it must return a value without incrementing its refcount, and isn't supposed to raise exceptions
This last bit has been a painful lie for quite some time. I don't know what can be done about it, though -- avoid the use of PyDict_GetItem() in situations where you don't expect string only dicts (so using it on globals and instance dicts would still be ok)?
-- so what to do if on_missing() returns a value that's not inserted in the dict?
Well, like some others I am a bit uncomfortable with changing the semantics of such an important operation on such an important data structure. But then I'm also not that unhappy with setdefault, so I must be weird.
If the __getattr__()-like operation that supplies and inserts a dynamic default was a separate method, we wouldn't have this problem.
Yes.
OTOH most reviewers here seem to appreciate on_missing() as a way to do various other ways of alterning a dict's __getitem__() behavior behind a caller's back -- perhaps it could even be (ab)used to implement case-insensitive lookup.
Well, I'm not sure I do. There seems to be quite a conceptual difference between being able to make a new kind of dictionary and mess with the behaviour of one that exists already, but I don't know if that matters in practice (the fact that you can currently do things like "import sys; sys.__dict__.clear()" doesn't seem to cause real problems). Finally, I'll just note that subclassing to modify the behaviour of a builtin type has generally been actively discouraged in python so far. If all dictionary lookups went through a method that you could override in Python (i.e. subclasses could replace ma_lookup, in effect) this would be easy to do in Python code. But they don't, and bug reports suggesting that they do have been rejected in the past (and I agree with the rejection, fwiw). So that rambled a bit. But in essence: I'd much prefer the addition of a method or a type to the modification of existing behaviour. Cheers, mwh -- If you're talking "useful", I'm not your bot. -- Tim Peters, 08 Nov 2001
"Guido van Rossum" <guido@python.org> writes:
If the __getattr__()-like operation that supplies and inserts a dynamic default was a separate method, we wouldn't have this problem.
Why implement it in the dictionary type at all? If, for instance, the default value functionality were provided as a decorator, it could be used with all kinds of mappings. I.e. you could have something along these lines:

class defaultwrapper(object):
    def __init__(self, base, factory):
        self.__base = base
        self.__factory = factory

    def __getitem__(self, key):
        try:
            return self.__base[key]
        except KeyError:
            value = self.__factory()
            self.__base[key] = value
            return value

    def __getattr__(self, attr):
        return getattr(self.__base, attr)

def test():
    dd = defaultwrapper({}, list)
    dd["abc"].append(1)
    dd["abc"].append(2)
    dd["def"].append(1)
    assert sorted(dd.keys()) == ["abc", "def"]
    assert sorted(dd.values()) == [[1], [1, 2]]
    assert sorted(dd.items()) == [("abc", [1, 2]), ("def", [1])]
    assert dd.has_key("abc")
    assert not dd.has_key("xyz")

The precise semantics would have to be determined yet, of course.
OTOH most reviewers here seem to appreciate on_missing() as a way to do various other ways of alterning a dict's __getitem__() behavior behind a caller's back -- perhaps it could even be (ab)used to implement case-insensitive lookup.
case-insensitive lookup could be implemented with another wrapper/decorator. If you need both case-insensitivity and a default value, you can easily stack the decorators. Bernhard -- Intevation GmbH http://intevation.de/ Skencil http://skencil.org/ Thuban http://thuban.intevation.org/
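The stacking Bernhard describes might look like this (both wrappers are repeated here in simplified form so the example is self-contained; the lowercasewrapper name and its key-folding policy are assumptions of this sketch):

```python
class defaultwrapper(object):
    """Simplified version of the default-value wrapper above."""
    def __init__(self, base, factory):
        self._base = base
        self._factory = factory

    def __getitem__(self, key):
        try:
            return self._base[key]
        except KeyError:
            value = self._base[key] = self._factory()
            return value

class lowercasewrapper(object):
    """Hypothetical wrapper: folds string keys to lower case."""
    def __init__(self, base):
        self._base = base

    def __getitem__(self, key):
        return self._base[key.lower()]

# Stack them: case folding on the outside, defaulting on the inside.
store = {}
dd = lowercasewrapper(defaultwrapper(store, list))
dd['ABC'].append(1)
dd['abc'].append(2)
assert store == {'abc': [1, 2]}
```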
On Fri, Feb 17, 2006, Guido van Rossum wrote:
But the default implementation is designed so that we can write
d = {}
d.default_factory = list
+1 I actually like the fact that you're forced to use a separate statement for setting the default_factory. From my POV, this can go into 2.5. (I was only +0 on the previous proposal and I was -1 on making it a built-in; this extension is much nicer.) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "19. A language that doesn't affect the way you think about programming, is not worth knowing." --Alan Perlis
On 2/17/06, Guido van Rossum <guido@python.org> wrote:
- There's a fundamental difference between associating the default value with the dict object, and associating it with the call. So proposals to invent a better name/signature for setdefault() don't compete.
That's a feature, not a bug. :) See below.
- The inconsistency introduced by __getitem__() returning a value for keys while get(), __contains__(), and keys() etc. don't show it, cannot be resolved usefully. You'll just have to live with it. Modifying get() to do the same thing as __getitem__() doesn't seem useful -- it just takes away a potentially useful operation.
Again, see below.
So here's a new proposal.
Let's add a generic missing-key handling method to the dict class, as well as a default_factory slot initialized to None. The implementation is like this (but in C):
def on_missing(self, key):
    if self.default_factory is not None:
        value = self.default_factory()
        self[key] = value
        return value
    raise KeyError(key)
When __getitem__() (and *only* __getitem__()) finds that the requested key is not present in the dict, it calls self.on_missing(key) and returns whatever it returns -- or raises whatever it raises. __getitem__() doesn't need to raise KeyError any more, that's done by on_missing().
Still -1. It's better, but it violates the principle of encapsulation by mixing how-you-use-it state with what-it-stores state. In doing that it has the potential to break an API documented as accepting a dict. Code that expects d[key] to raise an exception (and catches the resulting KeyError) will now silently "succeed". I believe that necessitates a PEP to document it. It also makes it harder to read code. You may expect d[key] to raise an exception, but it won't because of a single line up several pages (or in another file entirely!) d.getorset(key, func) has no such problems and has a much simpler specification:

def getorset(self, key, func):
    try:
        return self[key]
    except KeyError:
        value = self[key] = func()
        return value

-- Adam Olsen, aka Rhamphoryncus
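Adam's getorset() comes straight from his specification above; only the packaging as a dict subclass (and its name) is this sketch's assumption:

```python
class GetOrSetDict(dict):
    """Adam's getorset() specification, packaged as a subclass."""
    def getorset(self, key, func):
        try:
            return self[key]
        except KeyError:
            value = self[key] = func()
            return value

d = GetOrSetDict()
d.getorset('k', list).append(1)
assert d == {'k': [1]}
assert 'other' not in d  # d[key] and `in` keep their usual meaning
```

Because the defaulting is tied to the call rather than the dict, code receiving d as a plain dict sees no behaviour change at all.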
On 2/17/06, Adam Olsen <rhamph@gmail.com> wrote:
It also makes it harder to read code. You may expect d[key] to raise an exception, but it won't because of a single line up several pages (or in another file entirely!)
Such are the joys of writing polymorphic code. I don't really see how you can avoid this kind of confusion -- I could have given you some other mapping object that does weird stuff. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
On 2/17/06, Adam Olsen <rhamph@gmail.com> wrote:
It also makes it harder to read code. You may expect d[key] to raise an exception, but it won't because of a single line up several pages (or in another file entirely!)
Such are the joys of writing polymorphic code. I don't really see how you can avoid this kind of confusion -- I could have given you some other mapping object that does weird stuff.
The way you avoid confusion is by not working with code or programmers who write bad code. Python and polymorphic code in general pushes the responsibility for many errors from the language structure onto the programmer -- it is the programmers' responsibility to write good code. Python has never kept people from writing obscenely horrible code. We ought to have an obfuscated Python contest just to prove that point -- it is through practice and convention that readable Python code happens, not through the restrictions of the language. (Honestly, I think such a contest would be a good idea.) I know *I* at least don't like code that mixes up access and modification. Maybe not everyone does (or maybe not everyone thinks of getitem as "access", but that's unlikely). I will assert that it is Pythonic to keep access and modification separate, which is why methods and attributes are different things, and why assignment is not an expression, and why functions with side effects typically return None, or have names that are very explicit about the side effect, with names containing command verbs like "update" or "set". All of these distinguish access from modification. Note that all of what I'm saying *only* applies to the overriding of __getitem__, not the addition of any new method. I think multidict is better for the places it applies, but I see no problem at all with a new method on dictionaries that calls on_missing. -- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
Ian Bicking wrote:
I know *I* at least don't like code that mixes up access and modification. Maybe not everyone does (or maybe not everyone thinks of getitem as "access", but that's unlikely). I will assert that it is Pythonic to keep access and modification separate, which is why methods and attributes are different things, and why assignment is not an expression, and why functions with side effects typically return None, or have names that are very explicit about the side effect, with names containing command verbs like "update" or "set". All of these distinguish access from modification.
Do you never write

d[some_key].append(some_value)

This is modification and access, all in a single statement, and all without assignment operator. I don't see the setting of the default value as a modification. The default value has been there, all the time. It only is incarnated lazily. Regards, Martin
Martin v. Löwis wrote:
I know *I* at least don't like code that mixes up access and modification. Maybe not everyone does (or maybe not everyone thinks of getitem as "access", but that's unlikely). I will assert that it is Pythonic to keep access and modification separate, which is why methods and attributes are different things, and why assignment is not an expression, and why functions with side effects typically return None, or have names that are very explicit about the side effect, with names containing command verbs like "update" or "set". All of these distinguish access from modification.
Do you never write
d[some_key].append(some_value)
This is modification and access, all in a single statement, and all without assignment operator.
(d[some_key]) is access. (...).append(some_value) is modification. Expressions are compound; of course you can mix both access and modification in a single expression. d[some_key] is access that returns something, and .append(some_value) modifies that something, it doesn't modify d.
I don't see the setting of the default value as a modification. The default value has been there, all the time. It only is incarnated lazily.
It is lazily incarnated for multidict, because there is no *noticeable* side effect -- if there is any internal side effects that is an implementation detail. However for default_factory=list, the result of .keys(), .has_key(), and .items() changes when you do d[some_key]. -- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
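Ian's observation is easy to demonstrate with the collections.defaultdict that eventually shipped in Python 2.5 with essentially these semantics:

```python
from collections import defaultdict

d = defaultdict(list)
assert list(d.keys()) == []
d['k']                            # looks like pure access...
assert list(d.keys()) == ['k']    # ...but keys() now reports a new key
```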
Ian Bicking wrote:
It is lazily incarnated for multidict, because there is no *noticeable* side effect -- if there is any internal side effects that is an implementation detail. However for default_factory=list, the result of .keys(), .has_key(), and .items() changes when you do d[some_key].
That's why I think has_key and in should return True for any key. This leaves keys(), items(), and values(). From a pure point of view, they should return infinite sets. Practicality beats purity, so yes, d[k] could be considered a modifying operation. If you look carefully, you find that many access operations also have side effects. For example, .read() on a file not only returns some data, but also advances the file position. Queue.get not only returns the next item, but also removes it from the queue. Regards, Martin
On 2/17/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
That's why I think has_key and in should return True for any key. This leaves keys(), items(), and values(). From a pure point of view, they should return infinite sets. Practicality beats purity, so yes, d[k] could be considered a modifying operation.
I think practicality beats purity even for has_key/in; IMO these operations are more useful when they match keys() instead of always returning True. But someone should start writing some code to play with this. I have a working patch (including a hack for PyDict_GetItem()): python.org/sf/1433928 So there's no excuse not to be practical now. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Would people perhaps feel better if defaultdict *wasn't* a subclass of dict, but a distinct mapping type of its own? That would make it clearer that it's not meant to be a drop-in replacement for a dict in arbitrary contexts. Greg
[Greg Ewing]
Would people perhaps feel better if defaultdict *wasn't* a subclass of dict, but a distinct mapping type of its own? That would make it clearer that it's not meant to be a drop-in replacement for a dict in arbitrary contexts.
Absolutely. That's the right way to avoid Liskov violations from altered invariants and API changes. Besides, with Python's propensity for duck typing, there's no reason to subclass when we don't have to. Raymond
On 2/17/06, Guido van Rossum <guido@python.org> wrote:
On 2/17/06, Adam Olsen <rhamph@gmail.com> wrote:
It also makes it harder to read code. You may expect d[key] to raise an exception, but it won't because of a single line up several pages (or in another file entirely!)
Such are the joys of writing polymorphic code. I don't really see how you can avoid this kind of confusion -- I could have given you some other mapping object that does weird stuff.
You could pass a float in as well. But if the function is documented as taking a dict, and the programmer expects a dict.. that now has to be changed to "dict without a default". Or they have to code defensively since d[key] may or may not raise KeyError, so they must avoid depending on it either way. -- Adam Olsen, aka Rhamphoryncus
On 2/17/06, Adam Olsen <rhamph@gmail.com> wrote:
Such are the joys of writing polymorphic code. I don't really see how you can avoid this kind of confusion -- I could have given you some other mapping object that does weird stuff.
You could pass a float in as well. But if the function is documented as taking a dict, and the programmer expects a dict.. that now has to be changed to "dict without a default". Or they have to code defensively since d[key] may or may not raise KeyError, so they must avoid depending on it either way.
I'd like to see a real-life example of code that would break this way. I believe that *most* code that takes a dict will work just fine if that dict has a default factory. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Adam Olsen wrote:
You could pass a float in as well. But if the function is documented as taking a dict, and the programmer expects a dict.. that now has to be changed to "dict without a default". Or they have to code defensively since d[key] may or may not raise KeyError, so they must avoid depending on it either way.
Can you give an example of real, existing code that will break if a such a dict is passed? Regards, Martin
On 2/17/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Adam Olsen wrote:
You could pass a float in as well. But if the function is documented as taking a dict, and the programmer expects a dict.. that now has to be changed to "dict without a default". Or they have to code defensively since d[key] may or may not raise KeyError, so they must avoid depending on it either way.
Can you give an example of real, existing code that will break if a such a dict is passed?
I only got halfway through the "grep KeyError" results, but..

Demo/metaclass/Meta.py:55
Demo/tkinter/guido/AttrDialog.py:121  # Subclasses override self.classes
Lib/ConfigParser.py:623
Lib/random.py:315
Lib/string.py:191
Lib/weakref.py:56  # Currently uses UserDict but I assume it will switch to dict eventually

And the pièce de résistance.. Doc/tools/anno-api.py:51

It has this:

try:
    info = rcdict[s]
except KeyError:
    sys.stderr.write("No refcount data for %s\n" % s)
else:
    ...

rcdict is loaded from refcounts.load(). refcounts.load() calls refcounts.loadfile(), which has this (inside a loop):

try:
    entry = d[function]
except KeyError:
    entry = d[function] = Entry(function)

A prime candidate for a default. Perhaps the KeyError shouldn't ever get triggered in this case, I'm not sure. I think that's beside the point though. The programmer clearly expected it would. -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
And the pièce de résistance.. Doc/tools/anno-api.py:51
It has this:

try:
    info = rcdict[s]
except KeyError:
    sys.stderr.write("No refcount data for %s\n" % s)
else:
    ...

rcdict is loaded from refcounts.load(). refcounts.load() calls refcounts.loadfile(), which has this (inside a loop):

try:
    entry = d[function]
except KeyError:
    entry = d[function] = Entry(function)

A prime candidate for a default.
Perhaps the KeyError shouldn't ever get triggered in this case, I'm not sure. I think that's beside the point though. The programmer clearly expected it would.
Assuming the following override:

class EntryDict(dict):
    def on_missing(self, key):
        value = Entry(key)
        self[key] = value
        return value

Then what it means is that the behaviour of "missing functions get an empty refcount entry" propagates to the rcdict code. So the consequence is that the code in anno-api will never print an error message - all functions are deemed to have associated refcount data in refcount.dat. But that would be a bug in refcounts.loadfile: if it returns an EntryDict instead of a normal dict it is, in effect, returning an *infinite* dictionary that contains refcount definitions for every possible function name (some of them are just populated on demand). So *if* refcounts.loadfile was converted to use an EntryDict, it would need to return dict(d) instead of returning d directly. And this is where the question of whether has_key/__contains__ return True or False when default_factory is set is important. If they return False, then the LBYL (if key in d:) and EAFP (try/except) approaches give *different answers*. More importantly, LBYL will never have side effects, whereas EAFP may. If the methods always return True (as Martin suggests), then we retain the current behaviour where there is no real difference between the two approaches. Given the amount of time spent in recent years explaining this fact, I don't think it is an equivalence that should be broken lightly (IOW, I've persuaded myself that I agree with Martin) The alternative would be to have an additional query API "will_default" that reflects whether or not a given key is actually present in the dictionary ("if key not in d.keys()" would serve a similar purpose, but requires building the list of keys). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
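The LBYL/EAFP divergence Nick describes is observable in the collections.defaultdict that eventually shipped, where membership tests report only the keys actually present while d[key] alone triggers the factory:

```python
from collections import defaultdict

d = defaultdict(list)

present_lbyl = 'k' in d          # LBYL: no side effect, key is absent
try:
    d['k']                       # EAFP: the lookup inserts the default...
    present_eafp = True          # ...so no KeyError is raised
except KeyError:
    present_eafp = False

assert present_lbyl is False
assert present_eafp is True      # the two idioms now disagree
assert 'k' in d                  # and the EAFP probe left a key behind
```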
On 2/17/06, Nick Coghlan <ncoghlan@gmail.com> wrote:
And this is where the question of whether has_key/__contains__ return True or False when default_factory is set is important. If they return False, then the LBYL (if key in d:) and EAFP (try/except) approaches give *different answers*.
More importantly, LBYL will never have side effects, whereas EAFP may.
If the methods always return True (as Martin suggests), then we retain the current behaviour where there is no real difference between the two approaches. Given the amount of time spent in recent years explaining this fact, I don't think it is an equivalence that should be broken lightly (IOW, I've persuaded myself that I agree with Martin)
The alternative would be to have an additional query API "will_default" that reflects whether or not a given key is actually present in the dictionary ("if key not in d.keys()" would serve a similar purpose, but requires building the list of keys).
Looking at it from the "which invariants hold" POV isn't always the right perspective. Reality is that some amount of code that takes a dict won't work if you give it a dict with a default_factory. Well, that's nothing new. Some code also breaks if you pass it a dict containing key or value types it doesn't expect, or if you pass it an anydbm instance, or os.environ on Windows (which implements case-insensitive keys).
From the POV of someone who decides to use a dict with a default_factory (or overriding on_missing()), having the 'in' operator always return True is d*mn annoying -- it means that any kind of introspection of the dict doesn't work. Take for example the multiset use case. Suppose you're aware that you're using a dict with this special behavior. Now you've built up your multiset and now you want to use it. Part of your app is interested in knowing the list of values associated with each key. But another part may be interested only in whether a particular key has *any* values associated. If "key in d" returns whether that key is currently present, you can write
if key in d:
    print "whatever"

But under Martin and your proposed semantics, you'd have to write

if d.get(key):
    print "whatever"

or (worse)

if d[key]:  # inserts an empty list into the dict!
    print "whatever"

I'd much rather be able to write "if key in d" and get the result I want... -- --Guido van Rossum (home page: http://www.python.org/~guido/)
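For what it's worth, these are the semantics that collections.defaultdict eventually adopted in Python 2.5: membership and get() reflect only the keys actually stored, so introspection works the way Guido wants here.

```python
from collections import defaultdict

d = defaultdict(list)
d['seen'].append(1)

assert 'seen' in d              # reflects real keys...
assert 'unseen' not in d        # ...and does not fabricate entries
assert d.get('unseen') is None  # get() bypasses the factory too
```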
Guido van Rossum wrote:
I'd much rather be able to write "if key in d" and get the result I want...
Somewhere else in this byzantine thread, I realised that what was really bothering me was the conditional semantics that dict ended up with (i.e., its behaviour changed significantly if the default factory was set). If we go back to your idea of collection.defaultdict (or Alex's name collection.autodict), then the change in semantics bothers me a lot less, and I'd be all in favour of the more usual variant (where "key in d" implies "key in d.keys()"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Adam Olsen wrote:
Demo/metaclass/Meta.py:55
That wouldn't break. If you had actually read the code, you would have seen it is

try:
    ga = dict['__getattr__']
except KeyError:
    pass

How would it break if dict had a default factory? ga would get the __getattr__ value, and everything would be fine. The KeyError is ignored, after all.
Demo/tkinter/guido/AttrDialog.py:121 # Subclasses override self.classes
Hmm:

try:
    cl = self.classes[c]
except KeyError:
    cl = 'unknown'

So cl wouldn't be 'unknown'. Why would that be a problem?
Lib/ConfigParser.py:623
try:
    v = map[var]
except KeyError:
    raise InterpolationMissingOptionError(
        option, section, rest, var)

So there is no InterpolationMissingOptionError. *Of course not*. The whole point would be to provide a value for all interpolation variables.
Lib/random.py:315
This entire function samples k elements with indices between 0 and len(population). Now, people "shouldn't" be passing dictionaries in in the first place; that specific code tests whether there are valid values at indices 0, n//2, and n. If the dictionary isn't really a sequence (i.e. if it doesn't provide values at all indices), the function may later fail even if it passes that test. With a default-valued dictionary, the function would not fail, but a large number of samples might be the default value.
Lib/string.py:191
Same as ConfigParser: the interpolation will always succeed, interpolating all values (rather than leaving $identifier in the string). That would be precisely the expected behaviour.
Lib/weakref.py:56 # Currently uses UserDict but I assume it will switch to dict eventually
Or, rather, UserDict might grow the on_missing feature as well. That is irrelevant for this issue, though:

o = self.data[key]()
if o is None:
    raise KeyError, key    # line 56
else:
    return o

So we are looking for lookup failures in self.data, here: self.data is initialized to {} in UserDict, with no default factory. So there cannot be a change in behaviour.
Perhaps the KeyError shouldn't ever get triggered in this case, I'm not sure. I think that's besides the point though. The programmer clearly expected it would.
No. I now see your problem: An "except KeyError" does *not* mean that the programmer "clearly expects it will" raise a KeyError. Instead, the programmer expects it *might* raise a KeyError, and tries to deal with this situation. If the situation doesn't arise, the code continues just fine. Regards, Martin
Adam Olsen wrote:
Still -1. It's better, but it violates the principle of encapsulation by mixing how-you-use-it state with what-it-stores state. In doing that it has the potential to break an API documented as accepting a dict. Code that expects d[key] to raise an exception (and catches the resulting KeyError) will now silently "succeed".
Of course it will, and without quotes. That's the whole point.
I believe that necessitates a PEP to document it.
You are missing the rationale of the PEP process. The point is *not* documentation. The point of the PEP process is to channel and collect discussion, so that the BDFL can make a decision. The BDFL is not bound at all to the PEP process. To document things, we use (or should use) documentation. Regards, Martin
On 2/17/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Adam Olsen wrote:
Still -1. It's better, but it violates the principle of encapsulation by mixing how-you-use-it state with what-it-stores state. In doing that it has the potential to break an API documented as accepting a dict. Code that expects d[key] to raise an exception (and catches the resulting KeyError) will now silently "succeed".
Of course it will, and without quotes. That's the whole point.
Consider these two pieces of code:

    if key in d:
        dosomething(d[key])
    else:
        dosomethingelse()

    try:
        dosomething(d[key])
    except KeyError:
        dosomethingelse()

Before they were the same (assuming dosomething() won't raise KeyError). Now they would behave differently. The latter is even the preferred form, since it only invokes a single dict lookup: On 2/16/06, Delaney, Timothy (Tim) <tdelaney@avaya.com> wrote:
    try:
        v = d[key]
    except KeyError:
        v = d[key] = value
Obviously this example could be changed to use default_factory, but I find it hard to believe the only use of that pattern is to set default keys. Of course you could just assume that of all the people passing your function a dict, none of them will ever use the default_factory when they build the dict. Should be easy, right? -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
The latter is even the preferred form, since it only invokes a single dict lookup:
On 2/16/06, Delaney, Timothy (Tim) <tdelaney@avaya.com> wrote:
    try:
        v = d[key]
    except KeyError:
        v = d[key] = value
Obviously this example could be changed to use default_factory, but I find it hard to believe the only use of that pattern is to set default keys.
I'd go further -- I doubt many cases where try:except KeyError: is used could be refactored to use default_factory. default_factory can only be used when the default can be determined sometime close to the time the dictionary is created, when the default does not depend on the context in which the key is fetched, and when the default value will not cause unintended side effects if the dictionary leaks out of the code where it was initially used (as when the dictionary is returned to someone). Any default factory is more often an algorithmic detail than truly part of the nature of the dictionary itself. For instance, here is something I do often:

    try:
        value = cache[key]
    except KeyError:
        ... calculate value ...
        cache[key] = value

Realistically, factoring "... calculate value ..." into a factory that calculates the value would be difficult, produce highly unreadable code, perform worse, and have more bugs. For simple factories like "list" and "dict" the factory works okay. For immutable values like 0 and None, the factory (lambda: 0 and lambda: None) is a wasteful way to create a default value (because storing the value in the dictionary is unnecessary). For non-trivial factories the whole thing falls apart, and one can just hope that no one will try to use this feature and will instead stick with the try:except KeyError: technique. -- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
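Ian's caching idiom can be made runnable, which also makes his objection concrete: a zero-argument factory is called with no arguments, so it can never see which key was missing and cannot compute a key-dependent value. (The `expensive` computation below is a stand-in chosen for illustration.)

```python
# Ian's try/except caching idiom, with a placeholder computation.
cache = {}

def expensive(key):
    # stand-in for a real, key-dependent calculation
    return key * key

def lookup(key):
    try:
        value = cache[key]
    except KeyError:
        value = expensive(key)   # the calculation needs the key itself
        cache[key] = value
    return value

assert lookup(7) == 49
assert cache == {7: 49}
# A default_factory takes no arguments, so expensive(key) cannot be
# expressed as one without contortions.
```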
Adam Olsen wrote:
Consider these two pieces of code:
    if key in d:
        dosomething(d[key])
    else:
        dosomethingelse()
    try:
        dosomething(d[key])
    except KeyError:
        dosomethingelse()
Before they were the same (assuming dosomething() won't raise KeyError). Now they would behave differently.
I personally think they should continue to do the same thing, i.e. "in" should return True if there is a default; in the current proposal, it should invoke the default factory. But that's beside the point: Where is the real example where this difference would matter? (I'm not asking for a realistic example, I'm asking for a real one) Regards, Martin
Martin v. Löwis wrote:
Adam Olsen wrote:
Consider these two pieces of code:
    if key in d:
        dosomething(d[key])
    else:
        dosomethingelse()
    try:
        dosomething(d[key])
    except KeyError:
        dosomethingelse()
Before they were the same (assuming dosomething() won't raise KeyError). Now they would behave differently.
I personally think they should continue to do the same thing, i.e. "in" should return True if there is a default; in the current proposal, it should invoke the default factory.
As I believe Fredrik implied, this would break the symmetry between "x in d" and "x in d.keys()" (unless d.keys() enumerates all possible keys), and either .get() would become useless, or it would also act in inconsistent ways. I think these broken expectations are much worse than what Adam's talking about.
But that's beside the point: Where is the real example where this difference would matter? (I'm not asking for a realistic example, I'm asking for a real one)
Well, here's a kind of an example: WSGI specifies that the environment must be a dictionary, and nothing but a dictionary. I think it would have to be updated to say that it must be a dictionary with default_factory not set, as default_factory would break the predictability that was the reason WSGI specified exactly a dictionary (and not a dictionary-like object or subclass). So there's something that becomes brokenish. I think this is the primary kind of breakage -- dictionaries with default_factory set are not acceptable objects when a "plain" dictionary is expected. Of course, it can always be claimed that it's the fault of the person who passes in such a dictionary (they could have passed in None and it would probably also be broken). But now all of a sudden I have to say "x(a) takes a dictionary argument. Oh, and don't you dare use the default_factory feature!" where before I could just say "dictionary".

And KeyError just... disappears. KeyError is one of those errors that you *expect* to happen (maybe the "Error" part is a misnomer); having it disappear is a major change.

Also, I believe there are two ways to handle thread safety, both of which are broken:

1) d[key] gets the GIL, and thus while default_factory is being called the GIL is locked

2) d[key] doesn't get the GIL, and so d[key].append(1) may not actually lead to 1 being in d[key] if another thread is appending something to the same key at the same time, and the key is not yet present in d.

Admittedly I don't understand the ins and outs of the GIL, so the first case might not actually need to acquire the GIL. -- Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
Ian Bicking wrote:
Well, here's a kind of an example: WSGI specifies that the environment must be a dictionary, and nothing but a dictionary. I think it would have to be updated to say that it must be a dictionary with default_factory not set, as default_factory would break the predictability that was the reason WSGI specified exactly a dictionary (and not a dictionary-like object or subclass). So there's something that becomes brokenish.
I don't understand. In the rationale of PEP 333, it says "The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface." That rationale is not endangered: if the environment continues to be a dict exactly, servers continue to be guaranteed what precise set of operations is available on the environment. Of course, that may change from Python version to Python version, as new dict methods get defined. But that should have been clear when the PEP was written: the dict type itself may evolve, providing additional features that weren't present in earlier versions. Even now, some dict implementations have setdefault(), others don't.
KeyError is one of those errors that you *expect* to happen (maybe the "Error" part is a misnomer); having it disappear is a major change.
Well, as you say: you get a KeyError if there is an error with the key. With a default_factory, there isn't normally an error with the key.
Also, I believe there's two ways to handle thread safety, both of which are broken:
1) d[key] gets the GIL, and thus while default_factory is being called the GIL is locked
2) d[key] doesn't get the GIL and so d[key].append(1) may not actually lead to 1 being in d[key] if another thread is appending something to the same key at the same time, and the key is not yet present in d.
It's 1), primarily. If default_factory is written in Python, though (e.g. if it is *not* list()), the interpreter will give up the GIL every N byte code instructions (or when a blocking operation is executed). Notice that the same issue already exists with __hash__ for the key. Also notice that the same issue already exists with any kind of manipulation of a dictionary in multiple threads, today: if you do

    try:
        d[k].append(v)
    except KeyError:
        d[k] = [v]

then two threads might interleave their execution of the except-suite. Regards, Martin
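The interleaving Martin describes is usually closed today with an explicit lock around the check-and-insert. The following sketch (my addition, not from the thread) guards the idiom so two threads cannot both take the except-suite for the same key:

```python
import threading

# Shared multimap and a lock protecting its check-and-insert.
d = {}
lock = threading.Lock()

def record(k, v):
    # The whole try/except runs under the lock, so no second thread can
    # see the key as missing while the first is still inserting it.
    with lock:
        try:
            d[k].append(v)
        except KeyError:
            d[k] = [v]

threads = [threading.Thread(target=record, args=("key", i)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(d["key"]) == list(range(50))  # no appends were lost
```

Without the lock, the except-suite `d[k] = [v]` in one thread can overwrite a list another thread just created, losing values.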
On Feb 18, 2006, at 2:33 AM, Martin v. Löwis wrote:
I don't understand. In the rationale of PEP 333, it says "The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface."
That rationale is not endangered: if the environment continues to be a dict exactly, servers continue to be guaranteed what precise set of operations is available on the environment.
Yes it is endangered.
Well, as you say: you get a KeyError if there is an error with the key. With a default_factory, there isn't normally an error with the key.
But there should be. Consider the case of two servers. One which takes all the items out of the dictionary (using items()) and puts them in some other data structure. Then it checks if the "Date" header has been set. It was not, so it adds it. Consider another similar server which checks if the "Date" header has been set on the dict passed in by the user. The default_factory then makes one up. Different behavior due to internal implementation details of how the server uses the dict object, which is what the restriction to _exactly_ dict prevents. Consider another server which takes the dict instance and transports it across thread boundaries, from the wsgi-app's thread to the main server thread. Because WSGI specifies that you can only use 'dict', and the server checked that type(obj) == dict, it is guaranteed that using the dict won't run thread-unsafe code. That is now broken, since dict.__getitem__ can now invoke arbitrary user code. That is a major change. James
James Y Knight wrote:
But there should be. Consider the case of two servers. One which takes all the items out of the dictionary (using items()) and puts them in some other data structure. Then it checks if the "Date" header has been set. It was not, so it adds it. Consider another similar server which checks if the "Date" header has been set on the dict passed in by the user. The default_factory then makes one up. Different behavior due to internal implementation details of how the server uses the dict object, which is what the restriction to _exactly_ dict prevents.
Right. I would claim that this is an artificial example: you can't provide a HTTP_DATE value in a default_factory implementation, since you don't know what the key is. However, you are now making up a different rationale from the one the PEP specifies: The PEP says that you need an "exact dict" so that everybody knows precisely how the dictionary behaves; instead of having to define which precise subset of the dict API is to be used. *That* goal is still achieved: everybody knows that the dict might have an on_missing/default_factory implementation. So to find out whether HTTP_DATE has a value (which might be defaulted), you need to invoke d['HTTP_DATE'].
Consider another server which takes the dict instance and transports it across thread boundaries, from the wsgi-app's thread to the main server thread. Because WSGI specifies that you can only use 'dict', and the server checked that type(obj) == dict, it is guaranteed that using the dict won't run thread-unsafe code. That is now broken, since dict.__getitem__ can now invoke arbitrary user code. That is a major change.
Not at all. dict.__getitem__ could always invoke arbitrary user code, through __hash__. Regards, Martin
On 2/18/06, James Y Knight <foom@fuhm.net> wrote:
On Feb 18, 2006, at 2:33 AM, Martin v. Löwis wrote:
Well, as you say: you get a KeyError if there is an error with the key. With a default_factory, there isn't normally an error with the key.
But there should be. Consider the case of two servers. One which takes all the items out of the dictionary (using items()) and puts them in some other data structure. Then it checks if the "Date" header has been set. It was not, so it adds it. Consider another similar server which checks if the "Date" header has been set on the dict passed in by the user. The default_factory then makes one up. Different behavior due to internal implementation details of how the server uses the dict object, which is what the restriction to _exactly_ dict prevents.
It just occurred to me, what effect does this have on repr? Does it attempt to store the default_factory in the representation, or does it remove it? Is it even possible to store a reference to a builtin such as list and have eval restore it? -- Adam Olsen, aka Rhamphoryncus
At 01:44 PM 02/18/2006 -0500, James Y Knight wrote:
On Feb 18, 2006, at 2:33 AM, Martin v. Löwis wrote:
I don't understand. In the rationale of PEP 333, it says "The rationale for requiring a dictionary is to maximize portability between servers. The alternative would be to define some subset of a dictionary's methods as being the standard and portable interface."
That rationale is not endangered: if the environment continues to be a dict exactly, servers continue to be guaranteed what precise set of operations is available on the environment.
Yes it is endangered.
So we'll update the spec to say you can't use a dict that has the default set. It's not reasonable to expect that language changes might not require updates to a PEP. Certainly, we don't have to worry about being backward compatible when it's only Python 2.5 that's affected by the change. :)
On Sat, 18 Feb 2006 00:52:51 +0100, =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?= <martin@v.loewis.de> wrote:
Adam Olsen wrote:
Consider these two pieces of code:
    if key in d:
        dosomething(d[key])
    else:
        dosomethingelse()
    try:
        dosomething(d[key])
    except KeyError:
        dosomethingelse()
Before they were the same (assuming dosomething() won't raise KeyError). Now they would behave differently.
I personally think they should continue to do the same thing, i.e. "in" should return True if there is a default; in the current proposal, it should invoke the default factory.
But that's beside the point: Where is the real example where this difference would matter? (I'm not asking for a realistic example, I'm asking for a real one)
My guess is that realistically default_factory will be used to make clean code for filling a dict, and then turning the factory off if it's to be passed into unknown contexts. Those contexts can then use old code to do as above, or if worth it can temporarily set a factory to do some work. Tightly coupled code I guess could pass factory-enabled dicts between each other. IOW, no code should break unless you pass a factory-enabled dict where you shouldn't ;-) That said, maybe enabling/disabling could be separate from d.default_factory (e.g., d.defaults_enabled) as that could allow e.g. foo(**kw) more options in how to copy kw and what foo could do. Would total copy including defaulting state be best? What other copies must be sanitized? type('Foo',(), **{'this':'one?'}) It will be interesting to see what comes out of the woodwork ;-) Regards, Bengt Richter
Bengt Richter wrote:
My guess is that realistically default_factory will be used to make clean code for filling a dict, and then turning the factory off if it's to be passed into unknown contexts.
This suggests that maybe the autodict behaviour shouldn't be part of the dict itself, but provided by a wrapper around the dict. Then you can fill the dict through the wrapper, and still have a normal dict underneath to use for other purposes. Greg
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Bengt Richter wrote:
My guess is that realistically default_factory will be used to make clean code for filling a dict, and then turning the factory off if it's to be passed into unknown contexts.
This suggests that maybe the autodict behaviour shouldn't be part of the dict itself, but provided by a wrapper around the dict.
Then you can fill the dict through the wrapper, and still have a normal dict underneath to use for other purposes.
I prefer this to changing dictionaries directly. The actual wrapper could sit in the collections module, ready for subclassing/replacement of the on_missing method. - Josiah
On 2/17/06, Adam Olsen <rhamph@gmail.com> wrote:
    if key in d:
        dosomething(d[key])
    else:
        dosomethingelse()
    try:
        dosomething(d[key])
    except KeyError:
        dosomethingelse()
I agree with the gut feeling that these should still do the same thing. Could we modify d.get() instead?
    >>> class ddict(dict):
    ...     default_value_factory = None
    ...     def get(self, k, d=None):
    ...         v = super(ddict, self).get(k, d)
    ...         if v is not None or d is not None or self.default_value_factory is None:
    ...             return v
    ...         return self.setdefault(k, self.default_value_factory())
    ...
    >>> d = ddict()
    >>> d.default_value_factory = list
    >>> d.get('list', [])
    []
    >>> d['list']
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: 'list'
    >>> d.get('list').append(5)
    >>> d['list']
    [5]
There was never an exception raised by d.get so this wouldn't change (assuming the C is implemented more carefully than the python above). What are the problems with this other than, like setdefault, it only works on values with mutator methods (i.e., no counting dicts)? Is the lack of counting dicts that d.__getitem__ supports a deal breaker?
    >>> d.default_value_factory = int
    >>> d.get('count') += 1
    SyntaxError: can't assign to function call
How does the above either in dict or a subclass compare to five line or smaller custom subclasses using something like the following?

    def append(self, k, val):
        self.setdefault(k, []).append(val)

or

    def accumulate(self, k, val):
        try:
            self[k] += val
        except KeyError:
            self[k] = val

Michael -- Michael Urman http://www.tortall.net/mu/blog
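Michael's two helpers can be assembled into a complete, runnable subclass to compare against the default_factory approach (the class name `HelperDict` is made up here; the method bodies are his):

```python
class HelperDict(dict):
    # multimap-style: append val to the list stored under k
    def append(self, k, val):
        self.setdefault(k, []).append(val)

    # counting/summing style: add val to whatever is stored under k
    def accumulate(self, k, val):
        try:
            self[k] += val
        except KeyError:
            self[k] = val

d = HelperDict()
d.append("words", "spam")
d.append("words", "eggs")
d.accumulate("count", 1)
d.accumulate("count", 1)
assert d["words"] == ["spam", "eggs"]
assert d["count"] == 2
```

Note that `accumulate` covers the counting-dict case that a mutator-only approach like `d.get(k).append(...)` cannot, which is the gap Michael points out.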
Martin v. Löwis wrote:
Adam Olsen wrote:
Still -1. It's better, but it violates the principle of encapsulation by mixing how-you-use-it state with what-it-stores state. In doing that it has the potential to break an API documented as accepting a dict. Code that expects d[key] to raise an exception (and catches the resulting KeyError) will now silently "succeed".
Of course it will, and without quotes. That's the whole point.
I believe that necessitates a PEP to document it.
You are missing the rationale of the PEP process. The point is *not* documentation. The point of the PEP process is to channel and collect discussion, so that the BDFL can make a decision. The BDFL is not bound at all to the PEP process.
To document things, we use (or should use) documentation.
One could wish this ideal had been the case for the import extensions defined in PEP 302. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC www.holdenweb.com PyCon TX 2006 www.python.org/pycon/
On 2/19/06, Steve Holden <steve@holdenweb.com> wrote:
You are missing the rationale of the PEP process. The point is *not* documentation. The point of the PEP process is to channel and collect discussion, so that the BDFL can make a decision. The BDFL is not bound at all to the PEP process.
To document things, we use (or should use) documentation.
One could wish this ideal had been the case for the import extensions defined in PEP 302.
(A bit off-topic, but that hit home, so I'll reply...) Agreed, and it's my fault they weren't, to some extent. I did try to find a suitable place, but the import docs are generally fairly scattered, and there wasn't a particularly good place to put the changes. Any suggestions would be gratefully accepted... Paul.
On Mon, 20 Feb 2006 12:01:02 +0000, "Paul Moore" <p.f.moore@gmail.com> wrote:
On 2/19/06, Steve Holden <steve@holdenweb.com> wrote:
You are missing the rationale of the PEP process. The point is *not* documentation. The point of the PEP process is to channel and collect discussion, so that the BDFL can make a decision. The BDFL is not bound at all to the PEP process.
To document things, we use (or should use) documentation.
One could wish this ideal had been the case for the import extensions defined in PEP 302.
(A bit off-topic, but that hit home, so I'll reply...)
Agreed, and it's my fault they weren't, to some extent. I did try to find a suitable place, but the import docs are generally fairly scattered, and there wasn't a particularly good place to put the changes.
Any suggestions would be gratefully accepted...
I've always thought we could leverage google to find good doc information if we would just tag it in some consistent way. E.g., if you wanted to post a partial draft of some pep doc, you could post it here and/or c.l.p with

    PEP 302 docs version 2 <<--ENDMARK--
    text here
    ========= (use REST if ambitious)
    ...
    --ENDMARK--

If we had some standard tag lines, we could make an urllib tool to harvest the material and merge the most recent version paragraphs and auto-post it as html in one place for draft docs on python.org. The same tagged section technique could be used re any documentation, so long as update and/or addition text can be associated with where it should be tagged in as references. I think mouseover popup hints with clickable js popups for the additional material would be cool. It would mean automatically editing the doc to insert the hints, though. Well, nothing there is rocket science, but neither is a wall of bricks so long and high you can't live long enough to complete it ;-) Regards, Bengt Richter
Guido van Rossum wrote:
On 2/16/06, Guido van Rossum <guido@python.org> wrote:
Over lunch with Alex Martelli, he proposed that a subclass of dict with this behavior (but implemented in C) would be a good addition to the language. It looks like it wouldn't be hard to implement. It could be a builtin named defaultdict. The first, required, argument to the constructor should be the default value. Remaining arguments (even keyword args) are passed unchanged to the dict constructor.
Thanks for all the constructive feedback. Here are some responses and a new proposal.
- Yes, I'd like to kill setdefault() in 3.0 if not sooner.
- It would indeed be nice if this was an optional feature of the standard dict type.
- I'm ignoring the request for other features (ordering, key transforms). If you want one of these, write a PEP!
- Many, many people suggested to use a factory function instead of a default value. This is indeed a much better idea (although slightly more cumbersome for the simplest cases).
One might think about calling it if it were callable, otherwise using it literally. Of course this would require jiggery-pokery in the cases where you actually *wanted* the default value to be a callable (you'd have to provide a callable that returns the callable as a default).
- Some people seem to think that a subclass constructor signature must match the base class constructor signature. That's not so. The subclass constructor must just be careful to call the base class constructor with the correct arguments. Think of the subclass constructor as a factory function.
True, but then this does get in the way of treating the base dict and its defaulting subtype polymorphically. That might not be a big issue.
- There's a fundamental difference between associating the default value with the dict object, and associating it with the call. So proposals to invent a better name/signature for setdefault() don't compete. (As to one specific such proposal, adding an optional bool as the 3rd argument to get(), I believe I've explained enough times in the past that flag-like arguments that always get a constant passed in at the call site are a bad idea and should usually be refactored into two separate methods.)
- The inconsistency introduced by __getitem__() returning a value for keys while get(), __contains__(), and keys() etc. don't show it, cannot be resolved usefully. You'll just have to live with it. Modifying get() to do the same thing as __getitem__() doesn't seem useful -- it just takes away a potentially useful operation.
So here's a new proposal.
Let's add a generic missing-key handling method to the dict class, as well as a default_factory slot initialized to None. The implementation is like this (but in C):
    def on_missing(self, key):
        if self.default_factory is not None:
            value = self.default_factory()
            self[key] = value
            return value
        raise KeyError(key)
When __getitem__() (and *only* __getitem__()) finds that the requested key is not present in the dict, it calls self.on_missing(key) and returns whatever it returns -- or raises whatever it raises. __getitem__() doesn't need to raise KeyError any more, that's done by on_missing().
The on_missing() method can be overridden to implement any semantics you want when the key isn't found: return a value without inserting it, insert a value without copying it, only do it for certain key types/values, make the default incorporate the key, etc.
But the default implementation is designed so that we can write
    d = {}
    d.default_factory = list
to create a dict that inserts a new list whenever a key is not found in __getitem__(), which is most useful in the original use case: implementing a multiset so that one can write
d[key].append(value)
to add a new key/value to the multiset without having to handle the case separately where the key isn't in the dict yet. This also works for sets instead of lists:
    d = {}
    d.default_factory = set
    ...
    d[key].add(value)
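The revised proposal above can be exercised today as a pure-Python subclass (the real proposal adds on_missing and default_factory to the dict type itself, in C; the subclass here is only a sketch for trying out the semantics):

```python
class ProposedDict(dict):
    default_factory = None

    # Guido's on_missing, verbatim from the proposal.
    def on_missing(self, key):
        if self.default_factory is not None:
            value = self.default_factory()
            self[key] = value
            return value
        raise KeyError(key)

    # __getitem__ delegates misses to on_missing(); with no factory set,
    # on_missing() raises the KeyError, so plain-dict behaviour survives.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return self.on_missing(key)

d = ProposedDict()
d.default_factory = list
d["colors"].append("red")
d["colors"].append("blue")
assert d["colors"] == ["red", "blue"]

s = ProposedDict()
s.default_factory = set
s["colors"].add("red")
s["colors"].add("red")
assert s["colors"] == {"red"}
```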
This seems like a very good compromise. [non-functional alternatives ...]
regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC www.holdenweb.com PyCon TX 2006 www.python.org/pycon/
On Fri, 2006-02-17 at 11:09 -0800, Guido van Rossum wrote:
Thanks for all the constructive feedback. Here are some responses and a new proposal.
- Yes, I'd like to kill setdefault() in 3.0 if not sooner.
A worthy goal, but not possible unless you want to break existing code. I don't think it's worth a DeprecationWarning either. Slating it for removal in 3.0 seems fine. Everything else about your proposal seems great. -Barry
On 2/16/06, Guido van Rossum <guido@python.org> wrote:

A bunch of Googlers were discussing the best way of doing the ...

Wow, what a great discussion! As you'll recall, I had also mentioned the "callable factory" as a live possibility, and there seems to be a strong sentiment in favor of that; not really a "weakness case" for HOFs, as you feared it might be during the lunchtime discussion.

Out of all I've read here, I like the idea of having a collections.autodict (a much nicer name than defaultdict, a better collocation for 2.5 than the builtins). One point I think nobody has made is that whenever reasonably possible the setting of a callback (the callable factory here) should include *a and **k to use when calling back. So, for example:

    ad = collections.autodict(copy.copy, whatever)

would easily cover the use case of Google's DefaultDict (yes, partial would also cover this use case, but having *a and **k is usefully more general). If you're convinced copy.copy is an overwhelmingly popular use case (I'm not, personally), then this specific idiom might also be summarized in a classmethod, a la

    ad = collections.autodict.by_copy(whatever)

This way, all autodicts would start out empty (and be filled by update if needed). An alternative would be to have autodict's ctor have the same signature as dict's, with a separate .set_initial method to pass the factory (and *a, **k) -- this way an autodict might start out populated, but would always start with some default factory, such as lambda: None I guess. I think the first alternative (autodict always starts empty, but with a specifically chosen factory [including *a, **k]) is more useful. Alex
On 2/17/06, Alex Martelli <aleaxit@gmail.com> wrote:
On 2/16/06, Guido van Rossum <guido@python.org> wrote:

A bunch of Googlers were discussing the best way of doing the ...

Wow, what a great discussion! As you'll recall, I had also mentioned the "callable factory" as a live possibility, and there seems to be a strong sentiment in favor of that; not really a "weakness case" for HOFs, as you feared it might be during the lunchtime discussion.
:-) You seem to have missed my revised proposal.
Out of all I've read here, I like the idea of having a collections.autodict (a much nicer name than defaultdict, a better collocation for 2.5 than the builtins). One point I think nobody has made is that whenever reasonably possible the setting of a callback (the callable factory here) should include *a and **k to use when calling back.
That's your C/C++ brain talking. :-) If you need additional data passed to a callback (to be provided at the time the callback is *set*, not when it is *called*) the customary approach is to make the callback a parameterless lambda; you can also use a bound method, etc. There's no need to complicate every piece of code that calls a callback with the machinery to store and use arbitrary arguments and keyword arguments.

I forgot to mention in my revised proposal that the API for setting the default_factory is slightly odd:

    d = {}  # or dict()
    d.default_factory = list

rather than

    d = dict(default_factory=list)

This is of course because we cut off that way when we defined what arbitrary keyword arguments to the dict constructor would do. My original proposal solved this by creating a subclass. But there were several suggestions that this would be fine functionality to add to the standard dict type -- and then I really don't see any other way to do this. (Yes, I could have a set_default_factory() method -- but a simple settable attribute seems more pythonic!) -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
But there were several suggestions that this would be fine functionality to add to the standard dict type -- and then I really don't see any other way to do this.
Given the constructor problem, and the issue with "this function expects a plain dictionary", I think your original instinct to use a subclass may have been correct. The constructor is much cleaner that way:

    # bag-like behavior
    dd = collections.autodict(int)
    for elem in collection:
        dd[elem] += 1

    # setdefault-like behavior
    dd = collections.autodict(list)
    for page_number, page in enumerate(book):
        for word in page.split():
            dd[word].append(page_number)

And it can be a simple fact that for an autodict, "if key in d" and "d[key]" may give different answers. Much cleaner than making the semantics of normal dicts dependent on:

a. whether or not on_missing has been overridden
b. whether or not default_factory has been set

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
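Nick's constructor-based variant can be sketched as a subclass (collections.autodict never existed; the name and implementation below follow the thread's discussion, not any shipped code). Passing the factory at construction time keeps plain dicts untouched, at the cost of the "in" vs. "[]" asymmetry Nick accepts:

```python
class autodict(dict):
    def __init__(self, factory, *args, **kwargs):
        # Remaining arguments go to the dict constructor unchanged.
        dict.__init__(self, *args, **kwargs)
        self._factory = factory

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            value = self._factory()
            self[key] = value
            return value

# bag-like behavior
bag = autodict(int)
for elem in ["a", "b", "a"]:
    bag[elem] += 1
assert bag["a"] == 2 and bag["b"] == 1

# "key in d" and "d[key]" now disagree, as Nick notes:
assert "c" not in bag
assert bag["c"] == 0   # the lookup inserted the default...
assert "c" in bag      # ...so now the key *is* present
```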
I really don't like that defaultdict (or a dict extension) means that x[not_found] will have noticeable side effects. This all seems to be a roundabout way to address one important use case of a dictionary with multiple values for each key, and in the process breaking an important quality of good Python code: that attribute and getitem access not have noticeable side effects.

So, here's a proposed interface for a new multidict object, borrowing some methods from Set but mostly from dict. Some things that seemed particularly questionable to me are marked with ??.

    class multidict:

        def __init__([mapping], [**kwargs]):
            """
            Create a multidict:
              multidict() -> new empty multidict
              multidict(mapping) -> equivalent to:
                  ob = multidict()
                  ob.update(mapping)
              multidict(**kwargs) -> equivalent to:
                  ob = multidict()
                  ob.update(kwargs)
            """

        def __contains__(key):
            """ True if ``self[key]`` is true """

        def __getitem__(key):
            """
            Returns a list of items associated with the given key.
            If nothing, then the empty list.
            ??: Is the list mutable, and to what effect?
            """

        def __delitem__(key):
            """
            Removes any instances of key from the dictionary.  Does not
            raise an error if there are no values associated.
            ??: Should this raise a KeyError sometimes?
            """

        def __setitem__(key, value):
            """
            Same as:
                del self[key]
                self.add(key, value)
            """

        def get(key, default=[]):
            """
            Returns a list of items associated with the given key, or if
            that list would be empty it returns default
            """

        def getfirst(key, default=None):
            """
            Equivalent to:
                if key in self:
                    return self[key][0]
                else:
                    return default
            """

        def add(key, value):
            """
            Adds the value with the given key, so that
            self[key][-1] == value
            """

        def remove(key, value):
            """
            Remove (key, value) from the mapping (raising KeyError if
            not present).
            """

        def discard(key, value):
            """
            Remove like self.remove(key, value), except do not raise
            KeyError if missing.
            """

        def pop(key):
            """
            Removes key and returns the value; returns [] and does
            nothing if the key is not found.
            """

        def keys():
            """ Returns all the keys which have some associated value. """

        def items():
            """
            Returns [(key, value)] for every key/value pair.  Keys that
            have multiple values will be returned as multiple
            (key, value) tuples.
            """

        def __len__():
            """
            Equivalent to len(self.items())
            ??: Not len(self.keys())?
            """

        def update(E, **kwargs):
            """
            If E has iteritems then:
                for k, v in E.iteritems():
                    self.add(k, v)
            elif E has keys:
                for k in E:
                    self.add(k, E[k])
            else:
                for k, v in E:
                    self.add(k, v)
            ??: Should **kwargs be allowed?  If so, should the values be
            sequences?
            """

        # iteritems, iterkeys, iter, has_key, copy, popitem, values,
        # clear with obvious implementations
On 2/17/06, Ian Bicking <ianb@colorstudy.com> wrote:
I really don't like that defaultdict (or a dict extension) means that x[not_found] will have noticeable side effects. This all seems to be a roundabout way to address one important use case of a dictionary with multiple values for each key, and in the process breaking an important quality of good Python code, that attribute and getitem access not have noticeable side effects.
So, here's a proposed interface for a new multidict object, borrowing some methods from Set but mostly from dict. Some things that seemed particularly questionable to me are marked with ??.
Have you seen my revised proposal (which is indeed an addition to the standard dict rather than a subclass)? Your multidict addresses only one use case for the proposed behavior; what's so special about dicts of lists that they should have special support? What about dicts of dicts, dicts of sets, dicts of user-defined objects? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
On 2/17/06, Ian Bicking <ianb@colorstudy.com> wrote:
I really don't like that defaultdict (or a dict extension) means that x[not_found] will have noticeable side effects. This all seems to be a roundabout way to address one important use case of a dictionary with multiple values for each key, and in the process breaking an important quality of good Python code, that attribute and getitem access not have noticeable side effects.
So, here's a proposed interface for a new multidict object, borrowing some methods from Set but mostly from dict. Some things that seemed particularly questionable to me are marked with ??.
Have you seen my revised proposal (which is indeed an addition to the standard dict rather than a subclass)?
Yes, and though it is more general it has the same issue of side effects. Doesn't it seem strange that getting an item will change the values of .keys(), .items(), and .has_key()?
Your multidict addresses only one use case for the proposed behavior; what's so special about dicts of lists that they should have special support? What about dicts of dicts, dicts of sets, dicts of user-defined objects?
What's so special? 95% (probably more!) of current use of .setdefault() is .setdefault(key, []).append(value). Also, since when do features have to address all possible cases? Certainly there are other cases, and I think they can be answered with other classes. Here are some current options:

.setdefault() -- works with any subtype; slightly less efficient than what you propose. Awkward to read; doesn't communicate intent very well.

UserDict -- works for a few cases where you want to make dict-like objects. Messes up the concept of identity and containment -- the resulting objects both "are" dictionaries and "contain" a dictionary (obj.data).

DictMixin -- does anything you can possibly want, requiring only the overriding of a couple of methods.

dict subclassing -- does anything you want as well, but you typically have to override many more methods than with DictMixin (and if you don't have to override every method, that's not documented in any way). dict isn't written with subclassing in mind.

Really, you are proposing that one specific kind of override be made feasible, either with subclassing or by injecting a method.

That said, I'm not saying that several kinds of behavior shouldn't be supported. I just don't see why dict should support them all (or multidict). And I also think dict will support them poorly. multidict implements one behavior *well*, in a documented way, with a name people can refer to. I can say "multidict"; I can't say "a dict where I set default_factory to list" (well, I can say that, but that just opens up yet more questions and clarifications).

Some ways multidict differs from default_factory=list:

* __contains__ works (you have to use .get() with default_factory to get a meaningful result)

* Barring cases where exceptions are raised, x[key] and x.get(key) return the same value for multidict; with default_factory, one returns [] and the other returns None when the key isn't found. But if you do x[key]; x.get(key), then x.get(key) always returns [].
* You can't use __setitem__ to put non-list items into a multidict; with multidict you don't have to guard against non-sequence values.

* [] is meaningful not just as the default value, but as a null value; the multidict implementation respects both aspects.

* A specific method x.add(key, value) indicates intent in a way that x[key].append(value) does not.

* items and iteritems return values meaningful to the context (a list of (key, single_value) pairs -- this is usually what I want, and it avoids a nested for loop). __len__ is also usefully different than in dict.

* .update() handles iteritems sensibly, and updates from dictionaries sensibly -- if you mix a default_factory=list dict with a "normal" (single-value) dictionary, you'll get an effectively corrupted dictionary (where some values are lists and some aren't).

* x.getfirst(key) is useful.

* I think this will be much easier to reason about in situations with threads -- dict acts very predictably with threads, and people rely upon that.

* multidict can be written either with subclassing intended, or with an abstract superclass, so that other kinds of specializations of this superset of the dict interface can be made more easily (if DictMixin itself isn't already sufficient).

So, I'm saying: multidict handles one very common collection need that dict handles awkwardly now. multidict is a meaningful and useful class with its own identity/name and meaning separate from dict, and it has methods that represent both the intersection and the difference between the two classes. multidict does not in any way preclude other collection objects for other situations; it is entirely unfair to expect a new class to solve all issues. multidict suggests an interface that other related classes can use (e.g., an ordered version). multidict, unlike default_factory, is not just a recipe for creating a specific and commonly needed object; it is a class for creating it.

--
Ian Bicking / ianb@colorstudy.com / http://blog.ianbicking.org
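[Editor's note: Ian's first two bullets can be demonstrated concretely with collections.defaultdict, the class that was eventually added to the stdlib, used here as a stand-in for a dict with default_factory=list:]

```python
from collections import defaultdict

d = defaultdict(list)
assert "k" not in d            # __contains__ is honest at first...
assert d.get("k") is None      # ...and get() bypasses the factory
d["k"]                         # side effect: inserts [] into d
assert "k" in d                # a mere lookup changed the answer
assert d.get("k") == []        # ...and what get() returns
```

This is exactly the "x[key]; x.get(key) always returns []" behavior Ian describes; a multidict with no insert-on-lookup side effect would keep `in` and `get()` stable across lookups.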
On Friday 17 February 2006 14:51, Ian Bicking wrote:
This all seems to be a roundabout way to address one important use case of a dictionary with multiple values for each key,
I think there are use cases that do not involve multiple values per key. That is one place where this commonly comes up, but not the only one.
and in the process breaking an important quality of good Python code, that attribute and getitem access not have noticeable side effects.
I'm not sure that's quite as well-defined or agreed upon as you suggest.

-Fred

--
Fred L. Drake, Jr.   <fdrake at acm.org>
On Fri, Feb 17, 2006 at 03:03:06PM -0500, Fred L. Drake, Jr. wrote:
On Friday 17 February 2006 14:51, Ian Bicking wrote:
and in the process breaking an important quality of good Python code, that attribute and getitem access not have noticeable side effects.
I'm not sure that's quite as well-defined or agreed upon as you do.
Without the __getitem__ side effect, default objects that don't support any operators would have problems. d[key] += val works fine when the default is a list or int, but fails for dicts and presumably many user-defined objects. By assigning the default value in __getitem__, the returned value can be manipulated via its methods.

-Jack
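[Editor's note: Jack's point can be illustrated with the collections.defaultdict that was later added to the stdlib, which implements exactly the insert-on-__getitem__ behavior discussed here:]

```python
from collections import defaultdict

# Because __getitem__ inserts the default, the returned object can be
# mutated in place via its methods -- no augmented assignment needed.
d = defaultdict(dict)
d["config"]["host"] = "localhost"   # lookup inserted {} first
d["config"]["port"] = 8080
assert d["config"] == {"host": "localhost", "port": 8080}

# Without insertion, a returned-but-not-stored default would be lost:
plain = {}
temp = plain.get("config", {})
temp["host"] = "localhost"
assert "config" not in plain        # the mutation went nowhere
```

A plain dict offers no hook for this: `plain["config"]["host"] = ...` raises KeyError, and there is no operator a dict default could use the way `+=` rescues lists and ints.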
participants (33)
-
"Martin v. Löwis"
-
Aahz
-
Adam Olsen
-
Alex Martelli
-
Barry Warsaw
-
Bernhard Herzog
-
bokr@oz.net
-
CM
-
Fred L. Drake, Jr.
-
Fredrik Lundh
-
Fuzzyman
-
Gareth McCaughan
-
Georg Brandl
-
Greg Ewing
-
Guido van Rossum
-
Ian Bicking
-
Jack Diederich
-
James Y Knight
-
Josiah Carlson
-
Michael Hudson
-
Michael Urman
-
Nick Coghlan
-
Paul Moore
-
Phillip J. Eby
-
Pierre Barbier de Reuille
-
Raymond Hettinger
-
Raymond Hettinger
-
skip@pobox.com
-
Steve Holden
-
Terry Reedy
-
Thomas Heller
-
Thomas Wouters
-
Walter Dörwald