Re: [Python-ideas] dict.setdefault_call(), or API variations thereupon

I had actually managed to miss collections.defaultdict! I'd like to instead propose that a reference to that be added to the dict.setdefault docs. I can't imagine I'm the only one that has missed this. Date: Fri, 2 Nov 2018 12:12:45 +1100

On Thu, Nov 1, 2018 at 7:23 PM Robert Vanden Eynde <robertve92@gmail.com> wrote:
Well, defaultdict configures a default when an instance is created, while setdefault() is used when inserting a value. A major issue IMO with defaultdict is that if you try to *read* a non-existing key it will be inserted. -- --Guido van Rossum (python.org/~guido)

On Thu, Nov 01, 2018 at 08:58:28PM -0600, Alex Shafer wrote:
So it actually sounds like having a dict method for performing write operations with a factory function would be a semantic improvement.
As Chris pointed out, that's what __missing__ does. py> class MyDict(dict): ... def __missing__(self, key): ... return "something" ... py> d = MyDict(a=1, b=2) py> d['z'] 'something' py> d {'a': 1, 'b': 2} If you want the key to be inserted, do so in the __missing__ method. Is there something missing (pun not intended) from this existing functionality? The only improvement I'd like to see is to remove the need to subclass, so we could do this: py> d = {'a': 1} # Plain ol' regular dict, not a subclass. py> d.__missing__ = lambda self, key: "something" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'dict' object has no attribute '__missing__' but as you can see, that doesn't work. We'd need to either give every dict a full __dict__ instance namespace, or a __missing__ slot. Given how rare it is to use __missing__ I suspect the cost is not worth it. The bottom line is, if I understand your proposal, the functionality already exists. All you need do is subclass dict and give it a __missing__ method which does what you want. -- Steve

On Thu, Nov 1, 2018 at 8:34 PM, Steven D'Aprano <steve@pearwood.info> wrote:
or subclass dict and give it a "setdefault_call") method :-) But as I think Guido wasa pointing out, the real difference here is that DefaultDict, or any other subclass, is specifying what the default callable is for the entire dict, rather than at time of use. Personally, I'm pretty sure I"ve only used one default for any given dict, but I can imaige the are use cases for having different defaults for the same dict depending on context. As for the OP's justification: """ If it's not clear, the purpose is to eliminate the overhead of creating an empty list or similar in situations like this: d = {} for i in range(1000000): # some large loop l = d.setdefault(somekey, []) l.append(somevalue) # instead... for i in range(1000000): l = d.setdefault_call(somekey, list) l.append(somevalue) """ I presume the point is that in the first case, somekey might be often the same, and setdefault requires creating an actual empty list even if the key is alredy there. whereas case 2 will only create the empty list if the key is not there. doing some timing with defaultdict: In [19]: def setdefault(): ...: d = {} ...: somekey = 5 ...: for i in range(1000000): # some large loop ...: l = d.setdefault(somekey, []) ...: l.append(i) ...: return d In [20]: def default_dict(): ...: d = defaultdict(list) ...: somekey = 5 ...: for i in range(1000000): # some large loop ...: l = d[somekey] ...: l.append(i) ...: return d In [21]: % timeit setdefault() 185 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [22]: % timeit default_dict() 128 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) so yeah, it's a little more performant, and I suppose if you were using a more expensive constructor, it would make a lot more difference. But then, how much is it likely to matter in a real use cases -- this was 1 million calls for one key and you got a 50% speed up -- is that common? So it seems this would give us slightly better performance than .setdefault() for the use cases where you are using more than one default for a given dict. BTW: +1 for a mention of defaultdict in the dict.setdefault docs -- you can't do everything with defaultdict that you can with setdefault, but it is a very common use case. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Fri, Nov 02, 2018 at 09:52:24AM -0700, Chris Barker wrote:
Well sure, if we're making up our own methods and calling them anything we like :-) The status quo (as I see it): dict.setdefault: - takes an explicit, but eagerly evaluated, default value; dict.__missing__: - requires subclassing to make it work; - passes the missing key to the method, so the method can decide what value to return; defaultdict: - takes a zero-argument factory function which is unconditionally called when the key is missing. Did I miss any? What we don't have is a version of setdefault where the default is evaluated only on need. That would be a great use-case for Call-By-Name semantics and thunks, if Python supported such :-) (That's just a half-baked thought, not a concrete proposal.)
As you show below, a default callable for the dict is precisely the use-case the OP gives: l = d.setdefault_call(somekey, list) would be equivalent to defaultdict(list) and l = d[somekey]. (I think. Have I missed something?) Nevertheless, Guido's point is reasonable -- if it comes up in practice often enough to care. [...]
Are we sure that the overhead is significantly more than the cost of the name lookup of "list" and the expense of calling it? You do demonstrate a speed difference with defaultdict (thanks for doing the timing tests) but the situation isn't precisely comparable to the proposed method, since you aren't looking up the name "list" each time through the outer loop. Could construction of the empty list be optimized more? That might reduce the benefit even further (at least for the given case, but not for the general case of an arbitrarily expensive default). We keep coming up against the issue of *eager evaluation* versus *delayed evaluation*, and I can't help feel that rather that solving this problem on an ad-hoc basis each time it comes up, maybe we really do need a way to tell the interpreter "delay evaluating this expression until needed". Then we could use it anywhere it was important, without having to create a plethora of special case setdefault_call() methods and the like. -- Steve

Could you explain what the difference is between defaultdicts "factory which is unconditionally called when the key is missing" and "the default is evaluated only on need"? To me it seems that "unconditionally called when the key is missing" is a complex way of saying "called only when needed". I must be missing some nuance here. / Anders

On Fri, Nov 2, 2018 at 5:25 PM Anders Hovmöller <boxed@killingar.net> wrote:
The distinction was the motivation for this thread: setdefault requires a constructed default instance as an argument, regardless of whether the key is missing, whereas defaultdict's factory is only called if necessary. If the key is present in a defaultdict, no default is constructed.

On Sat, Nov 03, 2018 at 01:15:04AM +0100, Anders Hovmöller wrote:
Consider the use-case where you want to pass a different default value to the dict each time: d.setdefault(key, expensive_function(1, 2, 3)) d.setdefault(key, expensive_function(4, 8, 16)) d.setdefault(key, expensive_function(10, 100, 1000)) The expensive function is eagerly evaluated each time you call setdefault, whether the result is needed or not. defaultdict won't help, because your factory function takes no arguments: there's no way to supply arguments for the factory. __missing__ won't help, because it only receives the key, not arbitrary arguments. We can of course subclass dict and give it a method with the semantics we want: d.my_setdefault(key, expensive_function, args=(1, 2, 3), kw={}) but it would be nicer and more expressive if we could tell the interpreter "don't evaluate expensive_function(...) unless you really need it". Other languages have this -- I believe it is called "Call By Need" or "Call By Name", depending on the precise details of how it works. I call it delayed evaluation, and Python already has it, but only in certain special syntactic forms: spam and <delayed expression> spam or <delayed expression> <delayed expression> if condition else <delayed expression> There are others: e.g. the body of functions, including lambda. But functions are kinda heavyweight to make and build and call. -- Steve

On Fri, Nov 2, 2018 at 7:49 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Consider the use-case where you want to pass a different default value to the dict each time:
exactly - the "default" is per call, not the same for the whole dict. though again, how common is this?
also -- aside from performance, if expensive_function() has side effects, you may really not want to call it when you don't need to (not that that would be well-designed code, but...) and of course, you can always simply do: if key in d: val = d[key] else: val = expensive_function(4, 8, 16) d[key] = val sure, it requires looking up the key twice, but doesn't call the function unnecessarily. So it's a pretty small subset of cases, where this would be needed. defaultdict won't help, because your factory function takes no
arguments: there's no way to supply arguments for the factory.
maybe that's a feature defaultdict should have? -CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Thu, Nov 1, 2018 at 7:23 PM Robert Vanden Eynde <robertve92@gmail.com> wrote:
Well, defaultdict configures a default when an instance is created, while setdefault() is used when inserting a value. A major issue IMO with defaultdict is that if you try to *read* a non-existing key it will be inserted. -- --Guido van Rossum (python.org/~guido)

On Thu, Nov 01, 2018 at 08:58:28PM -0600, Alex Shafer wrote:
So it actually sounds like having a dict method for performing write operations with a factory function would be a semantic improvement.
As Chris pointed out, that's what __missing__ does. py> class MyDict(dict): ... def __missing__(self, key): ... return "something" ... py> d = MyDict(a=1, b=2) py> d['z'] 'something' py> d {'a': 1, 'b': 2} If you want the key to be inserted, do so in the __missing__ method. Is there something missing (pun not intended) from this existing functionality? The only improvement I'd like to see is to remove the need to subclass, so we could do this: py> d = {'a': 1} # Plain ol' regular dict, not a subclass. py> d.__missing__ = lambda self, key: "something" Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'dict' object has no attribute '__missing__' but as you can see, that doesn't work. We'd need to either give every dict a full __dict__ instance namespace, or a __missing__ slot. Given how rare it is to use __missing__ I suspect the cost is not worth it. The bottom line is, if I understand your proposal, the functionality already exists. All you need do is subclass dict and give it a __missing__ method which does what you want. -- Steve

On Thu, Nov 1, 2018 at 8:34 PM, Steven D'Aprano <steve@pearwood.info> wrote:
or subclass dict and give it a "setdefault_call") method :-) But as I think Guido wasa pointing out, the real difference here is that DefaultDict, or any other subclass, is specifying what the default callable is for the entire dict, rather than at time of use. Personally, I'm pretty sure I"ve only used one default for any given dict, but I can imaige the are use cases for having different defaults for the same dict depending on context. As for the OP's justification: """ If it's not clear, the purpose is to eliminate the overhead of creating an empty list or similar in situations like this: d = {} for i in range(1000000): # some large loop l = d.setdefault(somekey, []) l.append(somevalue) # instead... for i in range(1000000): l = d.setdefault_call(somekey, list) l.append(somevalue) """ I presume the point is that in the first case, somekey might be often the same, and setdefault requires creating an actual empty list even if the key is alredy there. whereas case 2 will only create the empty list if the key is not there. doing some timing with defaultdict: In [19]: def setdefault(): ...: d = {} ...: somekey = 5 ...: for i in range(1000000): # some large loop ...: l = d.setdefault(somekey, []) ...: l.append(i) ...: return d In [20]: def default_dict(): ...: d = defaultdict(list) ...: somekey = 5 ...: for i in range(1000000): # some large loop ...: l = d[somekey] ...: l.append(i) ...: return d In [21]: % timeit setdefault() 185 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [22]: % timeit default_dict() 128 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) so yeah, it's a little more performant, and I suppose if you were using a more expensive constructor, it would make a lot more difference. But then, how much is it likely to matter in a real use cases -- this was 1 million calls for one key and you got a 50% speed up -- is that common? So it seems this would give us slightly better performance than .setdefault() for the use cases where you are using more than one default for a given dict. BTW: +1 for a mention of defaultdict in the dict.setdefault docs -- you can't do everything with defaultdict that you can with setdefault, but it is a very common use case. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Fri, Nov 02, 2018 at 09:52:24AM -0700, Chris Barker wrote:
Well sure, if we're making up our own methods and calling them anything we like :-) The status quo (as I see it): dict.setdefault: - takes an explicit, but eagerly evaluated, default value; dict.__missing__: - requires subclassing to make it work; - passes the missing key to the method, so the method can decide what value to return; defaultdict: - takes a zero-argument factory function which is unconditionally called when the key is missing. Did I miss any? What we don't have is a version of setdefault where the default is evaluated only on need. That would be a great use-case for Call-By-Name semantics and thunks, if Python supported such :-) (That's just a half-baked thought, not a concrete proposal.)
As you show below, a default callable for the dict is precisely the use-case the OP gives: l = d.setdefault_call(somekey, list) would be equivalent to defaultdict(list) and l = d[somekey]. (I think. Have I missed something?) Nevertheless, Guido's point is reasonable -- if it comes up in practice often enough to care. [...]
Are we sure that the overhead is significantly more than the cost of the name lookup of "list" and the expense of calling it? You do demonstrate a speed difference with defaultdict (thanks for doing the timing tests) but the situation isn't precisely comparable to the proposed method, since you aren't looking up the name "list" each time through the outer loop. Could construction of the empty list be optimized more? That might reduce the benefit even further (at least for the given case, but not for the general case of an arbitrarily expensive default). We keep coming up against the issue of *eager evaluation* versus *delayed evaluation*, and I can't help feel that rather that solving this problem on an ad-hoc basis each time it comes up, maybe we really do need a way to tell the interpreter "delay evaluating this expression until needed". Then we could use it anywhere it was important, without having to create a plethora of special case setdefault_call() methods and the like. -- Steve

Could you explain what the difference is between defaultdicts "factory which is unconditionally called when the key is missing" and "the default is evaluated only on need"? To me it seems that "unconditionally called when the key is missing" is a complex way of saying "called only when needed". I must be missing some nuance here. / Anders

On Fri, Nov 2, 2018 at 5:25 PM Anders Hovmöller <boxed@killingar.net> wrote:
The distinction was the motivation for this thread: setdefault requires a constructed default instance as an argument, regardless of whether the key is missing, whereas defaultdict's factory is only called if necessary. If the key is present in a defaultdict, no default is constructed.

On Sat, Nov 03, 2018 at 01:15:04AM +0100, Anders Hovmöller wrote:
Consider the use-case where you want to pass a different default value to the dict each time: d.setdefault(key, expensive_function(1, 2, 3)) d.setdefault(key, expensive_function(4, 8, 16)) d.setdefault(key, expensive_function(10, 100, 1000)) The expensive function is eagerly evaluated each time you call setdefault, whether the result is needed or not. defaultdict won't help, because your factory function takes no arguments: there's no way to supply arguments for the factory. __missing__ won't help, because it only receives the key, not arbitrary arguments. We can of course subclass dict and give it a method with the semantics we want: d.my_setdefault(key, expensive_function, args=(1, 2, 3), kw={}) but it would be nicer and more expressive if we could tell the interpreter "don't evaluate expensive_function(...) unless you really need it". Other languages have this -- I believe it is called "Call By Need" or "Call By Name", depending on the precise details of how it works. I call it delayed evaluation, and Python already has it, but only in certain special syntactic forms: spam and <delayed expression> spam or <delayed expression> <delayed expression> if condition else <delayed expression> There are others: e.g. the body of functions, including lambda. But functions are kinda heavyweight to make and build and call. -- Steve

On Fri, Nov 2, 2018 at 7:49 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Consider the use-case where you want to pass a different default value to the dict each time:
exactly - the "default" is per call, not the same for the whole dict. though again, how common is this?
also -- aside from performance, if expensive_function() has side effects, you may really not want to call it when you don't need to (not that that would be well-designed code, but...) and of course, you can always simply do: if key in d: val = d[key] else: val = expensive_function(4, 8, 16) d[key] = val sure, it requires looking up the key twice, but doesn't call the function unnecessarily. So it's a pretty small subset of cases, where this would be needed. defaultdict won't help, because your factory function takes no
arguments: there's no way to supply arguments for the factory.
maybe that's a feature defaultdict should have? -CHB
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
participants (7)
-
Alex Shafer
-
Anders Hovmöller
-
Chris Barker
-
Guido van Rossum
-
Michael Selik
-
Robert Vanden Eynde
-
Steven D'Aprano