Re: [Python-ideas] set.add() return value

[Resend, hopefully bag.python.org is fixed now.] On Fri, Feb 13, 2009 at 5:24 PM, Steven D'Aprano <steve@pearwood.info> wrote:
This example also has a bug, which neither of the two posters responding caught (unless Bill J was being *very* subtle).
Regardless of the OP's misunderstanding, I might be +0 on changing .add() and similar methods to return True if the set was changed and False if it wasn't. I don't see a serious API incompatibility looming here (I'm assuming that the None it currently returns is discarded by almost all calling code rather than relied upon). It seems cheap enough to make this change; the internal implementation has this information available (e.g. look at set_insert_key()).

Before I go to +1, though, I would have to see enough examples in real code that could benefit from this small optimization. There's also the nagging concern that once we do this for set operations, people might ask to do it for dict operations too, and then what's next. Plus, the abstract base classes would have to be adjusted, and existing implementations thereof would become invalid as a result. IOW the slippery-slope argument. So maybe +0 is too strong still; I'm wavering somewhere between -0 and +0 ATM.

PS Note that this is not the same case as the list.sort() method -- the latter's return value would be redundant (and IMO misleading), while the return value for .add() provides new information. My position on that use case is not wavering. :-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
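
For concreteness, the semantics being proposed can be mimicked with a hypothetical subclass (ReportingSet is an invented name; the proposal itself would build the return value into set.add() and would not need the double lookup this wrapper does):

    class ReportingSet(set):
        # Hypothetical: add() returns True only if the element was actually inserted.
        def add(self, element):
            if element in self:
                return False
            set.add(self, element)
            return True

    seen = ReportingSet()
    for item in ["a", "b", "a"]:
        if seen.add(item):      # True the first time each item shows up
            print(item)         # prints "a" then "b"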

Guido van Rossum wrote:
I think there could be some theoretical justification to do it for sets at least. The pattern of "if something isn't already in some set, then add it and do some further processing" turns up fairly frequently in various algorithms. With dicts it's a bit different -- usually you're looking in the dict first to get something out, and if you don't find it, you then do something to manufacture a value and put it in. This is covered by setdefault() or whatever the modern replacement for it is. -- Greg
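
A small illustration of the dict pattern Greg describes, using a toy grouping task (the names here are invented for the example); the first loop spells out the miss handling, the second uses setdefault():

    groups = {}
    for word in ["apple", "avocado", "banana"]:
        key = word[0]
        if key not in groups:
            groups[key] = []                # manufacture a value on a miss
        groups[key].append(word)

    # The same thing with setdefault(): look up or insert in one call.
    groups = {}
    for word in ["apple", "avocado", "banana"]:
        groups.setdefault(word[0], []).append(word)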

On Tue, 17 Feb 2009 04:27:02 pm Raymond Hettinger wrote:
I'm with Raymond on this. I don't think it adds anything. It's not that saving one line is unimportant; sometimes saving one line is worth it. But there has to be a significant mental cost to that line to make it worthwhile, and I don't think that applies to any variation of:

    if x in set:
        print "already there"
    else:
        set.add(x)

versus:

    if set.add(x):  # or another method name if you prefer
        print "already there"

(BTW, the two above have slightly different semantics. The first clearly and obviously only performs the add if x is not in the set; the second is unclear about whether the add is attempted first or not, and an implementation could choose either.)

Contrast Guido's description of why he likes "yield from A" compared to "for x in A: yield x":

[quote] Your code-reading brain has to find the correspondence between the loop control variable of the for-loop header and the variable yielded in the loop body, see that nothing else is going on, and *then* it can draw the conclusion about the recursive generator. [end quote]

As far as I can see, there is no similar significant mental shortcut added by giving set.add() a return value.

--
Steven D'Aprano
Operations Manager
Cybersource Pty Ltd, ABN 13 053 904 082
Level 1, 130-132 Stawell St, Richmond VIC 3121
Tel: +61 3 9428 6922   Fax: +61 3 9428 6944
Web: http://www.cybersource.com.au

On Tue, Feb 17, 2009 at 2:36 AM, Steven D'Aprano <steve@pearwood.info> wrote:
These are truly different in terms of runtime performance, though. In the first example, let's examine both cases. If x isn't in the set, we hash x, perform a worst-case set lookup (because we have to scan the entire list for whatever bucket x hashes to), hash x again, and then perform that same lookup again before adding x to the set. If x is in the set, we perform one hash and one lookup. In the second example, we always hash and look up x exactly once, adding it if necessary.

This could turn out to be a non-trivial difference in runtime if the set is large enough or x is an object with a non-trivial hash. Sure, the difference is a constant factor, but it's a significant one, somewhere between 1.25x and 2x depending on the percentage of "add"s that are unique items (if every item is unique, we're looking at a 2x speedup for set operations). But maybe that's just my premature optimization coming through,

Brandon
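
One rough way to put numbers on this claim (a sketch only: it compares an explicit pre-test against an unconditional add(), which is merely a proxy for the proposed combined operation, and the ratio will vary with machine and data):

    import timeit

    data = list(range(100000)) * 2          # every element occurs twice

    def test_then_add():
        s = set()
        for x in data:
            if x not in s:                  # first hash/lookup
                s.add(x)                    # second hash/lookup on a miss

    def add_unconditionally():
        s = set()
        for x in data:
            s.add(x)                        # single hash/lookup, always

    print(timeit.timeit(test_then_add, number=20))
    print(timeit.timeit(add_unconditionally, number=20))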

On Tue, Feb 17, 2009 at 12:37 AM, Brandon Mintern <bmintern@gmail.com> wrote:
Another way to optimize this without changing the API would be to cache the key, hash, and result of the last lookup attempted. Then the second lookup would be pretty much free. It would usually only take a single pointer comparison to decide whether the cache is valid. Oh, and an INCREF/DECREF pair somewhere, though I suspect there's already such a pair and we'd simply be delaying the DECREF.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
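
At the Python level, the shape of that idea looks roughly like the wrapper below (only a loose illustration: the actual suggestion is an internal CPython change that would also cache the hash and cover removal; CachingSet is an invented name):

    class CachingSet(set):
        # Remembers the most recent membership test (key identity + result).
        def __init__(self, *args):
            set.__init__(self, *args)
            self._last_key = object()       # sentinel no real key is identical to
            self._last_result = False

        def __contains__(self, key):
            if key is self._last_key:       # single pointer comparison
                return self._last_result
            result = set.__contains__(self, key)
            self._last_key, self._last_result = key, result
            return result

        def add(self, key):
            if key is self._last_key:       # keep the cache honest after an insert
                self._last_result = True
            set.add(self, key)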

Steven D'Aprano wrote:
There could be cases where the execution speed difference is important, especially if testing whether the element is in the set is expensive. Back when I was developing Plex, I found that the biggest bottleneck in building a scanner object was the NFA to DFA conversion. That algorithm consists almost entirely of testing whether things are in sets and adding them if they're not, and the set elements themselves are other sets, so they're not cheap to do membership tests on. If I'd had an atomic test-and-add operation for sets, I could imagine that part running about twice as fast as it did. -- Greg
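
The loop in question has roughly this shape (a heavily simplified sketch of subset construction; move() is a hypothetical transition function, and the state sets are frozensets so they can be stored inside another set):

    def subset_construction(start_states, alphabet, move):
        start = frozenset(start_states)
        seen = set([start])
        worklist = [start]
        while worklist:
            state = worklist.pop()
            for symbol in alphabet:
                target = frozenset(move(state, symbol))
                if target not in seen:      # the membership test...
                    seen.add(target)        # ...immediately followed by the add
                    worklist.append(target)
        return seen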

[Greg Ewing]
Imagining and timing are two different things. See Objects/dictnotes.txt for the results of a month-long effort to study ways to improve hash table performance. One result was that two successive accesses for the same key in a large table did *not* double the time. Due to cache effects, the second access was nearly free.

Also, I think this performance discussion is overfocusing on the set lookup operation, as if it were the dominant part of typical programs. Remember that every time there is dotted access, there is a hash table lookup. Just writing s.add(x) results in a lookup for "add" as well as the lookup/insertion of x. The performance argument is a red herring.

Raymond

P.S. The lookup time does nearly double if there is an expensive hash function. But then, it would make sense to cache the result of that hash just like we do with strings.

What about:

    old_len = len(s)
    s.add(elem)
    if len(s) != old_len:
        ...

--
Marcin Kowalczyk
qrczak@knm.org.pl
http://qrnik.knm.org.pl/~qrczak/
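
Dropped into the test-and-add loop being discussed, Marcin's trick reads something like this (sketch):

    seen = set()
    for item in ["a", "b", "a"]:
        old_len = len(seen)
        seen.add(item)
        if len(seen) != old_len:    # item was new
            print(item)             # prints "a" then "b"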

Raymond Hettinger wrote:
One result was that two successive accesses for the same key in a large table did *not* double the time.
What kind of keys, though? In my case the keys aren't strings, they're tuples. Even after you've found the right slot in the hash table, you still need to compare the tuples.
Remember, that every time there is dotted access, there is a hash table lookup.
For a string, which is probably interned. As far as I'm aware, there is no notion of interning for tuples. -- Greg

Even if the hash has to be recomputed, processor caching makes the second pass very cheap compared to the first. I spent a month trying to optimize dicts and learned that this is a dead end. It is certainly *not* worth mucking up the API for sets. Even if it did improve your one app, that would be atypical. Most apps that test and add will do something inside the branch that consumes far more time than the contains test.

Raymond

The main argument against set.add() having a return value is that someone reading the code has to remember or guess the meaning of the return value for s.add(x). Possible interpretations:

* Always None           # This is what we have now throughout the language
* Always self           # What is done in Ruby, Objective C, and Smalltalk to
                        # support method chaining. We also do this in some of
                        # our own APIs like Tkinter.
* True if x existed
* True if x was added   # Opposite meaning from the previous interpretation
* True if x was mutated # Opposite from if-existed and same as was-added

Each of those interpretations is reasonable, so we should recognize that "explicit is better than implicit" and "in the face of ambiguity, refuse the temptation to guess". The situation with set.discard() is similar, but the last three cases are no longer opposites (if-existed is the same as was-discarded and was-mutated).

I looked for examples in real-world code and found that the test/set pattern is not as uniform as posited. For example, collections.namedtuple() has a typical use case (special handling for duplicate values) that is not helped by the proposal:

    seen_names = set()
    for name in field_names:
        if name.startswith('_') and not rename:
            raise ValueError('Field names cannot start with an underscore: %r' % name)
        if name in seen_names:
            raise ValueError('Encountered duplicate field name: %r' % name)
        seen_names.add(name)

The performance argument is specious. Greg's expected twofold speed-up is based on speculation, not on objective timings of an actual test/set implementation. To achieve a twofold speed-up, all of the following would have to be true:

* Every s.add() is True. Otherwise, there is no duplicate step to be saved.
* There are zero hardware cache effects on the table search and the hash value computation. But in real life, the cost of doing something twice can be twentyfold less the second time. According to Intel, the cost of a memory access is half a cycle, but if there is a cache miss, it is as expensive as a floating-point divide.
* The inner loop does no other work. This is unusual. Typically, the if-seen path does some useful work, so the test/set operation does not dominate.

I spent a couple of man-months writing and optimizing the set module. If the test/set pairing had proven viable, I would have included a separate method for it. My research on hash tables (documented in Objects/dictnotes.txt) showed that real-world performance benefits were elusive. The hardware cache has already done much of our work for us. We should take that freebie. Take it from mr-i-want-to-optimize-everything, this proposal won't pay off in terms of meaningful speedups in real code.

Currently, when I teach sets, I take pleasure in the near-zero learning curve and absence of special cases. Set code becomes a prime example of "readability counts" and "simple is better than complex." More importantly, I think there is value in API simplicity for lists, sets, and dicts, our basic tools. The ABCs for sets currently reflect that simplicity, and it would be sad if that started to get lost in order to save one line here or there.

Summary:
-1 on having set.add() get a new return value
-0 on having a new, explicitly named method for an atomic test/add

okay-i-will-shut-up-now-ly yours,

Raymond

Raymond Hettinger <python@...> writes:
Sorry, that argument could be turned into an argument against *any* optimization. For example, "there's no need to optimize building dict literals, since most apps will do something with the dict that consumes far more time than building the dict". The assumption being that only optimizations which benefit a large class of applications should be attempted, which is bogus (are large classes of applications bottlenecked by dict literal creation? yet we have specialized opcodes for dict literal creation, and so on for lots of other things).

On Wed, Feb 18, 2009 at 4:58 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Well, the long form would be something like the benefits for a small class of apps vs. the effort in implementing it plus the cost for other apps. While the implementation cost of this feature is small (which makes it attractive to some), I think that the cost of developers guessing the returned value wrong is prohibitive.
Here you make a logic error. The presence of an opcode for anything in Python does not mean that it is considered to be worth optimizing -- it just means it is a primitive operation in the language, with syntax associated with it. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

Terry Reedy wrote:
Greg Ewing wrote [concerning NFA to DFA]: But why not just unconditionally add them?
Because you also need to know whether it was already in the set, so you can terminate the recursion.
Marcin just posted a nice solution for proceeding differently if really necessary.
Yes, that's a neat trick! It does obfuscate the code a little bit, though. -- Greg

Steven D'Aprano wrote:
I'm with Raymond on this.
Me too. Some languages return self from mutating methods (allowing chaining). Python's builtins and stdlib consistently (afaik) return None (preventing chaining). Returning True or False depending on whether a mutation was done seems like an innovation. It is an intriguing idea, but doing so just occasionally would add mental burden. Why not then change list.extend() as well? (False if the extension is empty.) Or list.sort()? (False if the list is already sorted.) list.append(), of course, would always return True.

Of course, in Python, having *any* consistent return other than None is difficult because much mutation is done through special methods wrapped in syntax.

tjr

On Tue, Feb 17, 2009 at 10:39 AM, Terry Reedy <tjreedy@udel.edu> wrote:
Returning True or False depending on whether a mutation was done seems like an innovation.
Not really; all Java containers and STL sets and maps do it.
The mental burden is no bigger than, say, learning that you append() to a list but add() to a set. Consistency for consistency's sake is IMHO misguided here; there's only so much you can do by treating all containers uniformly. Unlike sequences, the motivation for sets (and dicts, for that matter) comes from real use cases, so we shouldn't dismiss this request with an all-or-nothing argument.

George

Guido van Rossum <guido@python.org> wrote:
This example also has a bug, which neither of the two posters responding caught (unless Bill J was being *very* subtle).
Sorry, I should have been more brutal. To my mind, the fact that Steve got it wrong was a nice illustration of how much extra mental effort needed to be expended because the feature Ralf suggests wasn't available. You have to write a test, the test has to include an inversion, and you have to introduce a new variable to hold the result of the test. That's something like three function points instead of one.

Functionality built into the standard library is much better for everyone than functionality a programmer has to generate himself, mainly because of the extra review it gets. I think that's the real payoff for "batteries included".

Bill

Bill Janssen wrote:
I'm not excusing my sloppiness, but I think that my mistake is an argument against the proposal that add (or, in Greg's case, test_and_add) should return True if the object was *not* added (because it was already there). In other words, despite Greg's suggested name:

    was_it_there = myset.test_and_add(42)

I was confused by my intuition that the return flag should mean the object was added, and then made the mental translation:

    test_and_add() returns True => 42 was added to the set

became:

    42 in set returns True => 42 is to be added to the set

which is of course the opposite of the proposal's intention, but an easy mistake to make in the brief time I spent writing the email. I believe that some people will expect to use it like this:

    if set.test_and_add(42):
        print "42 was not added because it was already there."

and others like this:

    if set.test_and_add(42):
        print "42 was added because it wasn't there."

and yet others (I expect the majority, including me) will flip from one to the other and consequently find that 50% of the time they get it wrong. I think that test_and_add is naturally two operations, not one, and forcing it to be one operation will just lead to confusion.

--
Steven

On Tue, Feb 17, 2009 at 3:01 PM, Steven D'Aprano <steve@pearwood.info>wrote:
I think that test_and_add is naturally two operations, not one, and forcing it to be one operation will just lead to confusion.
There is one scenario where testing membership and adding a new member *is* a single operation: multiple threads. That's a good reason *not* to support a test/change operation on regular sets, as it may lead people to think (falsely) that it provides some kind of concurrency protection.

--- Bruce
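
If genuine concurrency protection is the goal, the test and the add have to be made atomic explicitly, for example with a lock around both steps (a sketch; LockedSet and add_if_missing are invented names):

    import threading

    class LockedSet(object):
        def __init__(self):
            self._items = set()
            self._lock = threading.Lock()

        def add_if_missing(self, element):
            # Atomically test and add; True means the element was not there before.
            with self._lock:
                if element in self._items:
                    return False
                self._items.add(element)
                return True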

On Tue, Feb 17, 2009 at 6:58 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
That may depend on whether the hash or equality test is implemented in Python. Also, I've heard of lock-free thread-safe dict implementations in Java where it may not be so easy to decide whether you actually inserted something or not -- and it may or may not be there when you look for it again.

All in all, I am now siding with Raymond -- let's leave well enough alone. Hopefully we can now retire this thread; I don't think there's much more that can be said that hasn't been said already.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

FWIW, a few thoughts anyway:

- I don't actually care about performance**. It was a mistake on my part to try to abuse the performance argument for what I was really after: elegance (defined as power/volume while still looking simple).

- I always assumed it must be an oversight that set.add() doesn't return a trivial piece of information that it has. Somebody has made the arbitrary decision to withhold this information from me, like you withhold certain information from a child. It feels very wrong. I'm grown up and wish to take responsibility myself. "If I don't like it, you cannot use it" is forceful; "If you don't like it, just don't use it" is a much better tone between partners.

- The purity argument was used a lot. In my opinion, set would be pure if it didn't arbitrarily withhold information from me that would allow me to make my algorithms more concise. Or as Einstein put it, "Make everything as simple as possible, but not simpler." set.add() returning None is failing the "but" part.

** 1. If performance really matters I will go to C++ anyway.
   2. These days, I can simply put "if (i % chunk_n != chunk_i): continue" into some loop and throw the calculation at a few hundred (chunk_n) CPUs.

On Thu, Feb 19, 2009 at 1:01 AM, Ralf W. Grosse-Kunstleve <rwgk@yahoo.com> wrote:
I'm sorry, but your choice of words in the last two bullets feels very condescending to me, and despite my vow to end the thread I cannot let the last words spoken be the words of a sore loser.

Your position completely ignores the argument about people getting confused about what the return value would mean: "it was added" vs. "it was already there". It also seems to counter the elegance argument: if we cram as much information into every API as we can, our APIs will certainly feel more crowded and less elegant. Why not have list.append() return the new length of the list? Etc., etc.

The overarching design principle in Python is to encourage the most maintainable code. This usually means encouraging concise code, but not so concise as to be hard to understand (or easy to misread) for the next person maintaining it.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

participants (12)
- Antoine Pitrou
- Bill Janssen
- Brandon Mintern
- Bruce Leban
- George Sakkis
- Greg Ewing
- Guido van Rossum
- Marcin 'Qrczak' Kowalczyk
- Ralf W. Grosse-Kunstleve
- Raymond Hettinger
- Steven D'Aprano
- Terry Reedy