[Python-ideas] set.add() return value

Raymond Hettinger python at rcn.com
Tue Feb 17 23:28:52 CET 2009


The main argument against set.add() having a return value is that  
someone reading the code has to remember or guess the meaning
of the return value for s.add(x).  Possible interpretations:

  * Always None           # This is what we have now throughout the language
  * Always self           # What Ruby, Objective-C, and Smalltalk do to
                          #   support method chaining.  We also do this in
                          #   some of our own APIs, like Tkinter.
  * True if x existed
  * True if x was added   # Opposite meaning from the previous interpretation
  * True if x was mutated # Opposite of if-existed and same as was-added

Each of those interpretations is reasonable, so we should recognize
that "explicit is better than implicit" and "in the face of ambiguity, refuse the
temptation to guess".
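To see the ambiguity concretely, here is a sketch of a set whose add() returns
a boolean.  The class name and the chosen meaning of the return value are
invented for illustration, not part of any proposal:

    import builtins

    class ReportingSet(set):
        """Hypothetical set whose add() returns a boolean (for illustration)."""
        def add(self, x):
            was_absent = x not in self     # one of several plausible meanings
            super().add(x)
            return was_absent

    s = ReportingSet()
    print(s.add('a'))   # True -- but is that "was added" or "already existed"?
    print(s.add('a'))   # False

A reader seeing "if s.add(x):" in unfamiliar code cannot tell which of the
interpretations above was intended without checking the documentation.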

The situation with set.discard() is similar, but the last three cases are no longer
opposites (if-existed is the same as was-discarded and was-mutated).
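For discard(), the three boolean readings collapse into one, which a similarly
hypothetical sketch makes plain (again, the return value is an assumption for
illustration only):

    class ReportingSet(set):
        """Hypothetical set whose discard() returns a boolean (for illustration)."""
        def discard(self, x):
            existed = x in self     # if-existed
            super().discard(x)
            return existed          # == was-discarded == was-mutated

    s = ReportingSet({1, 2})
    print(s.discard(1))   # True: it existed, was discarded, and the set mutated
    print(s.discard(1))   # False: all three readings agree again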

I looked for examples in real-world code and found that the test/set pattern
is not as uniform as posited.  For example, collections.namedtuple() has a 
typical use case (special handling for duplicate values) that is not helped by 
the proposal:

    seen_names = set()
    for name in field_names:
        if name.startswith('_') and not rename:
            raise ValueError('Field names cannot start with an underscore: %r' % name)
        if name in seen_names:
            raise ValueError('Encountered duplicate field name: %r' % name)
        seen_names.add(name)

The performance argument is specious.  Greg's expected twofold speed-up
is based on speculation, not on objective timings of an actual test/set implementation.
To achieve a twofold speed-up, all of the following would have to be true:

   * Every s.add() returns True.  Otherwise, there is no duplicate step to be saved.
   * There are zero hardware cache effects on the table search and the hash value
      computation.  But in real life, the cost of doing something twice can be
      twentyfold less the second time.  According to Intel, the cost of a memory
      access is a half-cycle, but if there is a cache miss, it is as expensive as a
      floating point divide.
   * The inner loop does no other work.  This is unusual.  Typically, the if-seen
      path does some useful work.  Hence the test/set operation does not dominate.
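Anyone who wants to check this for themselves can time the current idiom
against a simulated atomic test/add.  The len()-before-and-after trick below
stands in for the proposed return value; the function names are invented:

    import timeit

    def dedupe_test_then_add(items):
        """Current idiom: separate membership test, then add."""
        seen = set()
        out = []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)          # the "other work" in the loop body
        return out

    def dedupe_len_trick(items):
        """Simulated atomic test/add: compare len() before and after."""
        seen = set()
        out = []
        for x in items:
            n = len(seen)
            seen.add(x)
            if len(seen) != n:
                out.append(x)
        return out

    data = list(range(1000)) * 2       # half the adds are duplicates
    for fn in (dedupe_test_then_add, dedupe_len_trick):
        t = timeit.timeit(lambda: fn(data), number=200)
        print(fn.__name__, round(t, 3))

On a loop like this, with real work in the body and a mix of hits and misses,
the gap between the two versions is nowhere near twofold.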

I spent a couple of man-months writing and optimizing the set module.  If the
test/set pairing had proven viable, I would have included a separate method
for it.  My research on hash tables (documented in Objects/dictnotes.txt)
showed that real-world performance benefits were elusive.  The hardware
cache has already done much of our work for us.  We should take that freebie.
Take it from mr-i-want-to-optimize-everything, this proposal won't pay off
in terms of meaningful speedups in real code.

Currently, when I teach sets, I take pleasure in the near-zero learning curve
and absence of special cases.  Set code becomes a prime example of
"readability counts" and "simple is better than complex."  More importantly,
I think there is value in API simplicity for lists, sets, and dicts, our basic tools.
The ABCs for sets currently reflect that simplicity and it would be sad if that
started to get lost in order to save one line here or there.

Summary:
   -1 on having set.add() get a new return value
   -0 on having a new, explicitly named method for an atomic test/add
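If such an explicitly named method were ever wanted, it could look like the
sketch below.  The name add_unique and its True-if-added meaning are
assumptions chosen for this example, not a real or proposed API:

    class TestAddSet(set):
        """Sketch of a set with an explicitly named atomic test/add."""
        def add_unique(self, x):
            """Add x to the set; return True only if x was not already present."""
            if x in self:
                return False
            self.add(x)
            return True

    seen = TestAddSet()
    print(seen.add_unique('spam'))   # True
    print(seen.add_unique('spam'))   # False

The explicit name removes the guessing problem entirely: the meaning of the
return value is spelled out at every call site.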

okay-i-will-shut-up-now-ly yours,


Raymond


