[Python-ideas] set.add() return value

Guido van Rossum guido at python.org
Tue Feb 17 20:56:17 CET 2009

On Tue, Feb 17, 2009 at 12:37 AM, Brandon Mintern <bmintern at gmail.com> wrote:
> On Tue, Feb 17, 2009 at 2:36 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>> if x in set:
>>    print "already there"
>> else:
>>    set.add(x)
>> versus:
>> if set.add(x):  # or another method name if you prefer
>>    print "already there"
> These are truly different in terms of runtime performance, though. In
> the first example, let's examine both cases. If x isn't in the set, we
> hash x, perform a worst-case set lookup (because we have to scan the
> entire list for whatever bucket that x hashes to), hash x again, and
> then perform that same lookup again before adding x to the set. If x
> is in the set, we perform one hash and one lookup.
> In the second example, we always hash and lookup x exactly once,
> adding it if necessary. This could turn out to be a non-trivial
> difference in runtime if the set is large enough or x is an object
> with a non-trivial hash. Sure the difference is a constant factor, but
> it's a significant one, somewhere between 1.25-2 depending on the
> percentage of "add"s that are unique items (if every item is unique,
> we're looking at a 2x speedup for set operations).
> But maybe that's just my premature optimization coming through,

Another way to optimize this without changing the API would be to
cache the key, hash and result of the last lookup attempted. Then the
second lookup would be pretty much free. It would usually only take a
single pointer comparison to decide whether the cache is valid. Oh,
and an INCREF/DECREF pair somewhere, though I suspect there's already
such a pair and we'd simply delaying the DECREF.

--Guido van Rossum (home page: http://www.python.org/~guido/)

More information about the Python-ideas mailing list