
Has this come up before?

Python 2.6.1 behavior: s.add(1) returns None, whether or not 1 was already in the set.

Desired behavior: s.add(1) returns True if 1 was actually added and False if it was already present.
Motivation: Instead of

    if (1 not in s):  # O(N log N) lookup
      s.add(1)        # O(N log N) lookup again
      do_something_else()

or

    prev_len = len(s)
    s.add(1)  # O(N log N) lookup
    if (len(s) != prev_len):
      do_something_else()

one could write

    if (s.add(1)):
      do_something_else()

which would be as fast as the second form and the most concise of all alternatives.
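For concreteness, a minimal pure-Python sketch of the requested semantics (ReportingSet is just an illustrative name, and the extra membership test inside add() is exactly the overhead the proposal hopes to avoid):

    class ReportingSet(set):
      """A set whose add() reports whether the element was actually new."""
      def add(self, element):
        was_new = element not in self
        set.add(self, element)
        return was_new

    s = ReportingSet()
    print s.add(1)   # True: 1 was not in the set before
    print s.add(1)   # False: 1 was already present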

[Ralf W. Grosse-Kunstleve]
Three thoughts:
* insertion and lookup times are O(1), not O(n log n).
* because of caching the second lookup is very cheap.
* the bdfl frowns on mutating methods returning anything at all
if (s.add(1)): do_something_else()
I would find this to be useful but don't find it to be a significant improvement over the original. Also, I find it to be a bit tricky. Currently, sets have a nearly zero learning curve and are not imbued with non-obvious behaviors. The proposed form requires that the reader know about the special boolean found/notfound behavior. Also, since some people do want mutating methods to return a copy of the collection (i.e. s.add(1).add(2).add(3)), those folks will find your suggestion to be counter-intuitive. Raymond

* insertion and lookup times are O(1), not O(n log n).
Sorry, I was thinking the .add() is happening inside a loop with N iterations, with N also being the eventual size of the set. Is O(N log N) correct then? http://en.wikipedia.org/wiki/Big_O_notation says: O(log N) example: finding an item in a sorted array with a binary search. Isn't that what set is doing?

On Thu, Feb 12, 2009 at 2:59 PM, Ralf W. Grosse-Kunstleve <rwgk@yahoo.com> wrote:
Python's set uses an unsorted hash table internally and is thus O(1), not O(N). Cheers, Chris -- Follow the path of the Iguana... http://rebertia.com

This is at odds with the results of a simple experiment. Please try the attached script. On my machine I get these results:

    lookup repeats: 1000000
    timing set lookup
      N: 10000
      overhead: 1.16
      lookup contained: 0.07
      lookup missing: 0.12
    timing set lookup
      N: 10000000
      overhead: 1.13
      lookup contained: 0.42
      lookup missing: 0.34
    timing dict lookup
      N: 10000
      overhead: 1.16
      lookup contained: 0.06
      lookup missing: 0.11
    timing dict lookup
      N: 10000000
      overhead: 1.12
      lookup contained: 0.44
      lookup missing: 0.44

N is the size of a set or dict. The number of lookups of random elements is the same in both cases. That's why the "overhead" is constant. Of course, the memory cache does have an influence here, but your claim is also at odds with anything I could find in the literature. A dictionary or set has to do some kind of search to find a key or element. Dictionary and set lookups have O(log N) runtime complexity, where N is the size of the dict or set at the time of the lookup.

To come back to my original suggestion, Raymond's reply...
* because of caching the second lookup is very cheap.
makes me think that this...
is basically as fast as I was hoping to get from:
if (s.add(1)): do_something_else()
So I guess that kills my runtime argument.
* the bdfl frowns on mutating methods returning anything at all
Shock! What about dict.setdefault()?

Honestly, I found it very odd from the start that set.add() doesn't tell me what it did.

    import random
    import time
    import os
    import sys

    repeats = 1000000
    print "lookup repeats:", repeats
    for now_with_dict in [False, True]:
      random.seed(0)
      for N in [10000, 10000000]:
        if (not now_with_dict):
          print "timing set lookup"
          s = set(xrange(N))
        else:
          print "timing dict lookup"
          s = {}
          for i in xrange(N):
            s[i] = 0
        print "  N:", len(s)
        t0 = os.times()[0]
        for i in xrange(repeats):
          j = random.randrange(N)
          assert j < N
        oh = os.times()[0]-t0
        print "  overhead: %.2f" % oh
        t0 = os.times()[0]
        for i in xrange(repeats):
          j = random.randrange(N)
          assert j in s
        print "  lookup contained: %.2f" % (os.times()[0]-t0-oh)
        t0 = os.times()[0]
        for i in xrange(repeats):
          j = N + random.randrange(N)
          assert j not in s
        print "  lookup missing: %.2f" % (os.times()[0]-t0-oh)

On Thu, Feb 12, 2009 at 8:36 PM, Ralf W. Grosse-Kunstleve <rwgk@yahoo.com> wrote:
The literature you found was assuming a binary tree implementation, which is O(log n). Python's dict and set use a hash table, which is O(1). The performance effect you found was because of the CPU's cache. Besides, you really should use more than 2 samples if you want to show an O(log n) performance curve. -- Adam Olsen, aka Rhamphoryncus

Ralf W. Grosse-Kunstleve wrote:
Python's set uses an unsorted hash table internally and is thus O(1), not O(N).
This is at odds with the results of a simple experiment.
Timing experiments are tricky to get right on multi-processing machines, which is virtually all PCs these days. For small code snippets, you are better off using the timeit module rather than re-inventing the wheel. That will have the very desirable outcome that others reading your code are dealing with a known and trusted component, rather than having to work out how you are doing your timing.
How do you determine that something fits a log N curve from just two data points? There's an infinite number of curves that pass through two data points. It's true that the results you found aren't consistent with O(1), but as I understand it, Python dicts are O(1) amortized ("on average over the long term"). Sometimes dicts resize, which is not a constant-time operation, and sometimes the dict has to walk a short chain of colliding entries, which depends on the proportion of hashes that lead to collisions.

But more importantly, I don't think you're necessarily measuring what you think you're measuring. I see that you include a call to random.randrange(N) within the timing loop. I don't think there is any guarantee that randrange(N) will take the same amount of time for any N. I'm not sure if that is actually the cause of your results, but it is a potential issue. When timing, you should try to time the barest minimum of code.

This gives a quick demonstration of constant look-up time for sets:
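A sketch of the kind of snippet meant here, using timeit (the original demonstration was not preserved in the archive, so the sizes and layout below are illustrative only):

    import timeit

    # Time one membership hit and one miss against sets of growing size.
    # If lookup cost grew with N, the columns would grow too; roughly flat
    # columns are consistent with O(1) average-case behaviour.
    for exponent in range(3, 7):
      N = 10 ** exponent
      setup = "s = set(xrange(%d))" % N
      hit = min(timeit.repeat("%d in s" % (N // 2), setup, number=100000))
      miss = min(timeit.repeat("%d in s" % (2 * N), setup, number=100000))
      print "N = %8d   hit: %.4f   miss: %.4f" % (N, hit, miss)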
Regards, -- Steven

Thanks for the insight. I didn't know such a process is considered O(1). I agree it is fair because in practice it seems to work very well, but the "collisions" part can turn it into O(N), as demonstrated (in a crude way) by the script below. Therefore the O(1) classification is a bit misleading.

My other script was simplified. I did look at more data points. The curve is amazingly flat, but not a constant function.

It is frustrating that simple requests like wanting a useful True or False, that costs nothing, instead of a useless None, tend to digress like this ...

    import os

    class normal(object):
      def __init__(O, value): O.value = value
      def __hash__(O): return O.value.__hash__()

    class bad(object):
      def __init__(O, value): O.value = value
      def __hash__(O): return 0

    for N in [1000, 10000]:
      t0 = os.times()[0]
      s = set([normal(i) for i in xrange(N)])
      print os.times()[0]-t0
      assert len(s) == N
      t0 = os.times()[0]
      s = set([bad(i) for i in xrange(N)])
      print os.times()[0]-t0
      assert len(s) == N

On Thu, Feb 12, 2009 at 10:46 PM, Ralf W. Grosse-Kunstleve <rwgk@yahoo.com> wrote:
It may seem like just a simple feature request, but by not allowing it, we preserve the consistency, cleanness, and relative purity of Python, and I wouldn't trade these aspects of Python for anything, least of all for a small feature request that would only yield a very small gain if approved.

As The Zen says: "Special cases aren't special enough to break the rules." You're asking for the "special case" of a mutating method of 'set' to break the general "rule" that mutating methods normally return None; it's simply not special /enough/ to justify breaking this very handy (and imho, intuitive) rule of thumb; or at least that seems to be the consensus.

Cheers, Chris -- Follow the path of the Iguana... http://rebertia.com

Chris Rebert wrote:
As The Zen says: "Special cases aren't special enough to break the rules."
What might be more acceptable is to add a new method for this, with a name suggesting that it's more than just a plain mutating operation, e.g.

    was_it_there = myset.test_and_add(42)

-- Greg
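A rough sketch of the suggested semantics as a plain helper function, since no such set method exists (note that its True/False meaning is inverted relative to the add()-returns-True-when-new idea sketched near the top of the thread):

    def test_and_add(myset, element):
      """Return True if element was already in myset; add it if it was not."""
      was_it_there = element in myset
      if not was_it_there:
        myset.add(element)
      return was_it_there

    myset = set()
    print test_and_add(myset, 42)   # False: 42 wasn't there, so it was added
    print test_and_add(myset, 42)   # True: 42 is there now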

Greg Ewing wrote:
What's the use-case for this? What's wrong with doing this?

    if 42 in myset: myset.add(42)

Short, sweet, and doesn't require any new methods. The OP's original use-case was based on his misapprehension that key lookup in a set was O(N log N). I don't see any useful advantage to a new method, let alone a change in semantics to the existing method. -- Steven

Steven D'Aprano wrote:
Nothing, other than it feels a little bit wrong looking the value up twice when once would be enough. I'm not particularly worried about this, I was just pointing out that adding a new method would be less of a violation of the Zen than changing an existing one. -- Greg

Steven D'Aprano <steve@pearwood.info> wrote:
Well, for me, what's wrong with it is that it's complex to write and debug, mainly. Don't you mean to say,

    was_it_there = (42 in myset)
    if not was_it_there:
      myset.add(42)

for instance? Not that I'm pushing for the addition of this method, but I see the point. Bill

[Steven D'Aprano]
I think you meant: "if 42 not in myset".

Am -1 on the proposed change. AFAICT, it will never save more than one line of code. The current version is explicit and clear. Moreover, it does not require special knowledge to read. Currently, the set API has a zero learning curve and doesn't demand that you remember a rule particular to Python. In contrast, the proposed version requires that you know that Python has a special API and you need to remember that when you're reading someone else's code.

Give this snippet to five Python programmers (not in this thread) and see if they correctly deduce when f(x), f(y), f(z), and f(m) will be called. Also, see if they can correctly assert known post-conditions (after each if-statement and before each function call).

    if (not myset.add(x)):
      f(x)
    else:
      g(x)

    if (myset.discard(y)):
      f(y)
    else:
      g(y)

    if (myset.update(z)):
      f(z)
    else:
      g(z)

    if (myset):
      f(m)
    else:
      g(m)

As I mentioned in another post, this API is counterintuitive for anyone who is used to other languages where operations like set.add() always return self and let them write code like:

    myset.add(x).add(y).add(z)

Essentially, the argument against boils down to there not being an intuitively obvious correct interpretation of what the code does, while the existing approach is crystal clear. The set API currently has no rough edges. It is one of the cleanest APIs in the language and I hope it stays that way.

Raymond

On Mon, Feb 16, 2009 at 6:13 PM, Raymond Hettinger <python@rcn.com> wrote:
Point taken, but...
Which language has that?
Hey Raymond, relax. You don't have to take it personal. -- --Guido van Rossum (home page: http://www.python.org/~guido/)

I think there are several. Smalltalk comes to mind:

'''
addAll: aCollection
    Adds all the elements of 'aCollection' to the receiver, answer aCollection
'''

Looks like Objective C takes the same approach:

'''
Method: IndexedCollection
@keyword{-insertObject:} newObject @keyword{atIndex:} (unsigned)index
@deftypemethodx IndexedCollecting {} @keyword{-insertElement:} (elt)newElement @keyword{atIndex:} (unsigned)index
returns self.
'''

Also, Ruby uses the style of having mutating methods return self:

'''
arr.fill( anObject )                    -> arr
arr.fill( anObject, start [, length ] ) -> arr
arr.fill( anObject, aRange )            -> arr

hsh.update( anOtherHash ) -> hsh
    Adds the contents of anOtherHash to hsh, overwriting entries with
    duplicate keys with those from anOtherHash.
    h1 = { "a" => 100, "b" => 200 }
    h2 = { "b" => 254, "c" => 300 }
    h1.update(h2)  »  {"a"=>100, "b"=>254, "c"=>300}
'''

I haven't checked other languages but am sure I've seen this style show up in a number of places.

Raymond
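To make the contrast concrete, that return-self convention can be mimicked in Python with a thin wrapper (FluentSet is a made-up name); note that a single method cannot return self for chaining and a found/not-found boolean at the same time, which is part of the objection to the proposal:

    class FluentSet(set):
      """A set whose mutating methods return self so that calls can chain."""
      def add(self, element):
        set.add(self, element)
        return self
      def discard(self, element):
        set.discard(self, element)
        return self

    s = FluentSet().add(1).add(2).add(3)
    print sorted(s)   # [1, 2, 3]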

Steven D'Aprano <steve@...> writes:
They are not O(1) amortized, but O(1) best case, the worst case being O(N) (if all keys fall in the same hash bucket). The whole game with hash tables is to find a sufficiently smart hash function such that keys are equiprobably distributed amongst buckets and lookup cost is O(1) rather than O(N). Hash functions can be improved over time; a recent example can be found at http://bugs.python.org/issue5186. Regards, Antoine.

participants (10)
- Adam Olsen
- Antoine Pitrou
- Bill Janssen
- Chris Rebert
- Greg Ewing
- Guido van Rossum
- Leonardo Santagada
- Ralf W. Grosse-Kunstleve
- Raymond Hettinger
- Steven D'Aprano