exclusively1, common, exclusively2 = set1 - set2, set1 & set2, set2 - set1

I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out 1. What passed first time round 2. What passed both times. 3. What passed only the second time round. I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful. I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals? I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library? Maybe defined as: exclusively1, common, exclusively2 = set1.partition(set2)

On 03/07/2013 21:50, Paddy3118 wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
How common is this need? I suspect that it's not that common. As to "seems wasteful", is it a performance bottleneck? I also suspect not (you even say "seems"! :-)).

On 3 July 2013 21:50, Paddy3118 <paddy3118@gmail.com> wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You can do it in one traversal of each set: def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly Oscar

On Wed, Jul 3, 2013 at 7:46 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
I don't understand why that can be more efficient that using the built-in operations: def partition(setx, sety): common = setx & sety return setx - common, common, sety - common I assume that the built-in operations traverse over the set with the smallest size, and preallocate the result to that size. -- Juancarlo *Añez*

On 04/07/13 11:41, Juancarlo Añez wrote:
On Wed, Jul 3, 2013 at 7:46 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
I don't understand why that can be more efficient that using the built-in operations:
def partition(setx, sety): common = setx & sety return setx - common, common, sety - common
I assume that the built-in operations traverse over the set with the smallest size, and preallocate the result to that size.
I don't think you can preallocate sets (or dicts) that way, since the offset of each item can depend on what items were already added in which order. So you can't know how many gaps to leave between items ahead of time, or where to put them. The obvious algorithm for set1 - set2 would be: def diff(set1, set2): result = set() for item in set1: if item not in set2: result.add(item) return result Are you suggesting this instead? def diff(set1, set2): if len(set1) <= len(set2): result = set() for item in set1: if item not in set2: result.add(item) else: result = set1.copy() for item in set2: if item in set1: result.remove(item) return result Either way, you still have to traverse set1 in full, either explicitly, or when you copy it. The overhead of copying the set when you might end up removing all the items suggests to me that there is no benefit in iterating over the smaller set. Likewise, set2 - set1 has to traverse set2 in full, regardless of which is smaller. set1 & set2 has to check every element of both sets. So that's four traversals in total, each set getting traversed twice. Oscar's version above only traverses each set once, making two in total. However, being in pure Python, it's probably an order of magnitude slower than calling the C set methods. That's my guess, I leave it to someone else to actually time the code and see :-) -- Steven

Oscar Benjamin wrote:
On 3 July 2013 21:50, Paddy3118 <paddy3118@gmail.com> wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You can do it in one traversal of each set:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
The price you pay for the symmetry is that for the second iteration of the outer loop xandy.add(val) is a noop. When you give up that symmetry you can replace one inner loop with a set operation: def partition(old, new): removed = old - new common = set() added = set() for item in new: if item in old: common.add(item) else: added.add(item) return removed, common, added

On Jul 4, 2013 Oscar Benjamin wrote:
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You can do it in one traversal of each set:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
JFYI, despite using two passes this long partition() is twice slower than simple: def partition(set1, set2): common = set1 & set2 return set1 - common, common, set2 - common That's because both functions are O(n) but the short one is just a few native calls, while long one uses lots of small calls, and thus overhead of so many function calls *significantly* exceeds time of one additional pass. Simple is better than complex. Readability counts. :) -- $ ./partition-test.py 2-lines partition: 0.5874 seconds 8-lines partition: 1.4138 seconds 9-lines partition: 0.8485 seconds #partition-test.py from time import time def partition1(set1, set2): common = set1 & set2 return set1 - common, common, set2 - common def partition2(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly def partition3(old, new): removed = old - new common = set() added = set() for item in new: if item in old: common.add(item) else: added.add(item) return removed, common, added if __name__ == "__main__": set1 = set(range(0, 3000000)) set2 = set(range(2000000,4000000)) t = time() partition1(set1, set2) print "2-lines partition:", time()-t, "seconds" t = time() partition2(set1, set2) print "8-lines partition:", time()-t, "seconds" t = time() partition3(set1, set2) print "9-lines partition:", time()-t, "seconds"

On 4 July 2013 13:33, Sergey <sergemp@mail.ru> wrote:
On Jul 4, 2013 Oscar Benjamin wrote:
You've missed the attribution line for the OP here so I'll add it back:
Paddy3118 wrote:
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
Note the question I'm responding to below.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You can do it in one traversal of each set:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
JFYI, despite using two passes this long partition() is twice slower than simple: def partition(set1, set2): common = set1 & set2 return set1 - common, common, set2 - common
That's because both functions are O(n) but the short one is just a few native calls, while long one uses lots of small calls, and thus overhead of so many function calls *significantly* exceeds time of one additional pass.
Simple is better than complex. Readability counts. :)
Since two or three people seem to have misunderstood what I wrote I'll clarify that the intention was to demonstrate the existence of a single pass algorithm. I was not trying to show the fastest CPython code for this problem. As Peter Otten has pointed out the algorithm is a little redundant so I'll improve it: def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for x in setx: if x in sety: xandy.add(x) else: xonly.add(x) for y in sety: if y not in setx: yonly.add(y) return xonly, xandy, yonly If you're worried about optimal branch flow for large sets then you might consider swapping so that the first loop is over the smaller of the two sets (assuming that the for/if branch is slightly faster than the for/else branch). However the swapping difference is not really an algorithmic difference in the sense that it would be for computing only the intersection. Oscar

On Thu, Jul 4, 2013 at 9:45 AM, Oscar Benjamin <oscar.j.benjamin@gmail.com>wrote:
If you're worried about optimal branch flow for large sets then you might consider swapping so that the first loop is over the smaller of the two sets (assuming that the for/if branch is slightly faster than the for/else branch). However the swapping difference is not really an algorithmic difference in the sense that it would be for computing only the intersection.
One of my points was (I haven't looked) that the native set operations probably do the swapping, and preallocate the result sets. Preallocation is an important performance booster, and may be part of an algorithm. In the double-loop implementation, one could try to copy() to the resulting sets, and then use remove() instead of add(). The swapping based on length is also easy to do. I guess that the point is that, in algorithm analysis, when the O() is good, like O(N), one should take a look at the constant parts part of the actual O(N*C+M*D) (for example), as they make all the difference. These kind of considerations went into Python's choice of implementation for sort(), for example. -- Juancarlo *Añez*

Hi Sergey, I thought that set partitioning might be a worthy candidate for implementation in C if an algorithm that cut down on the set traversals could be found. Yep, I didn't think such an algorithm in Python would likely be any quicker. I guess there are two parts to my question: 1. Does an algorithm using less passes exist? Yes, thanks. 2. Would a C implementation be a worthwhile addition to Python? - Ongoing... - Paddy. On Thursday, 4 July 2013 13:33:35 UTC+1, Sergey wrote:
On Jul 4, 2013 Oscar Benjamin wrote:
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You can do it in one traversal of each set:
def partition(setx, sety): xonly, xandy, yonly = set(), set(), set() for set1, set2, setn in [(setx, sety, xonly), (sety, setx, yonly)]: for val in set1: if val in set2: xandy.add(val) else: setn.add(val) return xonly, xandy, yonly
JFYI, despite using two passes this long partition() is twice slower than simple: def partition(set1, set2): common = set1 & set2 return set1 - common, common, set2 - common
That's because both functions are O(n) but the short one is just a few native calls, while long one uses lots of small calls, and thus overhead of so many function calls *significantly* exceeds time of one additional pass.
Simple is better than complex. Readability counts. :)

Hi Paddy, long time since I've read something from you, welcome back. On 04/07/13 06:50, Paddy3118 wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You could go from: exclusively1 = set1 - set2 common = set1 & set2 exclusively2 = set2 - set1 (which I expect ends up traversing each set twice) to this: common, exclusively1, exclusively2 = set(), set(), set() for item in set1: if item in set2: common.add(item) else: exclusively1.add(item) for item in set2: if item in set1: common.add(item) else: exclusively2.add(item) which only does two set traversals. But I would expect that using the set operators would be faster. (Iterating over the sets 3 or 4 times in C is likely to be faster than iterating over them 2 times in pure Python.)
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
[bikeshed] I would expect common, only1, only2 in that order. Does any other language with sets offer this as a set primitive? I too have needed it often enough that I'd be +1 on adding it. -- Steven

(Thanks Steven for the welcome back. I've been advocating Python through the Rosetta Code site). I couldn't think of Oscar and your solution but now that we have a Python algorithm, I understand that it could only be really tested if there were an implementation in C to test against the incumbent method. To give a bit more info on how much it is used: I have even incorporated the partition-type functionality in Python in throw-away bash shell scripts of the form: munge.sh data1 | sort > results1.txt munge.sh data2 | sort > results2 .txt python -c ' set1 = set(file("results1.txt")) set2 = set(file("results2.txt")) print([len(p) for p in (set1 - set2, set1 & set2, set2 - set1)]) ' I have used the above several times when my boss touches base with me and asks the informal "hi, and how are you getting on"? The above isn't my only use. I have used the construct enough times for me to think of it as a "design pattern". Before I posted, I looked up set partition on Wikipedia. It has a different meaning for a partition of a set, but doesn't apply any meaning to the partition of two sets leaving the dyadic meaning of partition vacant :-) - Paddy. On Thursday, 4 July 2013 03:10:28 UTC+1, Steven D'Aprano wrote:
Hi Paddy, long time since I've read something from you, welcome back.
On 04/07/13 06:50, Paddy3118 wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
You could go from:
exclusively1 = set1 - set2 common = set1 & set2 exclusively2 = set2 - set1
(which I expect ends up traversing each set twice) to this:
common, exclusively1, exclusively2 = set(), set(), set() for item in set1: if item in set2: common.add(item) else: exclusively1.add(item) for item in set2: if item in set1: common.add(item) else: exclusively2.add(item)
which only does two set traversals. But I would expect that using the set operators would be faster. (Iterating over the sets 3 or 4 times in C is likely to be faster than iterating over them 2 times in pure Python.)
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
[bikeshed] I would expect common, only1, only2 in that order.
Does any other language with sets offer this as a set primitive? I too have needed it often enough that I'd be +1 on adding it.
-- Steven _______________________________________________ Python-ideas mailing list Python...@python.org <javascript:> http://mail.python.org/mailman/listinfo/python-ideas

On 4 July 2013 03:10, Steven D'Aprano <steve@pearwood.info> wrote:
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
[bikeshed] I would expect common, only1, only2 in that order.
I like the other order. It reads like a horizontal Venn diagram e.g.: http://upload.wikimedia.org/wikipedia/commons/9/99/Venn0001.svg Although it probably wouldn't be the best order if the function were somehow generalised to more than two sets (not that I can think of a useful and intelligible generalisation).
Does any other language with sets offer this as a set primitive?
I'd be interested to know what they call it if they do. "Partition" seems wrong to me. If a set had a partition method I would expect it give me a partition of that set i.e. a set of disjoint subsets covering the original set. This is related to the partition of the set in the sense that if xonly, xandy, yonly = set.partition(setx, sety) then {xonly, xandy} is a partition of setx and {yonly, xandy} is a partition of sety and {xonly, xandy, yonly} is a partition of the union setx | sety. I think, though, that I would expect a set.partition method to work something like true_set, false_set = setx.partition(lambda x: predicate(x)) with the obvious semantics. Then you could get a single pass algorithm for the original problem with xandy, xonly = setx.partition(sety.__contains__) yonly = sety - xandy Oscar

On Friday, 5 July 2013 07:19:24 UTC+1, Greg Ewing wrote:
Oscar Benjamin wrote:
On 4 July 2013 03:10, Steven D'Aprano <st...@pearwood.info <javascript:>> wrote:
Does any other language with sets offer this as a set primitive?
I'd be interested to know what they call it if they do.
I think it should be called vennerate().
Vennerate? Cute; but we have str1.partition(str2), and set1.partition(set2) would be mathematical partition of the union of both sets returning a tuple of three elements - just like str.partition, which makes me vote for the name "partition".

On 5 July 2013 08:09, Paddy3118 <paddy3118@gmail.com> wrote:
On Friday, 5 July 2013 07:19:24 UTC+1, Greg Ewing wrote:
Oscar Benjamin wrote:
On 4 July 2013 03:10, Steven D'Aprano <st...@pearwood.info> wrote:
Does any other language with sets offer this as a set primitive?
I'd be interested to know what they call it if they do.
I think it should be called vennerate().
Vennerate? Cute; but we have str1.partition(str2), and set1.partition(set2) would be mathematical partition of the union of both sets returning a tuple of three elements - just like str.partition, which makes me vote for the name "partition".
I didn't know about the str.partition method but having looked at it now what it does is closer to what I think of as a partition of a set i.e. after head, sep, tail = str1.partition(str2) we have that str1 == head + sep + tail By analogy I would expect that after set2, set3, set4 = set1.partition(...) we would have set1 == set2 | set3 | set4 because that is (one property of) a "partition" of set1. However this is not the meaning you intend. Oscar

On 5 July 2013 10:48, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
I didn't know about the str.partition method but having looked at it now what it does is closer to what I think of as a partition of a set i.e. after
head, sep, tail = str1.partition(str2)
we have that
str1 == head + sep + tail
By analogy I would expect that after
set2, set3, set4 = set1.partition(...)
we would have
set1 == set2 | set3 | set4
because that is (one property of) a "partition" of set1. However this is not the meaning you intend.
Hence why it should be defined: set1.partition(set2) === set1 - set2, set1 & set 2 That lets you have the "single pass" of each set as before, but is a "smaller" operation. The full partition would just plot "set2 - set1" on the end: ven_a, ven_shared = a.partition(b) ven_b = ven_b - ven_shared And "ven_a | ven_shared == a" as you said.

On Friday, 5 July 2013 22:38:22 UTC+1, Joshua Landau wrote:
...
Hence why it should be defined:
set1.partition(set2) === set1 - set2, set1 & set 2
That lets you have the "single pass" of each set as before, but is a "smaller" operation. The full partition would just plot "set2 - set1" on the end:
ven_a, ven_shared = a.partition(b) ven_b = ven_b - ven_shared
And "ven_a | ven_shared == a" as you said. _______________________________________________
Unfortunately I tend to need all three of exclusively1, common, exclusively2 . Assuming fast C implementations, then your proposal would lead to less opportunity for optimization and the need to create one needed result in Python. The need for all three and the potentially better optimization afforded to computing three at once in C should outweigh any considerations of the name of the method.

On 6 July 2013 05:56, Paddy3118 <paddy3118@gmail.com> wrote:
On Friday, 5 July 2013 22:38:22 UTC+1, Joshua Landau wrote:
Hence why it should be defined:
set1.partition(set2) === set1 - set2, set1 & set 2
That lets you have the "single pass" of each set as before, but is a "smaller" operation. The full partition would just plot "set2 - set1" on the end:
ven_a, ven_shared = a.partition(b) ven_b = ven_b - ven_shared
And "ven_a | ven_shared == a" as you said.
Unfortunately I tend to need all three of exclusively1, common, exclusively2 . Assuming fast C implementations, then your proposal would lead to less opportunity for optimization and the need to create one needed result in Python. The need for all three and the potentially better optimization afforded to computing three at once in C should outweigh any considerations of the name of the method.
I don't believe it would be any (measurably) slower, given the single-pass optimisations shown in other posts. I could be wrong, but it's not like Python's known for it's micro-optimisations (except for dicts...). I also think that my way is better *because* of efficiency concerns -- if it's not measurably slower for the 3-way split, surely it's more useful to allow people to have a fast 2-way split. It's also a simpler, less redundant implementation. I could give you some uses of my partition over 3-way partitioning: flags = set(...) flags_passed_to_next_in_chain, flags_kept = flags.partition(flags_that_I_want) The rest are all variations of this.

On 05/07/13 19:48, Oscar Benjamin wrote:
On 5 July 2013 08:09, Paddy3118 <paddy3118@gmail.com> wrote:
On Friday, 5 July 2013 07:19:24 UTC+1, Greg Ewing wrote:
Oscar Benjamin wrote:
On 4 July 2013 03:10, Steven D'Aprano <st...@pearwood.info> wrote:
Does any other language with sets offer this as a set primitive?
I'd be interested to know what they call it if they do.
I think it should be called vennerate().
Groan. That's a terrible pun :-)
Vennerate? Cute; but we have str1.partition(str2), and set1.partition(set2) would be mathematical partition of the union of both sets returning a tuple of three elements - just like str.partition, which makes me vote for the name "partition".
I didn't know about the str.partition method but having looked at it now what it does is closer to what I think of as a partition of a set i.e. after
head, sep, tail = str1.partition(str2)
we have that
str1 == head + sep + tail
By analogy I would expect that after
In this case, what's being partitioned isn't set1, but the union of the two sets.
set2, set3, set4 = set1.partition(...)
we would have
set1 == set2 | set3 | set4
because that is (one property of) a "partition" of set1. However this is not the meaning you intend.
set1 | set2 = only1 | common | only2 [only a quarter serious] In a way, perhaps this ought to be an operator, and return a set of three (frozen)sets. "partition" is not the ideal name for this method, but the actual operation itself is very useful. I have often done this, mostly on dict views rather than sets. In my head, it's an obvious operation, to split a pair of sets into three, spoiled only by lack of a good name. Perhaps "split" is a reasonable name? only1, both, only2 = set1.split(set2) Hmmm... seems reasonable to me. -- Steven

Steven D'Aprano writes:
"partition" is not the ideal name for this method,
+1 It would completely confuse anybody who did want a partition.
but the actual operation itself is very useful. I have often done this, mostly on dict views rather than sets. In my head, it's an obvious operation, to split a pair of sets into three, spoiled only by lack of a good name.
Perhaps "split" is a reasonable name?
only1, both, only2 = set1.split(set2)
-1 Set splitting is an intractable problem. https://en.wikipedia.org/wiki/Set_splitting_problem That wouldn't bother me all that much except that I can imagine all kinds of ways to split sets that have little to do with boolean algebra (starting with Dedekind cuts, you see how messy this will get). This particular operation is quite a ways down my list of candidates. I propose "join".[1] If we consider set1 and set2 as implicitly defining the partition of some universe into set1 and complement of set1, and respectively the partition of the same universe into set2 and its complement, then what you have here is the lattice-theoretic join of the two partitions (the coarsest common refinement), with (as before) "everything else in the universe" being implied (ie, left out of the result). This also generalizes well to more than two sets. I suspect that anybody who wants a true partition join will define a partition class, and will define their own join. So this slight abuse of terminology probably won't cause that much trouble. Footnotes: [1] Or maybe it's the meet? I never can keep the two straight....

On 07/07/13 01:34, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
"partition" is not the ideal name for this method,
+1
It would completely confuse anybody who did want a partition.
but the actual operation itself is very useful. I have often done this, mostly on dict views rather than sets. In my head, it's an obvious operation, to split a pair of sets into three, spoiled only by lack of a good name.
Perhaps "split" is a reasonable name?
only1, both, only2 = set1.split(set2)
-1
Set splitting is an intractable problem. https://en.wikipedia.org/wiki/Set_splitting_problem
Paddy's suggested method is a concrete, conceptually simple operation on two finite, discrete sets, not some theoretical problem[1] from complexity theory. If we're going to reject method names because they have some vague relation to some obscure corner of mathematics that 99% of programmers will never have heard of, let alone care about, I think we're going to soon run out of good names.
That wouldn't bother me all that much except that I can imagine all kinds of ways to split sets that have little to do with boolean algebra (starting with Dedekind cuts, you see how messy this will get).
The string split method implements *one specific way* of splitting strings, by partitioning on some given delimiter. There are other ways of splitting, say by keeping the delimiter, or by partitioning the string in groups of N characters, or between pairs of brackets, etc. We don't reject the name "split" for strings just because there are alternative ways to split, and we shouldn't reject a simple, descriptive, understandable name "split" for sets just because there are other ways to split sets.
I propose "join".[1] ... [1] Or maybe it's the meet? I never can keep the two straight....
I don't think that either is appropriate. Join and meet are operations on a single set, not a pair of them: the join of a set is the least upper bound (effectively, the maximum) and the meet is the greatest lower bound (effectively, the minimum) of the set. They are not operations on two sets. http://en.wikipedia.org/wiki/Join_and_meet Alternatively, join and meet can be defined as binary operations on elements of the set, rather than on the set itself. But in any case, I don't think that a method that takes two sets as input and returns three sets should be called a "join". In plain English, when you join two things you get one thing, not three. And if we're going to reject set.partition because it doesn't behave quite the same as str.partition, then we should reject set.join because it doesn't behave anything even slightly like str.join, which is *far* more well-known than str.partition. This is clearly a convenience method. There's already a fast way to calculate the result, it just takes three calls instead of one. This would be a little faster and more convenient but it wouldn't change what we can do with sets. I already have a utility function in my toolbox to calculate this, so it would be a Nice To Have if it were a built-in set method, but not if it means spending three weeks arguing about the method name :-) [1] Not that I mean to imply that there is necessarily no concrete application for this problem. But being intractable, it is unlikely to be proposed or accepted as a set method. -- Steven

06.07.2013 19:26, Steven D'Aprano wrote:
On 07/07/13 01:34, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
"partition" is not the ideal name for this method,
+1
It would completely confuse anybody who did want a partition.
but the actual operation itself is very useful. I have often done this, mostly on dict views rather than sets. In my head, it's an obvious operation, to split a pair of sets into three, spoiled only by lack of a good name.
Perhaps "split" is a reasonable name?
only1, both, only2 = set1.split(set2)
-1
Set splitting is an intractable problem. https://en.wikipedia.org/wiki/Set_splitting_problem [...] I propose "join".[1] ... [1] Or maybe it's the meet? I never can keep the two straight....
I don't think that either is appropriate. Join and meet are operations on a single set, not a pair of them [...]
Other ideas for the method name: * trisect * trisection * cover * over * overlap * interfere Cheers. *j

On 07/07/13 07:40, Joshua Landau wrote:
(I still don't get what people have against my version though. A 2-way partition makes sense)
For the record, your suggestion was to add a convenience method: set1.partition(set2) => (set1 - set2, set1 & set2) and then call it as a two-liner: only1, both = set1.partition(set2) only2 = set2 - set1 instead of Paddy's suggestion (with the method name left unknown): only1, both, only2 = set1.???????(set2) I don't think much of your suggestion. It doesn't solve the stated use-case, where you want a three-way partition of two sets. And I'm having difficulty in thinking of practical examples where I might only want two out of the three "Venn subsets". Since this is a convenience method, it doesn't add any new functionality, it needs to be *more* rather than *less* convenient than what it replaces. -- Steven

On 7 July 2013 03:23, Steven D'Aprano <steve@pearwood.info> wrote:
On 07/07/13 07:40, Joshua Landau wrote:
(I still don't get what people have against my version though. A 2-way partition makes sense)
For the record, your suggestion was to add a convenience method:
set1.partition(set2) => (set1 - set2, set1 & set2)
and then call it as a two-liner:
only1, both = set1.partition(set2) only2 = set2 - set1
instead of Paddy's suggestion (with the method name left unknown):
only1, both, only2 = set1.???????(set2)
I don't think much of your suggestion. It doesn't solve the stated use-case, where you want a three-way partition of two sets.
Only it does, as the problem was that it was "wasteful" to do the extra passes over the list - which mine solves. If you really need a convenience function for a three-way partition it makes sense to put it in a function and call "partition3(set1, set2)". It looks better as a function, too, as the arguments "act" on each-other, rather than being unidirectional as the method-based syntax implies.
And I'm having difficulty in thinking of practical examples where I might only want two out of the three "Venn subsets".
I've already given one. When you have a set of <flags> and you only want to deal with <these ones> and pass <the rest> to the next in the chain, you'd want to do <flags>.partition(<these ones>). There are a lot of circumstances of this sort. Personally, I've no idea why you'd want all three. I'm willing, though, to accept that you have a good reason.
Since this is a convenience method, it doesn't add any new functionality, it needs to be *more* rather than *less* convenient than what it replaces.
I'm quite surprised you think it's less convenient than the default method. The main reason it was brought up was because of the "wastefulness". If it's really so bad to be forced to write a function, given that the original efficiency problems are solved, may I ask why we still have to write: d = dict1.copy() d.update(dict2) func(d) to pass a function the combination of two dictionaries (in my experience a far more needed utility)? I think that ship has sailed long ago¹. ¹ This may change soon if we get to do "func({**dict1, **dict2})", but that's not the primary motive for the change so w/e.

On Sat, Jul 6, 2013 at 7:23 PM, Steven D'Aprano <steve@pearwood.info> wrote:
For the record, your suggestion was to add a convenience method: set1.partition(set2) => (set1 - set2, set1 & set2) only1, both = set1.partition(set2) only2 = set2 - set1
instead of Paddy's suggestion (with the method name left unknown):
only1, both, only2 = set1.???????(set2)
I dislike the appearance of creating a method on the 'set' type. It creates an asymmetry between the respective sets that doesn't really express the sense of the symmetrical operation. However, I *do* think it would be worth having a faster (i.e. C-coded) way of doing the three-way partitioning. Moreover, there's really no reason that such a function couldn't operate on collections (or iterators) other than sets, and being general seems more useful. Therefore, I would suggest adding a C-coded function in the STDLIB--probably in 'collections' to do this. E.g. from collections import segment # tentative name, see various suggestions in thread only1, only2, intersection = segment(iter1, iter2) In behavior, this should do the same thing as the below (just faster): def segment(iter1, iter2): set1, set2 = map(set, (iter1, iter2)) return set1 - set2, set2 - set1, set1 & set2 -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Jul 6, 2013, at 19:53, David Mertz <mertz@gnosis.cx> wrote:
However, I *do* think it would be worth having a faster (i.e. C-coded) way of doing the three-way partitioning. Moreover, there's really no reason that such a function couldn't operate on collections (or iterators) other than sets, and being general seems more useful.
If we don't have "intersection" and "difference" functions that work on any pair of iterables, it seems strange to have this function do so. And I think there's a good reason we don't have them, and that same reason applies here: there actually _is_ an inherent asymmetry to all of these algorithms. One of the iterables has to be a set (or at least a Set with fast lookup); the other can be anything, with no change in semantics or performance. In fact, if the large iterable is an iterator, there could even be a substantial improvement in space if the result were three iterators. (Obviously O(M) is much better than O(N+M) when M << N.) However, going again by the parallel with the methods we already have, if we don't have itertools.set_difference we probably don't need itertools.whatever_this_is, just a method on set.

On Sunday, 7 July 2013 03:53:09 UTC+1, David Mertz wrote:
<snip>
from collections import segment # tentative name, see various
suggestions in thread only1, only2, intersection = segment(iter1, iter2)
In behavior, this should do the same thing as the below (just faster):
def segment(iter1, iter2): set1, set2 = map(set, (iter1, iter2)) return set1 - set2, set2 - set1, set1 & set2
Hi David, I really am not sure about generalizing the interface to iterators and then immediately turning them into sets in the implementation. I think the functionality can be naturally explained as an operation on two sets and should be restricted to sets. The caller should have to map other types to sets explicitly. If it turns out that the implemented algorithm in C would work just as well with one of the arguments being any finite iterator and the other needing to be a set then we could still stick with requiring two sets or change to a format of: set_only, common, iter_only = some_set.????(some_iterator)

Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would be. But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets. On Sat, Jul 6, 2013 at 10:28 PM, Paddy3118 <paddy3118@gmail.com> wrote:
On Sunday, 7 July 2013 03:53:09 UTC+1, David Mertz wrote:
<snip>
from collections import segment # tentative name, see various
suggestions in thread only1, only2, intersection = segment(iter1, iter2)
In behavior, this should do the same thing as the below (just faster):
def segment(iter1, iter2): set1, set2 = map(set, (iter1, iter2)) return set1 - set2, set2 - set1, set1 & set2
Hi David, I really am not sure about generalizing the interface to iterators and then immediately turning them into sets in the implementation. I think the functionality can be naturally explained as an operation on two sets and should be restricted to sets. The caller should have to map other types to sets explicitly.
If it turns out that the implemented algorithm in C would work just as well with one of the arguments being any finite iterator and the other needing to be a set then we could still stick with requiring two sets or change to a format of:
set_only, common, iter_only = some_set.????(some_iterator)
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable. -- Eric.

On Jul 7, 2013 2:09 PM, "Eric V. Smith" <eric@trueblade.com> wrote:
On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would
be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable.
I agree.
-- Eric. _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

On Sun, Jul 07, 2013 at 02:37:56PM -0700, David Mertz wrote:
On Jul 7, 2013 2:09 PM, "Eric V. Smith" <eric@trueblade.com> wrote:
On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would
be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable.
I agree.
A class member? Do you mean a class *method*? I think it would be freaky and weird if I did this: some_set.venn_split(second_set, another_set) (for lack of a better name) and the value of some_set was ignored. Class methods are okay for things like alternate constructors, but I don't think they are appropriate here. -- Steven

Yeah, needs to be on sets/dictionary types otherwise question of equality versus identity comes into play. Not to mention the behavior of shared gets pretty ambiguous: if there is one of them in one and two in the other, do two of them go in common, one in common and xor2. I think it makes sense as a standard method of the set type, if it makes sense at all; and also that "vennsubs" might be a good name. Could accept any itterable composed of hashable (that's what makes a viable key, righ?) items. On Jul 8, 2013, at 12:22 AM, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Jul 07, 2013 at 02:37:56PM -0700, David Mertz wrote:
On Jul 7, 2013 2:09 PM, "Eric V. Smith" <eric@trueblade.com> wrote:
On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would
be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable.
I agree.
A class member? Do you mean a class *method*?
I think it would be freaky and weird if I did this:
some_set.venn_split(second_set, another_set)
(for lack of a better name) and the value of some_set was ignored. Class methods are okay for things like alternate constructors, but I don't think they are appropriate here.
-- Steven _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

On 7/8/2013 3:22 AM, Steven D'Aprano wrote:
On Sun, Jul 07, 2013 at 02:37:56PM -0700, David Mertz wrote:
On Jul 7, 2013 2:09 PM, "Eric V. Smith" <eric@trueblade.com> wrote:
On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would
be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable.
I agree.
A class member? Do you mean a class *method*?
I did mean classmethod, thanks. Or maybe staticmethod, I haven't really thought it through. The point being, it need not be an instance method.
I think it would be freaky and weird if I did this:
some_set.venn_split(second_set, another_set)
(for lack of a better name) and the value of some_set was ignored. Class methods are okay for things like alternate constructors, but I don't think they are appropriate here.
set.venn_split(second_set, another_set) It's no more surprising than this code not using the values from d:
d = {'a':1, 'b':2} d.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
versus:
dict.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
-- Eric.

On 8 July 2013 09:24, Eric V. Smith <eric@trueblade.com> wrote:
I think it would be freaky and weird if I did this:
some_set.venn_split(second_set, another_set)
(for lack of a better name) and the value of some_set was ignored. Class methods are okay for things like alternate constructors, but I don't think they are appropriate here.
set.venn_split(second_set, another_set)
It's no more surprising than this code not using the values from d:
d = {'a':1, 'b':2} d.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
versus:
dict.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
Surely the point is that in s.venn_split(s1, s2) the *value* of s might be ignored, but the *type* of s should be the type of the results? So subclassing works "as expected" (for some value of the word "expected" :-)) Paul.

On Mon, Jul 08, 2013 at 04:24:39AM -0400, Eric V. Smith wrote:
On 7/8/2013 3:22 AM, Steven D'Aprano wrote:
On Sun, Jul 07, 2013 at 02:37:56PM -0700, David Mertz wrote:
On Jul 7, 2013 2:09 PM, "Eric V. Smith" <eric@trueblade.com> wrote:
On 7/7/2013 1:45 AM, David Mertz wrote:
Maybe the generalization isn't worthwhile. I was thinking that maybe a more general version should keep order in types that have order to start with, so I confess I'm not certain what the "correct" interface would
be.
But even if it were only for sets, I like the idea of a plain function much better than a method of a set, even if the only arguments it accepted were sets.
If it were added, I think a classmember on set would be reasonable.
I agree.
A class member? Do you mean a class *method*?
I did mean classmethod, thanks. Or maybe staticmethod, I haven't really thought it through. The point being, it need not be an instance method.
Strictly speaking, you're right. Being Python, you could make this any sort of callable you like. But what would be the point of making it something other than either an instance method or a function? It's an operation that requires two set arguments. The obvious way to handle it is as either a function with signature func(set1, set2) or as a method with signature method(self, other). There are no points for "Most Unusual and Imaginative API". (A third alternative would be an operator, __op__(self, other), but I'm not seriously suggesting that.)
I think it would be freaky and weird if I did this:
some_set.venn_split(second_set, another_set)
(for lack of a better name) and the value of some_set was ignored. Class methods are okay for things like alternate constructors, but I don't think they are appropriate here.
set.venn_split(second_set, another_set)
It's no more surprising than this code not using the values from d:
d = {'a':1, 'b':2} d.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
versus:
dict.fromkeys([3, 4, 5]) {3: None, 4: None, 5: None}
It is standard behaviour for alternate constructors to ignore the value of the instance, hence they are usually classmethods. A constructor normally depends on the class, not the instance: dict(args), not {}(args). And the args generally don't include an instance of the class you are trying to make. They can, of course, but generally we have things like Decimal.from_float(arg). Even when you call the constructor from an instance, you don't want the instance itself to make any difference to the result, the result should be specified entirely by the arguments. So constructors should ignore the instance, and should be classmethods. None of these things apply to "venn_split". We're not talking about a constructor, but a set operation that requires two set arguments. One is conventionally taken to be `self`, the other explicitly given: class set: def venn_split(self, other): ... not: class set: @classmethod def venn_split(cls, set1, set2): # ignore the value of cls ... Making this "venn-split" operation a class or static method would be as weird as making (say) str.split a class method: "a,b,c,d".split(",") => raise TypeError, too few arguments "a,b,c,d".split("a,b,c,d", ",") => ["a", "b", "c", "d"] That's a strange API. This is Python, you can do it, but you shouldn't. -- Steven

subsets? On Jul 6, 2013, at 1:06 PM, Jan Kaliszewski <zuo@chopin.edu.pl> wrote:
06.07.2013 19:26, Steven D'Aprano wrote:
On 07/07/13 01:34, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
"partition" is not the ideal name for this method,
+1
It would completely confuse anybody who did want a partition.
but the actual operation itself is very useful. I have often done this, mostly on dict views rather than sets. In my head, it's an obvious operation, to split a pair of sets into three, spoiled only by lack of a good name.
Perhaps "split" is a reasonable name?
only1, both, only2 = set1.split(set2)
-1
Set splitting is an intractable problem. https://en.wikipedia.org/wiki/Set_splitting_problem [...] I propose "join".[1] ... [1] Or maybe it's the meet? I never can keep the two straight....
I don't think that either is appropriate. Join and meet are operations on a single set, not a pair of them [...]
Other ideas for the method name:
* trisect * trisection * cover * over * overlap * interfere
Cheers. *j
_______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas

On Wednesday, 3 July 2013 21:50:35 UTC+1, Paddy3118 wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
- Others who have used this construct have now come to light which was one of my goals. - There is an algorithm that can save on the set traversals and that might be quicker when implemented in C - another of my goals. We are currently arguing on the name which I must admit does need to be agreed, but usually takes a disproportionate amount of time to agree. (Especially when we all think that of course my solution is obviously the best ;-) I re-read the Wikipedia article on Venn diagrams as I am indeed asking for a split of two sets into the regions depicted in a two variable Venn diagram. The punny name *vennerate *gains some credibility, but that awful pun just has to die a quick death. *vennsplit*, *venndiv*, *vennpartition*, *vpartition *are new name candidates on the same theme but without the pun. The division Of two two sets that I am asking for is related to truth tables set1 set2 : Resultant ==== ==== : ========= 0 0 : Items not in set1 and not in set2; - Dumb! 1 0 : Items in set1 and not in set2; exclusively1 0 1 : Items not in set1 and in set2; exclusively2 1 1 : Items not in set1 and not in set2; common The above leads to a generalization for more than two sets; for example three sets and a change in nomenclature for the resultants would lead to: (bin001, bin010, bin011, bin100, bin101, bin110, bin111) = set0.partition(set1, set2) ...But three sets and more should be abandoned I think as being too complicated for little gain.

On 07/06/2013 11:46 AM, Paddy3118 wrote:
We are currently arguing on the name which I must admit does need to be agreed, but usually takes a disproportionate amount of time to agree. (Especially when we all think that of course my solution is obviously the best ;-)
A good practice when it comes to names here, if a there isn't a obvious name to use, is to decide on a temporary name while the exact behaviour is being worked out. Later, if there is some consensus for adding the feature to python's library, collect name suggestions and then have an informal poll/discussion for the best name/spelling out of those suggestions. That has worked well in the past and avoids getting bogged down early on because of the name. Cheers, Ron

On Wednesday, 3 July 2013 21:50:35 UTC+1, Paddy3118 wrote:
I found myself repeating something that I know I have used before, several times: I get two sets of results, may be sets of the passing tests when a design has changed, and I need to work out what has changed so work out
1. What passed first time round 2. What passed both times. 3. What passed only the second time round.
I usually use something like the set equations in the title to do this but I recognise that this requires both sets to be traversed at least three times which seems wasteful.
I wondered if their was am algorithm to partition the two sets of data into three as above, but cutting down on the number of set traversals?
I also wondered that if such an algorithm existed, would it be useful enough to be worth incorporating into the Python library?
Maybe defined as:
exclusively1, common, exclusively2 = set1.partition(set2)
I've done a related blog entry Set divisions/partitions<http://paddy3118.blogspot.co.uk/2013/07/set-divisionspartitions.html> in which I try for a more general algorithm.
participants (17)
-
Andrew Barnert
-
David Mertz
-
Eric V. Smith
-
Greg Ewing
-
Jan Kaliszewski
-
Joshua Landau
-
Juancarlo Añez
-
MRAB
-
Oscar Benjamin
-
Paddy3118
-
Paul Moore
-
Peter Otten
-
Ron Adam
-
Sergey
-
Shane Green
-
Stephen J. Turnbull
-
Steven D'Aprano