
I would like to propose that the + operator be a synonym for | (union) on sets. IMHO, the succinct power of being able to say sum(list_of_sets, set()) as opposed to import operator reduce(operator.and_, list_of_sets, set()) far outweighs any problems with having two operators for union. The sum paradigm is certainly more readable as well. I realize that a function named "unionall" could be defined trivially, but with a built-in already serving the purpose in a readable and reasonable way, it seems silly to not use it. I anticipate that someone will ask why such a paradigm should be available for union as opposed to the other set operations. My answer is that several standard algorithms rely on a union over a sequence of sets; one example is the construction of the "follow" set in constructing an LL or SLR parser, and I know I've seen others that I cannot remember off the top of my head. There is even a mathematical symbol (big-U) for it akin to the summation sign (uppercase sigma). Any feedback? Brandon

Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as: big_u = set() for s in all_sets: big_u.update(s) I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib. George On Mon, Jun 2, 2008 at 7:48 AM, Brandon Mintern <bmintern@gmail.com> wrote:

On Mon, Jun 2, 2008 at 9:21 PM, George Sakkis <george.sakkis@gmail.com> wrote:
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing ** _iadd_ ** would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant. Brandon

On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern <bmintern@gmail.com> wrote:
No, it uses __add__: $ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set' You can easily see the quadratic behavior of __add__: $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(100)), Set())" 100 loops, best of 3: 2.4 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(200)), Set())" 100 loops, best of 3: 8.04 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(400)), Set())" 10 loops, best of 3: 33.3 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(800)), Set())" 10 loops, best of 3: 141 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(1600)), Set())" 10 loops, best of 3: 684 msec per loop George

On Mon, Jun 2, 2008 at 10:06 PM, George Sakkis <george.sakkis@gmail.com> wrote:
[snip] Ouch. Never mind my idea then. I do find that rather strange, though. It seems kind of strange to be able define an initial value but not use it as an accumulator. This means that sum on list, mutable user number classes, etc. is bound to be less efficient than it could be. Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not", and then maybe we can revisit this idea at that time. Honestly, what's wrong with sum being defined as: def sum (iterable, start=0): acc = start for i in iteratble: acc += i return acc Even though I'm making a lot of bad proposals, I sure am learning a lot. Brandon

On Tue, Jun 3, 2008 at 9:57 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Obviously, I meant to say "sum" there instead of "max" (which I'm pretty sure you realized as well) -- I had been using max at the time that I wrote that e-mail.
That was my intention, to take advantage of increased efficiency provided by mutable initial values. Unfortunately, I didn't consider the "existing code" problem, but that's not really a problem in Python 3K, is it? For example, sum(lists, []) currently runs in quadratic time (as pointed out by George Sakkis earlier in this thread using an example of sets that implement __add__). If instead, sum was implemented as: def sum (iterable, init=0): for i in iterable: init += i return init Then its behavior would mimic its current behavior for immutable types and other types which do not implement iadd, but for types that allow more efficient value modification, it would be a big win. Is there a good use case for a time when you wouldn't want an initial value to be mutated? In my experience, I've always passed in throwaway initial values anyways, like []. Brandon p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".

On Tue, Jun 03, 2008, Brandon Mintern wrote:
Because the culture is to prefer private replies when public replies are not necessary. Plus it provides a small level of insurance against really private replies going out in public. Here are the canonical URLs: http://www.unicom.com/pw/reply-to-harmful.html http://www.metasystema.net/essays/reply-to.mhtml -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 1993

On 3 Jun 2008, at 02:21, George Sakkis wrote:
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have A = S.union(T, U, V) mean A = S.union(T) A.update(U) A.update(V) As a consequence of Python method implementation, one could write instead: A = set.union(S, T, U, V) B = set.intersection(S, T, U, V) which reads nicely -- Arnaud

+1. I use union and intersection as sum-like functions from time to time, enough to warrant implementation in my "standard utility import". Although I'm not sure about the interface. A sum-like interface which receives a sequence might be better, although they are pretty much equivalent. ------------------------- Imri Goldberg www.algorithm.co.il/blogs www.imri.co.il ------------------------- Insert Signature Here ------------------------- Mathias Panzenböck wrote:

Something like this has been on my todo list for a while. Patches are welcome. It should be done for union, intersection, difference, and symmetric difference. Some attempt should be made to optimize the ordering so that a&b&c&d will run from the smallest set to the largest (to minimize the total loop count). Raymond

On 3 Jun 2008, at 19:44, Raymond Hettinger wrote:
That's true. However there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
And A.difference(B, C, D) could be rewritten as A - set.union(B, C, D) which may be just as clear. -- Arnaud

On 3 Jun 2008, at 20:32, Raymond Hettinger wrote:
Ah yes. Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning. -- Arnaud

2008/6/3 Raymond Hettinger <python@rcn.com>:
I have looked at setobject.c, it seems mostly straightforward to make the change, e.g. for union this is the current union method: static PyObject * set_union(PySetObject *so, PyObject *other) { PySetObject *result; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; if ((PyObject *)so == other) return (PyObject *)result; if (set_update_internal(result, other) == -1) { Py_DECREF(result); return NULL; } return (PyObject *)result; } A slight change makes it accept variable number of arguments (note that I haven't tried to optimize it by rearranging the order of arguments): static PyObject * set_union(PySetObject *so, PyObject *args) { PySetObject *result; PyObject *arg; Py_ssize_t nargs; Py_ssize_t i; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; nargs = PyTuple_GET_SIZE(args); for (i = 0; i < nargs; i++) { arg = PyTuple_GET_ITEM(args, i); if ((PyObject *)so != arg && set_update_internal(result, arg) == -1) { Py_DECREF(result); return NULL; } } return (PyObject *)result; } And then in set_methods[], change {"union", (PyCFunction)set_union, METH_O, union_doc}, to {"union", (PyCFunction)set_union, METH_VARARGS, vunion_doc}, I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time. -- Arnaud

From: "Arnaud Delobelle" <arnodel@googlemail.com>
It's up to you. If you can get a reviewer and want to go for it, that's fine. Otherwise, you can wait on me to get to it (it's been on my list for several months). In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order. The other six cases should also be processed in-order (left-to-right). Raymond

On 4 Jun 2008, at 09:45, Raymond Hettinger wrote:
Given that A - X - Y - Z is the same as (A-X) & (A-Y) & (A-Z) If it is a significant optimization to intersect sets from smallest to largest (as opposed to just starting with the smallest one and then intersecting from left to right), then should the same idea be applied to difference, except that obviously you start with the leftmost one and sort the others from largest to smallest? (I am *not* proposing to compute A-X, A-Y, ... and then intersect them!) -- Arnaud

Just realized that I failed to send this to the list as well: On Tue, Jun 3, 2008 at 3:48 PM, Raymond Hettinger <python@rcn.com> wrote:
Right, but that's why these functions do not accept more than two arguments. They are intended to be used as instance methods only. If we're promoting the idea of a set.method(*args) usage, however, the usage should probably be intuitive. Because they are associative, union, intersection, and symmetric_difference are all intuitive and do what is expected no matter what. That is not true of set-difference. In other words, it looks to me like set.method(*args) is trying to say "Take all of elements in these iterables and make one set out of them," or more simply, "Throw all this crap together." Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense? In mathematics, there are symbols for set.union(*args) (big-U) and set.intersection(*args) (big-upside-down-U), because they actually come up in common usage. I'm not aware of any such symbols for other set operations. Now that doesn't necessarily mean we shouldn't support them, but it is certainly something to think about. To take it from another angle, it is easy to define: set.union(*args) - the set of all the elements appearing in at least one of the args set.intersection(*args) - the set of all the elements appearing in every arg set.symmetric_difference(*args) - the set of all the elements appearing in an odd number of arguments but: set.difference(*args) - the set of all elements appearing in the first arg but not any of the rest is fundamentally different. When using set operations, ordering shouldn't even be a consideration. However, A.difference(*args) - the set of all elements in A that do not appear in any of the args is well-defined. For that reason, I say that we should support *args for all set operations, but we should only promote the use of set.method syntax for intersection and union. set.difference doesn't seem well-defined, and set.symmetric_difference doesn't seem very useful (and could lead to usage of set.difference). So... +1 supporting *args for all set operations +1 documenting the usage of set.union(*args) and set.intersection(*args) as unioning/intersecting all of the arguments -1 even mentioning set.difference or set.symmetric_difference in static usage That's my 2c, Brandon

2008/6/3 Arnaud Delobelle <arnodel@googlemail.com>:
I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn. [1] http://bugs.python.org/issue3069 -- Arnaud

Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as: big_u = set() for s in all_sets: big_u.update(s) I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib. George On Mon, Jun 2, 2008 at 7:48 AM, Brandon Mintern <bmintern@gmail.com> wrote:

On Mon, Jun 2, 2008 at 9:21 PM, George Sakkis <george.sakkis@gmail.com> wrote:
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing ** _iadd_ ** would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant. Brandon

On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern <bmintern@gmail.com> wrote:
No, it uses __add__: $ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set' You can easily see the quadratic behavior of __add__: $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(100)), Set())" 100 loops, best of 3: 2.4 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(200)), Set())" 100 loops, best of 3: 8.04 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(400)), Set())" 10 loops, best of 3: 33.3 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(800)), Set())" 10 loops, best of 3: 141 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(1600)), Set())" 10 loops, best of 3: 684 msec per loop George

On Mon, Jun 2, 2008 at 10:06 PM, George Sakkis <george.sakkis@gmail.com> wrote:
[snip] Ouch. Never mind my idea then. I do find that rather strange, though. It seems kind of strange to be able define an initial value but not use it as an accumulator. This means that sum on list, mutable user number classes, etc. is bound to be less efficient than it could be. Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not", and then maybe we can revisit this idea at that time. Honestly, what's wrong with sum being defined as: def sum (iterable, start=0): acc = start for i in iteratble: acc += i return acc Even though I'm making a lot of bad proposals, I sure am learning a lot. Brandon

On Tue, Jun 3, 2008 at 9:57 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Obviously, I meant to say "sum" there instead of "max" (which I'm pretty sure you realized as well) -- I had been using max at the time that I wrote that e-mail.
That was my intention, to take advantage of increased efficiency provided by mutable initial values. Unfortunately, I didn't consider the "existing code" problem, but that's not really a problem in Python 3K, is it? For example, sum(lists, []) currently runs in quadratic time (as pointed out by George Sakkis earlier in this thread using an example of sets that implement __add__). If instead, sum was implemented as: def sum (iterable, init=0): for i in iterable: init += i return init Then its behavior would mimic its current behavior for immutable types and other types which do not implement iadd, but for types that allow more efficient value modification, it would be a big win. Is there a good use case for a time when you wouldn't want an initial value to be mutated? In my experience, I've always passed in throwaway initial values anyways, like []. Brandon p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".

On Tue, Jun 03, 2008, Brandon Mintern wrote:
Because the culture is to prefer private replies when public replies are not necessary. Plus it provides a small level of insurance against really private replies going out in public. Here are the canonical URLs: http://www.unicom.com/pw/reply-to-harmful.html http://www.metasystema.net/essays/reply-to.mhtml -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 1993

On 3 Jun 2008, at 02:21, George Sakkis wrote:
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have A = S.union(T, U, V) mean A = S.union(T) A.update(U) A.update(V) As a consequence of Python method implementation, one could write instead: A = set.union(S, T, U, V) B = set.intersection(S, T, U, V) which reads nicely -- Arnaud

+1. I use union and intersection as sum-like functions from time to time, enough to warrant implementation in my "standard utility import". Although I'm not sure about the interface. A sum-like interface which receives a sequence might be better, although they are pretty much equivalent. ------------------------- Imri Goldberg www.algorithm.co.il/blogs www.imri.co.il ------------------------- Insert Signature Here ------------------------- Mathias Panzenböck wrote:

Something like this has been on my todo list for a while. Patches are welcome. It should be done for union, intersection, difference, and symmetric difference. Some attempt should be made to optimize the ordering so that a&b&c&d will run from the smallest set to the largest (to minimize the total loop count). Raymond

On 3 Jun 2008, at 19:44, Raymond Hettinger wrote:
That's true. However there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
And A.difference(B, C, D) could be rewritten as A - set.union(B, C, D) which may be just as clear. -- Arnaud

On 3 Jun 2008, at 20:32, Raymond Hettinger wrote:
Ah yes. Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning. -- Arnaud

2008/6/3 Raymond Hettinger <python@rcn.com>:
I have looked at setobject.c, it seems mostly straightforward to make the change, e.g. for union this is the current union method: static PyObject * set_union(PySetObject *so, PyObject *other) { PySetObject *result; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; if ((PyObject *)so == other) return (PyObject *)result; if (set_update_internal(result, other) == -1) { Py_DECREF(result); return NULL; } return (PyObject *)result; } A slight change makes it accept variable number of arguments (note that I haven't tried to optimize it by rearranging the order of arguments): static PyObject * set_union(PySetObject *so, PyObject *args) { PySetObject *result; PyObject *arg; Py_ssize_t nargs; Py_ssize_t i; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; nargs = PyTuple_GET_SIZE(args); for (i = 0; i < nargs; i++) { arg = PyTuple_GET_ITEM(args, i); if ((PyObject *)so != arg && set_update_internal(result, arg) == -1) { Py_DECREF(result); return NULL; } } return (PyObject *)result; } And then in set_methods[], change {"union", (PyCFunction)set_union, METH_O, union_doc}, to {"union", (PyCFunction)set_union, METH_VARARGS, vunion_doc}, I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time. -- Arnaud

From: "Arnaud Delobelle" <arnodel@googlemail.com>
It's up to you. If you can get a reviewer and want to go for it, that's fine. Otherwise, you can wait on me to get to it (it's been on my list for several months). In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order. The other six cases should also be processed in-order (left-to-right). Raymond

On 4 Jun 2008, at 09:45, Raymond Hettinger wrote:
Given that A - X - Y - Z is the same as (A-X) & (A-Y) & (A-Z) If it is a significant optimization to intersect sets from smallest to largest (as opposed to just starting with the smallest one and then intersecting from left to right), then should the same idea be applied to difference, except that obviously you start with the leftmost one and sort the others from largest to smallest? (I am *not* proposing to compute A-X, A-Y, ... and then intersect them!) -- Arnaud

Just realized that I failed to send this to the list as well: On Tue, Jun 3, 2008 at 3:48 PM, Raymond Hettinger <python@rcn.com> wrote:
Right, but that's why these functions do not accept more than two arguments. They are intended to be used as instance methods only. If we're promoting the idea of a set.method(*args) usage, however, the usage should probably be intuitive. Because they are associative, union, intersection, and symmetric_difference are all intuitive and do what is expected no matter what. That is not true of set-difference. In other words, it looks to me like set.method(*args) is trying to say "Take all of elements in these iterables and make one set out of them," or more simply, "Throw all this crap together." Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense? In mathematics, there are symbols for set.union(*args) (big-U) and set.intersection(*args) (big-upside-down-U), because they actually come up in common usage. I'm not aware of any such symbols for other set operations. Now that doesn't necessarily mean we shouldn't support them, but it is certainly something to think about. To take it from another angle, it is easy to define: set.union(*args) - the set of all the elements appearing in at least one of the args set.intersection(*args) - the set of all the elements appearing in every arg set.symmetric_difference(*args) - the set of all the elements appearing in an odd number of arguments but: set.difference(*args) - the set of all elements appearing in the first arg but not any of the rest is fundamentally different. When using set operations, ordering shouldn't even be a consideration. However, A.difference(*args) - the set of all elements in A that do not appear in any of the args is well-defined. For that reason, I say that we should support *args for all set operations, but we should only promote the use of set.method syntax for intersection and union. set.difference doesn't seem well-defined, and set.symmetric_difference doesn't seem very useful (and could lead to usage of set.difference). So... +1 supporting *args for all set operations +1 documenting the usage of set.union(*args) and set.intersection(*args) as unioning/intersecting all of the arguments -1 even mentioning set.difference or set.symmetric_difference in static usage That's my 2c, Brandon

2008/6/3 Arnaud Delobelle <arnodel@googlemail.com>:
I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn. [1] http://bugs.python.org/issue3069 -- Arnaud
participants (10)
-
Aahz
-
Arnaud Delobelle
-
Brandon Mintern
-
George Sakkis
-
Greg Ewing
-
Imri Goldberg
-
Jacob Holm
-
Mathias Panzenböck
-
Mattias Engdegård
-
Raymond Hettinger