I would like to propose that the + operator be a synonym for | (union) on sets. IMHO, the succinct power of being able to say sum(list_of_sets, set()) as opposed to import operator reduce(operator.and_, list_of_sets, set()) far outweighs any problems with having two operators for union. The sum paradigm is certainly more readable as well. I realize that a function named "unionall" could be defined trivially, but with a built-in already serving the purpose in a readable and reasonable way, it seems silly to not use it. I anticipate that someone will ask why such a paradigm should be available for union as opposed to the other set operations. My answer is that several standard algorithms rely on a union over a sequence of sets; one example is the construction of the "follow" set in constructing an LL or SLR parser, and I know I've seen others that I cannot remember off the top of my head. There is even a mathematical symbol (big-U) for it akin to the summation sign (uppercase sigma). Any feedback? Brandon
Regardless of the operator, that's a pretty inefficient way of doing
"unionall"; it creates N-1 intermediate result sets that discards them right
after they are added. It should be written as:
big_u = set()
for s in all_sets:
big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function
has to be in the stdlib.
George
On Mon, Jun 2, 2008 at 7:48 AM, Brandon Mintern
I would like to propose that the + operator be a synonym for | (union) on sets. IMHO, the succinct power of being able to say
sum(list_of_sets, set())
as opposed to
import operator reduce(operator.and_, list_of_sets, set())
far outweighs any problems with having two operators for union. The sum paradigm is certainly more readable as well. I realize that a function named "unionall" could be defined trivially, but with a built-in already serving the purpose in a readable and reasonable way, it seems silly to not use it.
I anticipate that someone will ask why such a paradigm should be available for union as opposed to the other set operations. My answer is that several standard algorithms rely on a union over a sequence of sets; one example is the construction of the "follow" set in constructing an LL or SLR parser, and I know I've seen others that I cannot remember off the top of my head. There is even a mathematical symbol (big-U) for it akin to the summation sign (uppercase sigma).
Any feedback?
Brandon _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
On Mon, Jun 2, 2008 at 9:21 PM, George Sakkis
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing ** _iadd_ ** would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant. Brandon
On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing ** _iadd_ ** would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant.
No, it uses __add__: $ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set' You can easily see the quadratic behavior of __add__: $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(100)), Set())" 100 loops, best of 3: 2.4 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(200)), Set())" 100 loops, best of 3: 8.04 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(400)), Set())" 10 loops, best of 3: 33.3 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(800)), Set())" 10 loops, best of 3: 141 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i*10, i*10+10)) for i in xrange(1600)), Set())" 10 loops, best of 3: 684 msec per loop George
On Mon, Jun 2, 2008 at 10:06 PM, George Sakkis
On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern
wrote: I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing ** _iadd_ ** would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant.
No, it uses __add__:
$ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set'
You can easily see the quadratic behavior of __add__:
[snip] Ouch. Never mind my idea then. I do find that rather strange, though. It seems kind of strange to be able define an initial value but not use it as an accumulator. This means that sum on list, mutable user number classes, etc. is bound to be less efficient than it could be. Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not", and then maybe we can revisit this idea at that time. Honestly, what's wrong with sum being defined as: def sum (iterable, start=0): acc = start for i in iteratble: acc += i return acc Even though I'm making a lot of bad proposals, I sure am learning a lot. Brandon
On 3 Jun 2008, at 02:21, George Sakkis wrote:
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have A = S.union(T, U, V) mean A = S.union(T) A.update(U) A.update(V) As a consequence of Python method implementation, one could write instead: A = set.union(S, T, U, V) B = set.intersection(S, T, U, V) which reads nicely -- Arnaud
+1. I use union and intersection as sum-like functions from time to time, enough to warrant implementation in my "standard utility import". Although I'm not sure about the interface. A sum-like interface which receives a sequence might be better, although they are pretty much equivalent. ------------------------- Imri Goldberg www.algorithm.co.il/blogs www.imri.co.il ------------------------- Insert Signature Here ------------------------- Mathias Panzenböck wrote:
Arnaud Delobelle schrieb:
As a consequence of Python method implementation, one could write
instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely
I think this, indeed, reads nicely.
-panzi _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T) A.update(U) A.update(V)
Something like this has been on my todo list for a while. Patches are welcome. It should be done for union, intersection, difference, and symmetric difference. Some attempt should be made to optimize the ordering so that a&b&c&d will run from the smallest set to the largest (to minimize the total loop count). Raymond
On 3 Jun 2008, at 19:04, Raymond Hettinger wrote:
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have A = S.union(T, U, V) mean A = S.union(T) A.update(U) A.update(V)
Something like this has been on my todo list for a while.
Patches are welcome. It should be done for union, intersection, difference, and symmetric difference.
Difference is not an associative operation though. E.g. A - (B - B) = A but (A - B) - B = A - B Same for symmetric difference (here "^" stands for symmetric difference). E.g. A ^ (B ^ B) = A but (A ^ B) ^ B = A | B -- Arnaud
On 3 Jun 2008, at 19:44, Raymond Hettinger wrote:
Difference is not an associative operation though. E.g.
A.difference(B, C, D) means A - B - C - D which can be (((A - B) - C) - D) or (((A - D) - C) - B) or (((A - C) - B) - D) or (((A - C) - D) - B)
You can do the subtractions from A in any order.
That's true. However there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
And A.difference(B, C, D) could be rewritten as A - set.union(B, C, D) which may be just as clear. -- Arnaud
Arnaud Delobelle wrote:
Difference is not an associative operation though. E.g.
A - (B - B) = A but (A - B) - B = A - B
Same for symmetric difference (here "^" stands for symmetric difference). E.g.
A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
Actually, symmetric difference *is* associative. The second '=' above is wrong unless B is a subset of A. Regards Jacob
From: "Arnaud Delobelle"
there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
That's silly. Lot's of functions do odd things with random argument ordering:
s = set([9, 3]) int.__sub__(*s) 6
Besides, you can already run the sample fragment in Py2.5:
sos = set( [frozenset([1, 2]), frozenset([2, 3])]) frozenset.difference(*sos) frozenset([1])
Raymond
On 3 Jun 2008, at 20:12, Mattias Engdegård wrote:
Arnaud Delobelle
writes: A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
No, symmetric set difference is associative.
Sorry, I don't know what came over me. It's obviously associative because an element is in the symmetric difference if it is in an odd numbers of sets. -- Arnaud
On 3 Jun 2008, at 20:32, Raymond Hettinger wrote:
A.difference(B, C, D) could be rewritten as A - set.union(B, C, D) which may be just as clear.
And grotesquely inefficient.
Ah yes. Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning. -- Arnaud
Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning.
No worries, except that mailman saves a copy, google indexes it, and all of your posts will be visible to your great-great grandchildren for all eternity. Other than that, it's just a casual idea session between friends ;-) Raymond
On Tue, Jun 3, 2008 at 9:57 PM, Greg Ewing
Brandon Mintern wrote:
Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not"
Obviously, I meant to say "sum" there instead of "max" (which I'm pretty sure you realized as well) -- I had been using max at the time that I wrote that e-mail.
That could change the behaviour of existing code that passes a mutable initial value.
-- Greg
That was my intention, to take advantage of increased efficiency provided by mutable initial values. Unfortunately, I didn't consider the "existing code" problem, but that's not really a problem in Python 3K, is it? For example, sum(lists, []) currently runs in quadratic time (as pointed out by George Sakkis earlier in this thread using an example of sets that implement __add__). If instead, sum was implemented as: def sum (iterable, init=0): for i in iterable: init += i return init Then its behavior would mimic its current behavior for immutable types and other types which do not implement iadd, but for types that allow more efficient value modification, it would be a big win. Is there a good use case for a time when you wouldn't want an initial value to be mutated? In my experience, I've always passed in throwaway initial values anyways, like []. Brandon p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".
Just realized that I failed to send this to the list as well:
On Tue, Jun 3, 2008 at 3:48 PM, Raymond Hettinger
That's silly. Lot's of functions do odd things with random argument ordering:
s = set([9, 3]) int.__sub__(*s)
6
Besides, you can already run the sample fragment in Py2.5:
sos = set( [frozenset([1, 2]), frozenset([2, 3])]) frozenset.difference(*sos)
frozenset([1])
Raymond
Right, but that's why these functions do not accept more than two arguments. They are intended to be used as instance methods only. If we're promoting the idea of a set.method(*args) usage, however, the usage should probably be intuitive. Because they are associative, union, intersection, and symmetric_difference are all intuitive and do what is expected no matter what. That is not true of set-difference. In other words, it looks to me like set.method(*args) is trying to say "Take all of elements in these iterables and make one set out of them," or more simply, "Throw all this crap together." Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense? In mathematics, there are symbols for set.union(*args) (big-U) and set.intersection(*args) (big-upside-down-U), because they actually come up in common usage. I'm not aware of any such symbols for other set operations. Now that doesn't necessarily mean we shouldn't support them, but it is certainly something to think about. To take it from another angle, it is easy to define: set.union(*args) - the set of all the elements appearing in at least one of the args set.intersection(*args) - the set of all the elements appearing in every arg set.symmetric_difference(*args) - the set of all the elements appearing in an odd number of arguments but: set.difference(*args) - the set of all elements appearing in the first arg but not any of the rest is fundamentally different. When using set operations, ordering shouldn't even be a consideration. However, A.difference(*args) - the set of all elements in A that do not appear in any of the args is well-defined. For that reason, I say that we should support *args for all set operations, but we should only promote the use of set.method syntax for intersection and union. set.difference doesn't seem well-defined, and set.symmetric_difference doesn't seem very useful (and could lead to usage of set.difference). So... +1 supporting *args for all set operations +1 documenting the usage of set.union(*args) and set.intersection(*args) as unioning/intersecting all of the arguments -1 even mentioning set.difference or set.symmetric_difference in static usage That's my 2c, Brandon
2008/6/3 Raymond Hettinger
Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning.
No worries, except that mailman saves a copy, google indexes it, and all of your posts will be visible to your great-great grandchildren for all eternity. Other than that, it's just a casual idea session between friends ;-)
Raymond
I have looked at setobject.c, it seems mostly straightforward to make the change, e.g. for union this is the current union method: static PyObject * set_union(PySetObject *so, PyObject *other) { PySetObject *result; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; if ((PyObject *)so == other) return (PyObject *)result; if (set_update_internal(result, other) == -1) { Py_DECREF(result); return NULL; } return (PyObject *)result; } A slight change makes it accept variable number of arguments (note that I haven't tried to optimize it by rearranging the order of arguments): static PyObject * set_union(PySetObject *so, PyObject *args) { PySetObject *result; PyObject *arg; Py_ssize_t nargs; Py_ssize_t i; result = (PySetObject *)set_copy(so); if (result == NULL) return NULL; nargs = PyTuple_GET_SIZE(args); for (i = 0; i < nargs; i++) { arg = PyTuple_GET_ITEM(args, i); if ((PyObject *)so != arg && set_update_internal(result, arg) == -1) { Py_DECREF(result); return NULL; } } return (PyObject *)result; } And then in set_methods[], change {"union", (PyCFunction)set_union, METH_O, union_doc}, to {"union", (PyCFunction)set_union, METH_VARARGS, vunion_doc}, I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time. -- Arnaud
From: "Arnaud Delobelle"
I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time.
It's up to you. If you can get a reviewer and want to go for it, that's fine. Otherwise, you can wait on me to get to it (it's been on my list for several months). In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order. The other six cases should also be processed in-order (left-to-right). Raymond
On Tue, Jun 03, 2008, Brandon Mintern wrote:
p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".
Because the culture is to prefer private replies when public replies are not necessary. Plus it provides a small level of insurance against really private replies going out in public. Here are the canonical URLs: http://www.unicom.com/pw/reply-to-harmful.html http://www.metasystema.net/essays/reply-to.mhtml -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 1993
2008/6/4 Raymond Hettinger
In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order.
That means sorting the inputs by size, doesn't it? Looking at PyList_Sort, it doesn't let you provide a key, that means explicit DSU. Or is there another way? -- Arnaud
Brandon Mintern wrote:
Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense?
Taking a lead from mathematics here, there are symbols for the sum and product of a sequence (big-sigma and big-pi) but not difference or quotient, for similar reasons. -- Greg
On 4 Jun 2008, at 09:45, Raymond Hettinger wrote:
In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order.
The other six cases should also be processed in-order (left-to-right).
Given that A - X - Y - Z is the same as (A-X) & (A-Y) & (A-Z) If it is a significant optimization to intersect sets from smallest to largest (as opposed to just starting with the smallest one and then intersecting from left to right), then should the same idea be applied to difference, except that obviously you start with the leftmost one and sort the others from largest to smallest? (I am *not* proposing to compute A-X, A-Y, ... and then intersect them!) -- Arnaud
2008/6/3 Arnaud Delobelle
On 3 Jun 2008, at 02:21, George Sakkis wrote:
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T) A.update(U) A.update(V)
As a consequence of Python method implementation, one could write instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely
I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn. [1] http://bugs.python.org/issue3069 -- Arnaud
From: "Arnaud Delobelle" As a consequence of Python method implementation, one could write instead: A = set.union(S, T, U, V)
B = set.intersection(S, T, U, V) which reads nicely I've written a patch [1] that does that. Following the suggestion of
Raymond Hettinger, I've implemented set.intersection by sorting all
its sets/frozensets/dicts in increasing order of size first, then
iterating over the smallest. It's the first time I try my hand at
this so it might not be up to much, but I've made it so I might as
well send it :). It's against py3k svn. Thanks. It looks like I beat you to it. But I will go over your code
and incorporate some version of the sorting for interections and
harvest the tests. Also, I'll go ahead and add you to Misc/ACKS.
Raymond
On 10 Jun 2008, at 00:33, Raymond Hettinger wrote:
From: "Arnaud Delobelle"
As a consequence of Python method implementation, one could write instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn. [1] http://bugs.python.org/issue3069
Thanks. It looks like I beat you to it. But I will go over your code and incorporate some version of the sorting for interections and harvest the tests. Also, I'll go ahead and add you to Misc/ACKS.
Thanks! I'm a bit ashamed of the tests though. -- Arnaud
participants (10)
-
Aahz
-
Arnaud Delobelle
-
Brandon Mintern
-
George Sakkis
-
Greg Ewing
-
Imri Goldberg
-
Jacob Holm
-
Mathias Panzenböck
-
Mattias Engdegård
-
Raymond Hettinger