I would like to propose that the + operator be a synonym for | (union) on sets. IMHO, the succinct power of being able to say
sum(list_of_sets, set())
as opposed to
import operator reduce(operator.and_, list_of_sets, set())
far outweighs any problems with having two operators for union. The sum paradigm is certainly more readable as well. I realize that a function named "unionall" could be defined trivially, but with a built-in already serving the purpose in a readable and reasonable way, it seems silly to not use it.
I anticipate that someone will ask why such a paradigm should be available for union as opposed to the other set operations. My answer is that several standard algorithms rely on a union over a sequence of sets; one example is the construction of the "follow" set in constructing an LL or SLR parser, and I know I've seen others that I cannot remember off the top of my head. There is even a mathematical symbol (big-U) for it akin to the summation sign (uppercase sigma).
Any feedback?
Brandon
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
On Mon, Jun 2, 2008 at 7:48 AM, Brandon Mintern bmintern@gmail.com wrote:
I would like to propose that the + operator be a synonym for | (union) on sets. IMHO, the succinct power of being able to say
sum(list_of_sets, set())
as opposed to
import operator reduce(operator.and_, list_of_sets, set())
far outweighs any problems with having two operators for union. The sum paradigm is certainly more readable as well. I realize that a function named "unionall" could be defined trivially, but with a built-in already serving the purpose in a readable and reasonable way, it seems silly to not use it.
I anticipate that someone will ask why such a paradigm should be available for union as opposed to the other set operations. My answer is that several standard algorithms rely on a union over a sequence of sets; one example is the construction of the "follow" set in constructing an LL or SLR parser, and I know I've seen others that I cannot remember off the top of my head. There is even a mathematical symbol (big-U) for it akin to the summation sign (uppercase sigma).
Any feedback?
Brandon
Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
On Mon, Jun 2, 2008 at 9:21 PM, George Sakkis george.sakkis@gmail.com wrote:
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing _iadd_ would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant.
Brandon
On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern bmintern@gmail.com wrote:
>
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing _iadd_ would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant.
No, it uses __add__:
$ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set'
You can easily see the quadratic behavior of __add__:
$ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i10, i10+10)) for i in xrange(100)), Set())" 100 loops, best of 3: 2.4 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i10, i10+10)) for i in xrange(200)), Set())" 100 loops, best of 3: 8.04 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i10, i10+10)) for i in xrange(400)), Set())" 10 loops, best of 3: 33.3 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i10, i10+10)) for i in xrange(800)), Set())" 10 loops, best of 3: 141 msec per loop $ python -m timeit -s "class Set(set): __add__=set.__or__" "sum( (Set(range(i10, i10+10)) for i in xrange(1600)), Set())" 10 loops, best of 3: 684 msec per loop
George
On Mon, Jun 2, 2008 at 10:06 PM, George Sakkis george.sakkis@gmail.com wrote:
On Mon, Jun 2, 2008 at 9:28 PM, Brandon Mintern bmintern@gmail.com wrote: >
I thought max was implemented using += (i.e. it usually starts at 0 and uses += on each item in the iterable). If so, implementing _iadd_ would result in exactly the code you posted. I realize that I said __add__ in the first place, but __iadd__ is really what I meant.
No, it uses __add__:
$ python -c " class Set(set): __iadd__=set.__ior__ sum([Set([1]), Set([2])], Set()) " Traceback (most recent call last): File "<string>", line 3, in <module> TypeError: unsupported operand type(s) for +: 'Set' and 'Set'
You can easily see the quadratic behavior of __add__:
[snip]
Ouch. Never mind my idea then. I do find that rather strange, though. It seems kind of strange to be able define an initial value but not use it as an accumulator. This means that sum on list, mutable user number classes, etc. is bound to be less efficient than it could be. Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not", and then maybe we can revisit this idea at that time. Honestly, what's wrong with sum being defined as:
def sum (iterable, start=0): acc = start for i in iteratble: acc += i return acc
Even though I'm making a lot of bad proposals, I sure am learning a lot.
Brandon
On Tue, Jun 3, 2008 at 9:57 PM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
Brandon Mintern wrote: >
Perhaps a better proposal would be "change max to use __iadd__ if available, falling back to __add__ if not"
Obviously, I meant to say "sum" there instead of "max" (which I'm pretty sure you realized as well) -- I had been using max at the time that I wrote that e-mail.
That could change the behaviour of existing code that passes a mutable initial value.
-- Greg
That was my intention, to take advantage of increased efficiency provided by mutable initial values. Unfortunately, I didn't consider the "existing code" problem, but that's not really a problem in Python 3K, is it?
For example, sum(lists, []) currently runs in quadratic time (as pointed out by George Sakkis earlier in this thread using an example of sets that implement __add__). If instead, sum was implemented as:
def sum (iterable, init=0): for i in iterable: init += i return init
Then its behavior would mimic its current behavior for immutable types and other types which do not implement iadd, but for types that allow more efficient value modification, it would be a big win.
Is there a good use case for a time when you wouldn't want an initial value to be mutated? In my experience, I've always passed in throwaway initial values anyways, like [].
Brandon
p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".
On Tue, Jun 03, 2008, Brandon Mintern wrote: >
p.s. Sorry, Greg, for the dupe. Why don't the Python mailing lists generate Reply-To headers? It's pretty annoying to always have to remember to say "Reply to all" instead of simply "Reply".
Because the culture is to prefer private replies when public replies are not necessary. Plus it provides a small level of insurance against really private replies going out in public.
Here are the canonical URLs:
http://www.unicom.com/pw/reply-to-harmful.html
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 1993
On 3 Jun 2008, at 02:21, George Sakkis wrote:
Regardless of the operator, that's a pretty
inefficient way of doing
"unionall"; it creates N-1 intermediate result sets that discards
them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line
function has to be in the stdlib.
George
Perhaps it would be nice to have set.union (and set.intersection) to
accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T)
A.update(U)
A.update(V)
As a consequence of Python method implementation, one could write
instead:
A = set.union(S, T, U, V)
B = set.intersection(S, T, U, V)
which reads nicely
-- Arnaud
+1.
I use union and intersection as sum-like functions from time to time, enough to warrant implementation in my "standard utility import".
Although I'm not sure about the interface. A sum-like interface which receives a sequence might be better, although they are pretty much equivalent.
Imri Goldberg www.algorithm.co.il/blogs
Mathias Panzenböck wrote:
Arnaud Delobelle schrieb: >
As a consequence of Python method implementation, one could write instead:
A = set.union(S, T, U, V)
B = set.intersection(S, T, U, V)
which reads nicely
I think this, indeed, reads nicely.
-panzi
Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
Perhaps it would be nice to have set.union (and
set.intersection) to
accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T)
A.update(U)
A.update(V)
Something like this has been on my todo list for a while.
Patches are welcome. It should be done for union, intersection, difference, and symmetric difference. Some attempt should be made to optimize the ordering so that a&b&c&d will run from the smallest set to the largest (to minimize the total loop count).
Raymond
On 3 Jun 2008, at 19:04, Raymond Hettinger wrote:
Perhaps it
would be nice to have set.union (and set.intersection)
to accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T)
A.update(U)
A.update(V)
Something like this has been on my todo list for a while.
Patches are welcome. It should be done for union, intersection, difference, and symmetric difference.
Difference is not an associative operation though. E.g.
A - (B - B) = A but (A - B) - B = A - B
Same for symmetric difference (here "^" stands for symmetric
difference). E.g.
A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
-- Arnaud
On 3 Jun 2008, at 19:44, Raymond Hettinger wrote:
Difference is not an associative operation though. E.g.
A.difference(B, C, D) means A - B - C - D which can be (((A - B) - C) - D) or (((A - D) - C) - B) or (((A - C) - B) - D) or (((A - C) - D) - B)
You can do the subtractions from A in any order.
That's true. However there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
And
A.difference(B, C, D)
could be rewritten as
A - set.union(B, C, D)
which may be just as clear.
-- Arnaud
On 3 Jun 2008, at 20:32, Raymond Hettinger wrote:
> >
A.difference(B, C, D) could be rewritten as A - set.union(B, C, D) which may be just as clear.
And grotesquely inefficient.
Ah yes. Given the other grossly inaccurate statement in that post, I
conclude that drinking and posting on python-ideas are incompatible
occupations. So I will refrain from any further claim on that subject
till tomorrow morning.
-- Arnaud
Given the other grossly inaccurate statement in that
post, I
conclude that drinking and posting on python-ideas are incompatible
occupations. So I will refrain from any further claim on that subject
till tomorrow morning.
No worries, except that mailman saves a copy, google indexes it, and all of your posts will be visible to your great-great grandchildren for all eternity. Other than that, it's just a casual idea session between friends ;-)
Raymond
2008/6/3 Raymond Hettinger python@rcn.com:
Given the other grossly inaccurate statement in that post, I conclude that drinking and posting on python-ideas are incompatible occupations. So I will refrain from any further claim on that subject till tomorrow morning.
No worries, except that mailman saves a copy, google indexes it, and all of your posts will be visible to your great-great grandchildren for all eternity. Other than that, it's just a casual idea session between friends ;-)
Raymond
I have looked at setobject.c, it seems mostly straightforward to make the change, e.g. for union this is the current union method:
static PyObject set_union(PySetObject so, PyObject other) { PySetObject result;
result = (PySetObject *)set_copy(so);
if (result == NULL)
return NULL;
if ((PyObject *)so == other)
return (PyObject *)result;
if (set_update_internal(result, other) == -1) {
Py_DECREF(result);
return NULL;
}
return (PyObject *)result;
}
A slight change makes it accept variable number of arguments (note that I haven't tried to optimize it by rearranging the order of arguments):
static PyObject set_union(PySetObject so, PyObject args) { PySetObject result; PyObject *arg; Py_ssize_t nargs; Py_ssize_t i;
result = (PySetObject *)set_copy(so);
if (result == NULL)
return NULL;
nargs = PyTuple_GET_SIZE(args);
for (i = 0; i < nargs; i++) {
arg = PyTuple_GET_ITEM(args, i);
if ((PyObject *)so != arg && set_update_internal(result, arg) == -1) {
Py_DECREF(result);
return NULL;
}
}
return (PyObject *)result;
}
And then in set_methods[], change
{"union", (PyCFunction)set_union, METH_O,
union_doc},
to
{"union", (PyCFunction)set_union, METH_VARARGS,
vunion_doc},
I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time.
-- Arnaud
From: "Arnaud Delobelle" arnodel@googlemail.com
I would go through the whole of setobjects.c and try to do this for each relevant method. However I have not touched Python source code before, and I am not at all confident that I would make a good job of it. If someone offered to review my effort privately, I'd be happy to have a go but OTOH I don't want to waste anyone's time.
It's up to you. If you can get a reviewer and want to go for it, that's fine. Otherwise, you can wait on me to get to it (it's been on my list for several months).
In the case of intersection and intersection_update, if the inputs are
sets or dicts, then they should be processed smallest to largest.
If the inputs are not sets or dicts, then process them in input order.
The other six cases should also be processed in-order (left-to-right).
Raymond
2008/6/4 Raymond Hettinger python@rcn.com:
In the case of intersection and intersection_update, if the inputs are sets or dicts, then they should be processed smallest to largest. If the inputs are not sets or dicts, then process them in input order.
That means sorting the inputs by size, doesn't it? Looking at PyList_Sort, it doesn't let you provide a key, that means explicit DSU. Or is there another way?
-- Arnaud
On 4 Jun 2008, at 09:45, Raymond Hettinger wrote:
In the case of intersection and intersection_update,
if the inputs
are sets or dicts, then they should be processed smallest to
largest. If the inputs are not sets or dicts, then process them in
input order.
The other six cases should also be processed in-order (left-to-right).
Given that
A - X - Y - Z
is the same as
(A-X) & (A-Y) & (A-Z)
If it is a significant optimization to intersect sets from smallest to
largest (as opposed to just starting with the smallest one and then
intersecting from left to right), then should the same idea be applied
to difference, except that obviously you start with the leftmost one
and sort the others from largest to smallest?
(I am not proposing to compute A-X, A-Y, ... and then intersect them!)
-- Arnaud
From: "Arnaud Delobelle" arnodel@googlemail.com
there no way to predict the output of this:
set_of_sets = { {1, 2}, {2, 3} } set.difference(*set_of_sets)
That's silly. Lot's of functions do odd things with random argument ordering:
s = set([9, 3]) int.__sub__(*s) 6
Besides, you can already run the sample fragment in Py2.5:
sos = set( [frozenset([1, 2]), frozenset([2, 3])]) frozenset.difference(*sos) frozenset([1])
Raymond
Just realized that I failed to send this to the list as well:
On Tue, Jun 3, 2008 at 3:48 PM, Raymond Hettinger python@rcn.com wrote:
That's silly. Lot's of functions do odd things with random argument ordering:
s = set([9, 3]) int.__sub__(*s)
6
Besides, you can already run the sample fragment in Py2.5:
sos = set( [frozenset([1, 2]), frozenset([2, 3])]) frozenset.difference(*sos)
frozenset([1])
Raymond
Right, but that's why these functions do not accept more than two arguments. They are intended to be used as instance methods only. If we're promoting the idea of a set.method(*args) usage, however, the usage should probably be intuitive. Because they are associative, union, intersection, and symmetric_difference are all intuitive and do what is expected no matter what. That is not true of set-difference.
In other words, it looks to me like set.method(*args) is trying to say "Take all of elements in these iterables and make one set out of them," or more simply, "Throw all this crap together." Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense?
In mathematics, there are symbols for set.union(args) (big-U) and set.intersection(args) (big-upside-down-U), because they actually come up in common usage. I'm not aware of any such symbols for other set operations. Now that doesn't necessarily mean we shouldn't support them, but it is certainly something to think about.
To take it from another angle, it is easy to define:
set.union(*args) - the set of all the elements appearing in at least one of the args
set.intersection(*args) - the set of all the elements appearing in every arg
set.symmetric_difference(*args) - the set of all the elements appearing in an odd number of arguments
but:
set.difference(*args) - the set of all elements appearing in the first arg but not any of the rest
is fundamentally different. When using set operations, ordering shouldn't even be a consideration. However,
A.difference(*args) - the set of all elements in A that do not appear in any of the args
is well-defined. For that reason, I say that we should support *args for all set operations, but we should only promote the use of set.method syntax for intersection and union. set.difference doesn't seem well-defined, and set.symmetric_difference doesn't seem very useful (and could lead to usage of set.difference).
So...
+1 supporting args for all set operations +1 documenting the usage of set.union(args) and set.intersection(*args) as unioning/intersecting all of the arguments -1 even mentioning set.difference or set.symmetric_difference in static usage
That's my 2c, Brandon
Brandon Mintern wrote:
Intuitively, it _shouldn't_ matter what order the arguments are in. What is the meaning of taking the set-difference of a bunch of sets? Should we promote an operation that doesn't make any sense?
Taking a lead from mathematics here, there are symbols for the sum and product of a sequence (big-sigma and big-pi) but not difference or quotient, for similar reasons.
-- Greg
Arnaud Delobelle arnodel@googlemail.com writes:
A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
No, symmetric set difference is associative.
On 3 Jun 2008, at 20:12, Mattias Engdegård wrote:
Arnaud Delobelle arnodel@googlemail.com writes:
A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
No, symmetric set difference is associative.
Sorry, I don't know what came over me. It's obviously associative
because an element is in the symmetric difference if it is in an odd
numbers of sets.
-- Arnaud
Arnaud Delobelle wrote: >
Difference is not an associative operation though. E.g.
A - (B - B) = A but (A - B) - B = A - B
Same for symmetric difference (here "^" stands for symmetric difference). E.g.
A ^ (B ^ B) = A but (A ^ B) ^ B = A | B
Actually, symmetric difference is associative. The second '=' above is wrong unless B is a subset of A.
Regards
Jacob
2008/6/3 Arnaud Delobelle arnodel@googlemail.com: >
On 3 Jun 2008, at 02:21, George Sakkis wrote:
Regardless of the operator, that's a pretty inefficient way of doing "unionall"; it creates N-1 intermediate result sets that discards them right after they are added. It should be written as:
big_u = set() for s in all_sets: big_u.update(s)
I wouldn't mind having a standard unionall, but not every 3-line function has to be in the stdlib.
George
Perhaps it would be nice to have set.union (and set.intersection) to accept more than one argument, i.e. have
A = S.union(T, U, V)
mean
A = S.union(T) A.update(U) A.update(V)
As a consequence of Python method implementation, one could write instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely
I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn.
[1] http://bugs.python.org/issue3069
-- Arnaud
From: "Arnaud Delobelle" <arnodel@googlemail.com
As a consequence of Python method implementation, one could write instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely
I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn.
Thanks. It looks like I beat you to it. But I will go over your code and incorporate some version of the sorting for interections and harvest the tests. Also, I'll go ahead and add you to Misc/ACKS.
Raymond
On 10 Jun 2008, at 00:33, Raymond Hettinger wrote:
From: "Arnaud Delobelle" <arnodel@googlemail.com
As a
consequence of Python method implementation, one could write
instead:
A = set.union(S, T, U, V) B = set.intersection(S, T, U, V)
which reads nicely I've written a patch [1] that does that. Following the suggestion of Raymond Hettinger, I've implemented set.intersection by sorting all its sets/frozensets/dicts in increasing order of size first, then iterating over the smallest. It's the first time I try my hand at this so it might not be up to much, but I've made it so I might as well send it :). It's against py3k svn. [1] http://bugs.python.org/issue3069
Thanks. It looks like I beat you to it. But I will go over your code and incorporate some version of the sorting for interections and harvest the tests. Also, I'll go ahead and add you to Misc/ACKS.
Thanks! I'm a bit ashamed of the tests though.
-- Arnaud