Re: [Python-Dev] PEP 218 (sets); moving set.py to Lib
[Michael McLay]
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/sets/set.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
Did you consider making BaseSet._data a slot?
Hm, maybe I should. If this is a proposed standard data type, we might as well get people used to the fact that they can't add random new instance variables without subclassing first. OTOH what then to do with _sort_repr -- make it a class var or an instance var? --Guido van Rossum (home page: http://www.python.org/~guido/)
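The restriction __slots__ buys is easy to demonstrate (illustrative session, not from the thread; BaseSet and _data are the names in the sandbox set.py):

    >>> class BaseSet(object):
    ...     __slots__ = ['_data']   # only _data may be assigned
    ...
    >>> s = BaseSet()
    >>> s._data = {}                # fine: declared in __slots__
    >>> s.color = 'red'             # a random new instance variable
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'BaseSet' object has no attribute 'color'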
[Guido van Rossum]
OTOH what then to do with _sort_repr -- make it a class var or an instance var?
Well, how often can you imagine someone printing out a single set sorted, but having other sets that they didn't want printed out sorted? I would suspect that it is going to be a very rare case when someone wants just part of their sets printing sorted and the rest not. I say make it a class var. -Brett C.
[Guido van Rossum]
OTOH what then to do with _sort_repr -- make it a class var or an instance var?
[Brett C]
Well, how often can you imagine someone printing out a single set sorted, but having other sets that they didn't want printed out sorted? I would suspect that it is going to be a very rare case when someone wants just part of their sets printing sorted and the rest not.
I say make it a class var.
Hm, but what if two different library modules have conflicting requirements? E.g. module A creates sets of complex numbers and must have sort_repr=False, while module B needs sort_repr=True for user-friendliness (or because it relies on this).

My current approach (now in CVS!) is to remove the sort_repr flag to the constructor, but to provide a method that can produce a sorted or an unsorted representation. __repr__ will always return the items unsorted, which matches what repr(dict) does. After all, I think it could be confusing to a user when 'print s' shows

    Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

but

    for i in s: print i,

prints

    9 8 7 6 5 4 3 2 1 0

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido van Rossum]
My current approach (now in CVS!) is [...] to provide a method that can produce a sorted or an unsorted representation.
Could a method with a similar name be available for dicts as well? -- François Pinard http://www.iro.umontreal.ca/~pinard
Could a method with a similar name be available for dicts as well?
Well, it wouldn't have any advantage over doing this "by hand", extracting the keys into a list and sorting that. The same reasoning applies to the sets class, which is why I've made it a non-public method (named '_repr'). It may go if I find a solution to the one use there is in the test suite. --Guido van Rossum (home page: http://www.python.org/~guido/)
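The "by hand" route Guido refers to is only a couple of lines (a sketch, assuming an ordinary dict d):

    keys = d.keys()
    keys.sort()
    for k in keys:
        print k, d[k]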
[Guido]
... My current approach (now in CVS!) is to remove the sort_repr flag to the constructor, but to provide a method that can produce a sorted or an unsorted representation.
+1. That's the best way to go.
__repr__ will always return the items unsorted, which matches what repr(dict) does. After all, I think it could be confusing to a user when 'print s' shows
Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
but
for i in s: print i,
prints
9 8 7 6 5 4 3 2 1 0
    >>> from sets import Set
    >>> print Set(range(10))
    Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
When I optimized a useless ~ out of the dict code for 2.2, it became much more likely that the traversal order for an int-keyed dict would match numeric order. I have evidence that this has fooled newbies into believing that dicts are ordered maps! If it wouldn't cost an extra cycle, I'd be tempted to slop the ~ back in again <0.9 wink>.
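The effect Tim describes is easy to reproduce (illustrative session; the tidy numeric order is an accident of small ints hashing to themselves, not a guarantee):

    >>> d = {}
    >>> for i in range(10):
    ...     d[i] = None
    ...
    >>> d.keys()    # looks sorted, but dicts promise no order at all
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]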
[Guido van Rossum] <snip>
My current approach (now in CVS!) is to remove the sort_repr flag to the constructor, but to provide a method that can produce a sorted or an unsorted representation. __repr__ will always return the items unsorted, which matches what repr(dict) does. After all, I think it
I just updated my CVS copy and I like your implementation. +1 from me for how you are handling it.
could be confusing to a user when 'print s' shows
Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
but
for i in s: print i,
prints
9 8 7 6 5 4 3 2 1 0
Good point. As François pointed out, newbies do expect dicts to always come out in the same order (and I must say that I, like François, have never seen any docs saying that it does always come out the same order as long as nothing has mutated) and I would expect that expectation to carry over to sets. -Brett C.
When I optimized a useless ~ out of the dict code for 2.2, it became much more likely that the traversal order for an int-keyed dict would match numeric order. I have evidence that this has fooled newbies into believing that dicts are ordered maps! If it wouldn't cost an extra cycle, I'd be tempted to slop the ~ back in again <0.9 wink>.
Maybe add a ~ to the int hash code? That means it's not on the critical path for dicts with string keys. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Monday 19 August 2002 04:38 pm, Guido van Rossum wrote:
[Guido van Rossum]
OTOH what then to do with _sort_repr -- make it a class var or an instance var?
[Brett C]
Well, how often can you imagine someone printing out a single set sorted, but having other sets that they didn't want printed out sorted? I would suspect that it is going to be a very rare case when someone wants just part of their sets printing sorted and the rest not.
I say make it a class var.
Hm, but what if two different library modules have conflicting requirements? E.g. module A creates sets of complex numbers and must have sort_repr=False, while module B needs sort_repr=True for user-friendliness (or because it relies on this).
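The clobbering Guido describes is easy to picture (the module names here are hypothetical; _sort_repr is the class attribute under discussion):

    # a.py -- complex numbers don't compare, so repr must not sort
    import sets
    sets.Set._sort_repr = False

    # b.py -- relies on sorted output, silently overriding a.py
    import sets
    sets.Set._sort_repr = True

Whichever module happens to run last wins, for every set in the program.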
Adding a SortedSet class to the module would partially solve the problem of globally clobbering the usage of Set in other modules. The downside is that the selection of the sort function would be a static decision made when a set instance is created.
    >>> class Set(object):
    ...     _sort_repr = False
    ...     __slots__ = ["_data"]
    ...     def __init__(self....

    >>> class SortedSet(Set):
    ...     _sort_repr = True
    ...
    >>> ss = SortedSet()
    >>> s = Set()
    >>> s._sort_repr
    False
    >>> ss._sort_repr
    True
Adding a SortedSet class to the module would partially solve the problem of globally clobbering the usage of Set in other modules.
I say YAGNI. I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument). --Guido van Rossum (home page: http://www.python.org/~guido/)
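Usage of that method would presumably look like this (the method name and the 'sorted' argument are from the message above; the exact call form is an assumption):

    s = Set([3, 1, 2])
    print repr(s)               # element order follows the underlying dict
    print s._repr(sorted=True)  # Set([1, 2, 3]) -- deterministic, e.g. for tests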
On Tuesday 20 August 2002 01:22 pm, Guido van Rossum wrote:
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
I haven't read the entire thread, but I was puzzled by the implementation approach. Did you consider kjbuckets for the standard Python distribution? While the claim is rather old, the following quote from Aaron's intro [1] to the module suggests it might improve performance:

    For suitably large compute intensive uses these types should provide
    up to an order of magnitude speedup versus an implementation that
    uses analogous operations implemented directly in Python.

Adding the gadfly SQL database to the standard library would also be useful, but since it is back under development it would be best for gadfly to live on a separate release cycle. The kjbuckets software, however, doesn't seem to be changing.

One more reason for adding kjbuckets: Tim Berners-Lee might find the kjGraphs class useful for the semantic web work.

[1] http://starship.python.net/crew/aaron_watters/kjbuckets/kjbuckets.html
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
I haven't read the entire thread, but I was puzzled by the implementation approach. Did you consider kjbuckets for the standard Python distribution?
No. I think that would be the wrong idea at this point for two reasons: (1) never change two variables at the same time; (2) let's gather some experience with the new set API first, before we start worrying about implementation speed. I also believe that kjbuckets maintains its data in a sorted order, which is unnecessary for sets -- a hash table is much faster. After all we use a very fast hash table implementation to represent sets. (The only improvement would be that we could save maybe 4 bytes per hash table entry because we don't need a value pointer.)
While the claim is rather old, the following quote from Aaron's intro [1] to the module suggests it might improve performance:
For suitably large compute intensive uses these types should provide up to an order of magnitude speedup versus an implementation that uses analogous operations implemented directly in Python.
The sets module does not implement analogous operations directly in Python. Almost all the implementation work is done by the dict implementation.
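A sketch of the delegation Guido means (the _data attribute is the one in the sandbox set.py; these method bodies are illustrative, not the actual source):

    class BaseSet(object):
        def __init__(self):
            self._data = {}     # elements live as dict keys; values are dummies
        def __contains__(self, element):
            return element in self._data    # a C-speed dict lookup
        def __len__(self):
            return len(self._data)
        def __iter__(self):
            return iter(self._data)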
Adding the gadfly SQL database to the standard library would also be useful, but since it is back under development it would be best for gadfly to live on a separate release cycle. The kjbuckets software, however, doesn't seem to be changing.
Because nobody is maintaining it any more.
One more reason for adding kjbuckets: Tim Berners-Lee might find the kjGraphs class useful for the semantic web work.
[1] http://starship.python.net/crew/aaron_watters/kjbuckets/kjbuckets.html
kjbuckets may be nice, but adding it to the core would add a serious new maintenance burden for the core developers. I don't see anyone raising their hand to help out here. --Guido van Rossum (home page: http://www.python.org/~guido/)
From: "Guido van Rossum"
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet. It should be replaced by code raising a TypeError just like we do for basestring:
    >>> basestring('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: The basestring type cannot be instantiated
Raymond Hettinger P.S. More comments are on the way as we play with, profile, review, optimize, and document the module ;)
From: "Guido van Rossum"
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet. It should be replaced by code raising a TypeError just like we do for basestring:
    >>> basestring('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: The basestring type cannot be instantiated
Good idea. I checked this in, raising NotImplementedError.
Raymond Hettinger
P.S. More comments are on the way as we play with, profile, review, optimize, and document the module ;)
Didn't you submit a SF patch/bug? I think I replied to that. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido van Rossum]
I am still perplexed that I received *no* feedback on the sets module
As I previously said, I feel comfortable with what I read and saw. I'd probably have to use sets for having more circumstantiated comments. Unless you offer the source on c.l.py and ask for more users' opinions?

Maybe some people would have preferred to see more usual notation, like `+' for union and `*' for intersection, rather than `or' and `and'? There are tiny pros and cons in each direction. For one, I'll gladly use what is available, I'm not really going to crusade for either notation...

Should there be special provisions for Sets to interoperate magically with lists or iterators? Lists and iterators could be considered as ordered sets with duplicates allowed. Even if it could be tinily useful, it is surely not difficult to explicitly "cast" lists and iterators using the `Set' constructor. It is already easy to build an iterator or a list out of a set.

Criticism? OK! What about supporting infinite sets? :-)

Anything else? Hmph! The module doc-string has the word "actually" with three `l'! :-)

-- François Pinard http://www.iro.umontreal.ca/~pinard
[GvR]
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
[RH]
P.S. More comments are on the way as we play with, profile, review, optimize, and document the module ;)
[GvR]
Didn't you submit a SF patch/bug? I think I replied to that.
Yes. I've now revised the patch accordingly.

More thoughts:

1. Rename .remove() to __del__(). Its usage is inconsistent with list.remove(element) which can leave other instances of element in the list. It is more consistent with 'del adict[element]'.

2. discard() looks like a useful standard API. Perhaps it should be added to the dictionary API.

3. Should we add .as_temporarily_immutable to dictionaries and lists so that they will also be potential elements of a set?

4. remove(), update(), add(), and __contains__() all work hard to check for .as_temporarily_immutable(). Should this be propagated to other methods that add set members (i.e. replace all instances of data[element] = value with self.add(element), or use self.update() in the code for __init__())? The answer is tough because it causes an enormous slowdown in the common use cases of uniquifying a sequence. OTOH, why check in some places but not others -- why is .add(aSetInstance) okay but not Set([aSetInstance])? If the answer is yes, then the code for update() should be super-optimized by moving the try/except outside the for-loop and wrapping the whole thing in a while 1. Also, we could bypass the slower .add() method when the incoming source of elements is known to be an instance of BaseSet.

5. Add a quick pre-check to issubset() and issuperset() along the lines of:

    def issubset(self, other):
        """Report whether another set contains this set."""
        self._binary_sanity_check(other)
        if len(self) > len(other):
            return False        # Fast check for the obvious case
        for elt in self:
            if elt not in other:
                return False
        return True

6. For clarity and foolish consistency, replace all occurrences of 'elt' with 'element'.

Raymond Hettinger
On Tue, 20 Aug 2002, Guido van Rossum wrote:
From: "Guido van Rossum"
I am still perplexed that I received *no* feedback on the sets module except on this issue of sort order (which I consider solved by adding a method _repr() that takes an optional 'sorted' argument).
I think the __init__() code in BaseSet should be pushed down into Set and ImmutableSet. It should be replaced by code raising a TypeError just like we do for basestring:
    >>> basestring('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: The basestring type cannot be instantiated
Good idea. I checked this in, raising NotImplementedError.
Is there any particular reason BaseSet and basestring need to raise different exceptions on an attempt at instantiation? /Paul
On Tue, Aug 20, 2002, Raymond Hettinger wrote:
1. Rename .remove() to __del__(). Its usage is inconsistent with list.remove(element) which can leave other instances of element in the list. It is more consistent with 'del adict[element]'.
You mean __delitem__, I think. __del__ is only for deleting the object itself when its refcount goes to zero.
3. Should we add .as_temporarily_immutable to dictionaries and lists so that they will also be potential elements of a set?
There's been some talk in the past about creating lockable dicts and lists (emphasis on dicts because lists have tuple-equivalence). -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
From: "Aahz"
1. Rename .remove() to __del__(). Its usage is inconsistent with list.remove(element) which can leave other instances of element in the list. It is more consistent with 'del adict[element]'.
You mean __delitem__, I think. __del__ is only for deleting the object itself when its refcount goes to zero.
Yes!
I am still perplexed that I received *no* feedback on the sets module
As I previously said, I feel comfortable with what I read and saw. I'd probably have to use sets for having more circumstantiated comments.
Fair enough.
Unless you offer the source on c.l.py and ask for more users' opinions?
Last time I tried that it turned out a bad idea. I prefer feedback over a flame war.
Maybe some people would have preferred to see more usual notation, like `+' for union and `*' for intersection, rather than `or' and `and'? There are tiny pros and cons in each direction. For one, I'll gladly use what is available, I'm not really going to crusade for either notation...
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP). But yes, this is decent feedback (with good enough arguments, Greg's conclusion might even be overturned).
Should there be special provisions for Sets to interoperate magically with lists or iterators? Lists and iterators could be considered as ordered sets with duplicates allowed. Even if it could be tinily useful, it is surely not difficult to explicitly "cast" lists and iterators using the `Set' constructor. It is already easy to build an iterator or a list out of a set.
You can do an in-place union of a Set and a sequence or iterable with set.update(seq). If you want intersection or a difference, or your set is immutable, you'd have to cast the sequence to a set. What's the use case? Which brings me to another open issue. set.update(seq) and set.add(element) have a provision to transform the inserted element(s) to an ImmutableSet if needed. Should the constructor do the same?
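In code, the distinction reads like this (a short illustration using the API as described above):

    s = Set([1, 2, 3])
    s.update([3, 4, 5])          # in-place union works with any iterable
    common = s & Set([2, 4, 6])  # intersection: cast the sequence to a Set first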
Criticism? OK! What about supporting infinite sets? :-) Anything else? Hmph! The module doc-string has the word "actually" with three `l'! :-)
Not any more, thanks to Raymond Hettinger. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school.
Really? The notation I learned in school was big-rounded-U for union and big-upside-down-rounded-U for intersection. Not available in the ASCII character set, unfortunately. But I agree that | and & are fairly intuitive substitutes for these, and they agree with what you use for bit twiddling. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Guido van Rossum
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP).
+1 on preferring | and & to `or' and `and'. To me, `or' and `and' say that what's being composed are predicates, not sets. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
"Eric S. Raymond"
To me, `or' and `and' say that what's being composed are predicates, not sets.
Besides which, they can't be overridden in Python anyway. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Greg Ewing
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school.
Really? The notation I learned in school was big-rounded-U for union and big-upside-down-rounded-U for intersection. Not available in the ASCII character set, unfortunately.
For historical reasons, there are three different notations for Boolean algebra in common use. You're describing the one derived from set theory. I personally favor the one derived from lattice algebra; the distinctive feature of that one is the pointy and/or operators that look like /\ and \/. The third uses | and &. The set-theoretic notation is the oldest. I think Birkhoff & MacLane invented the lattice-theory notation in the 1940s. It is probably *slightly* more popular than the set-theoretic notation. The | & one is distinctly less common than either, at least among mathematicians; I think EEs and suchlike may use it more than we do.
But I agree that | and & are fairly intuitive substitutes for these, and they agree with what you use for bit twiddling.
Not an insignificant point. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Is there any particular reason BaseSet and basestring need to raise different exceptions on an attempt at instantiation?
Hm, I dunno. NotImplementedError was intended for this kind of use, but TypeError also matches. I'll add an XXX for this. --Guido van Rossum (home page: http://www.python.org/~guido/)
1. Rename .remove() to __del__(). Its usage is inconsistent with list.remove(element) which can leave other instances of element in the list. It is more consistent with 'del adict[element]'.
(You mean __delitem__.) -1. Sets don't support the x[y] notation for accessing or setting elements, so it would be weird to use that for deleting. You're not deleting the value corresponding to the key or index y (like you are when using del x[y] on a list or dict), you're deleting y itself. That's more like x.remove(y) for lists.
2. discard() looks like a useful standard API. Perhaps it should be added to the dictionary API.
Perhaps. But the dict API already has so many things. And why not to the list API? I'm -0 on this.
3. Should we add .as_temporarily_immutable to dictionaries and lists so that they will also be potential elements of a set?
Hm, I think this is premature. I'd like to see a use case for a set of dicts and a set of lists first. Then you can code list and dict subclasses in Python that implement _as_temporarily_immutable (and _as_immutable too, I suppose). Then we'll see how often this ends up being used. For now, I'd say YAGNI. (I've said YAGNI to sets for years, but the realization that as a result lots of people independently invented using the keys of a dict to represent a set made me change my mind. Sort of like how I changed my mind on a Boolean type, too, after 12 years of thinking it wasn't needed. :-)
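Such a subclass might look like this (a sketch; HashableList is a hypothetical name, and _as_immutable is the hook used in the update() code quoted below):

    from sets import Set

    class HashableList(list):
        def _as_immutable(self):
            return tuple(self)   # a tuple is the natural immutable twin of a list

    s = Set()
    s.add(HashableList([1, 2]))  # stored as the tuple (1, 2) via the fallback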
4. remove(), update(), add(), and __contains__() all work hard to check for .as_temporarily_immutable(). Should this be propagated to other methods that add set members (i.e. replace all instances of data[element] = value with self.add(element), or use self.update() in the code for __init__())?
I've been thinking the same thing. I think that the only case where this could apply is __init__(), by the way.
The answer is tough because it causes an enormous slowdown in the common use cases of uniquifying a sequence. OTOH, why check in some places but not others -- why is .add(aSetInstance) okay but not Set([aSetInstance])?
Really? Why the slowdown? I was thinking of simply changing __init__ into

    if seq is not None:
        self.update(seq)

If that's too slow, perhaps update() could be changed to the following:

    it = iter(seq)
    try:
        for elt in it:
            data[elt] = value
    except TypeError:
        transform = getattr(elt, '_as_immutable', None)
        if transform is None:
            raise
        data[transform()] = value
        self.update(it)

That is, if there are no elements that require transformation, the added cost is a single try/except setup (plus an extra call to iter()). If any element requires transformation, the rest of the elements are dealt with as fast as update() can. Hm, maybe this could be applied to update() too (except it shouldn't call itself recursively but simply write the loop out a second time, with a try/except around each element).
If the answer is yes, then the code for update() should be super-optimized by moving the try/except outside the for-loop and wrapping the whole thing in a while 1.
That's a similar idea as I just sketched. Can you email me a proposed patch? (Let's skip SF for this.)
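A sketch of what such a patch might look like, combining Raymond's while-loop idea with Guido's fallback (the _data and _as_immutable names follow the sandbox set.py; everything else here is an assumption, not the patch that was actually mailed):

    def update(self, iterable):
        data = self._data
        value = True
        it = iter(iterable)
        while True:
            try:
                for element in it:          # fast path: plain dict stores
                    data[element] = value
            except TypeError:
                # unhashable element: transform it if possible, then
                # resume the fast loop on the rest of the iterator
                transform = getattr(element, '_as_immutable', None)
                if transform is None:
                    raise
                data[transform()] = value
            else:
                break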
Also, we could bypass the slower .add() method when the incoming source of elements is known to be an instance of BaseSet.
Huh? Nobody calls add() internally AFAIK.
5. Add a quick pre-check to issubset() and issuperset() along the lines of:
    def issubset(self, other):
        """Report whether another set contains this set."""
        self._binary_sanity_check(other)
        if len(self) > len(other):
            return False        # Fast check for the obvious case
        for elt in self:
            if elt not in other:
                return False
        return True
Sure. Check it in.
6. For clarity and foolish consistency, replace all occurrences of 'elt' with 'element'.
Hm, no. 'element' for a loop control variable seems too long (I'd be happy with 'x' but Greg Wilson used 'element'). However I like 'element' as the argument name because it can be used as a keyword argument and then it's better spelled out in full. I think Greg Wilson used 'item' most of the time; I prefer to be consistent and say 'element' all the time since that's the accepted set terminology. --Guido van Rossum (home page: http://www.python.org/~guido/)
It should have powerset and cartesian-product methods. Shall I code them?
The Cartesian product of two sets is easy:

    def product(s1, s2):
        cp = Set()
        for x in s1:
            for y in s2:
                cp.add((x, y))
        return cp

But I'm not sure if this is useful enough to add to the set class -- it seems a great way to waste a lot of memory, and typical uses are probably better served by writing out the for-loops. Perhaps this could be coded as a generator though.

More fun would be the cartesian product of N sets; I guess it would have to be recursive. Here's a generator version:

    def product(s, *sets):
        if not sets:
            for x in s:
                yield (x,)
        else:
            subproduct = list(product(*sets))
            for x in s:
                for t in subproduct:
                    yield (x,) + t

Note that this doesn't require sets; its arguments can be arbitrary iterables. So maybe this belongs in a module of iterator utilities rather than in the sets module.

API choice: product() with a single argument yields a series of 1-tuples. That's slightly awkward, but works better for the recursion. And specifically asking for the Cartesian product of a single set is kind of pointless.

If we *do* add this to the Set class, it could be aliased to __mul__ (giving another reason why using * for set intersection is a bad idea).

Here's a naive powerset implementation returning a set:

    def power(s):
        ps = Set()
        ps.add(ImmutableSet([]))
        for elt in s:
            s1 = Set([elt])
            ps1 = Set()
            for ss in ps:
                ps1.add(ss | s1)
            ps |= ps1
        return ps

This is even more of a memory hog; however the algorithm is slightly more subtle so it's perhaps more valuable to have this in the library. Here's a generator version:

    def power(s):
        if len(s) == 0:
            yield Set()
        else:
            # Non-destructively choose a random element:
            x = Set([iter(s).next()])
            for ss in power(s - x):
                yield ss
                yield ss | x

I'm not totally happy with this -- it recurses for each element in s, creating a new set at each level that is s minus one element. I'd prefer to build the set up from the other end, like the first version. IOW I'd love to see your version. :-)

The first power() example raises a point about the set API: the Set() constructor can be called without an iterable argument, but ImmutableSet() cannot. Maybe ImmutableSet() should be allowed too? It creates an immutable empty set. (Hm, this could be a singleton. __new__ could take care of that.)

While comparing the various versions of power(), I also ran into an interesting bug in the code. While Set([1]) == ImmutableSet([1]), Set([Set([1])]) != Set([ImmutableSet([1])]). I have to investigate this.

--Guido van Rossum (home page: http://www.python.org/~guido/)
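As a quick check of the API choice above, the generator version composes the way one would hope (illustrative; element order depends on set iteration order):

    pairs = list(product(Set([1, 2]), Set(['a', 'b'])))
    # four 2-tuples: (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), in some order
    singles = list(product(Set([1, 2])))
    # [(1,), (2,)] -- a single argument yields 1-tuples, as noted above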
Guido van Rossum
Hm, no. 'element' for a loop control variable seems too long (I'd be happy with 'x' but Greg Wilson used 'element'). However I like 'element' as the argument name because it can be used as a keyword argument and then it's better spelled out in full. I think Greg Wilson used 'item' most of the time; I prefer to be consistent and say 'element' all the time since that's the accepted set terminology.
Briefly reverting to type as a logician, Eric applauds. Sometimes I tell you not to sweat what my ex-colleagues will think, but this is a case in which using mathematically-correct terminology will *not* obscure the difference between stateless/mathematical reasoning and stateful/programming reasoning, and is therefore a good idea. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Guido van Rossum
Here's a generator version:
    def power(s):
        if len(s) == 0:
            yield Set()
        else:
            # Non-destructively choose a random element:
            x = Set([iter(s).next()])
            for ss in power(s - x):
                yield ss
                yield ss | x
I'm not totally happy with this -- it recurses for each element in s, creating a new set at each level that is s minus one element. I'd prefer to build the set up from the other end, like the first version.
You're right, that is an ugly and opaque piece of code. Guido me lad, you have been led down the garden path by a dubious lust for recursive elegance. One might almost think you were a LISP hacker or something.
IOW I'd love to see your version. :-)
Here's the pre-generator version I wrote using lists as the underlying representation. Should be trivially transformable into a generator version. I'd do it myself but I'm heads-down on bogofilter just now.

    def powerset(base):
        "Compute the set of all subsets of a set."
        powerset = []
        for n in xrange(2 ** len(base)):
            subset = []
            for e in xrange(len(base)):
                if n & 2 ** e:
                    subset.append(base[e])
            powerset.append(subset)
        return powerset

Are you slapping your forehead yet? :-)

-- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
[Guido van Rossum]
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP).
The very usual notation for me has been the big `U' for union and the same, upside-down, for intersection, but even now that Python supports Unicode, these are not Python operators _yet_. :-) I never saw `|' nor `&' in literature, except `|' which means "such that" in set comprehensions, as Pythoneers would be tempted to say! On the other hand, for programmers, `|' and `&' are rather natural and easy. Eric has offered the idea of adding Cartesian product, and even though the usual notation is a tall thin `X', maybe it would be nice reserving `*' for that? It might not be explicit enough, and besides, there are other circumstances in Algebra, not so far from sets, when one might need many multiplicative operators, so `*' would easily get over-used. -- François Pinard http://www.iro.umontreal.ca/~pinard
Is there any particular reason BaseSet and basestring need to raise different exceptions on an attempt at instantiation?
Hm, I dunno. NotImplementedError was intended for this kind of use, but TypeError also matches. I'll add an XXX for this.
I found a good reason why it should be TypeError, so TypeError it is. --Guido van Rossum (home page: http://www.python.org/~guido/)
Here's the pre-generator version I wrote using lists as the underlying representation. Should be trivially transformable into a generator version. I'd do it myself but I'm heads-down on bogofilter just now
    def powerset(base):
        "Compute the set of all subsets of a set."
        powerset = []
        for n in xrange(2 ** len(base)):
            subset = []
            for e in xrange(len(base)):
                if n & 2 ** e:
                    subset.append(base[e])
            powerset.append(subset)
        return powerset
Are you slapping your forehead yet? :-)
Yes! I didn't actually know that algorithm.
Here's the generator version for sets (still requires a real set as input):

    def powerset(base):
        size = len(base)
        for n in xrange(2**size):
            subset = []
            for e, x in enumerate(base):
                if n & 2**e:
                    subset.append(x)
            yield Set(subset)
I would like to write n & (1<<e) [...]
On Tue, Aug 20, 2002, Guido van Rossum wrote:
Is there any particular reason BaseSet and basestring need to raise different exceptions on an attempt at instantiation?
Hm, I dunno. NotImplementedError was intended for this kind of use, but TypeError also matches. I'll add an XXX for this.
I found a good reason why it should be TypeError, so TypeError it is.
Mind telling us? (I've always used NotImplementedError, so I'm curious.) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/
Mind telling us? (I've always used NotImplementedError, so I'm curious.)
OK, from the checkins: - Changed the NotImplementedError in BaseSet.__init__ to TypeError, both for consistency with basestring() and because we have to use TypeError when denying Set.__hash__. Together those provide sufficient evidence that an unimplemented method needs to raise TypeError. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum
Are you slapping your forehead yet? :-)
Yes! I didn't actually know that algorithm.
I thought it up myself. Which is funny, since to get there you have to think like a hardware engineer rather than a logician. My brain was definitely out of character that day.
This is a generator that yields a series of lists whose values are the items of base. And again, like cartesian product, it's now more a generator thing than a set thing.
I don't care where it lives, really. I just like the concision of being able to say foo.powerset(). Not that I've used this yet, but I know one algorithm for which it would be helpful. Another one I invented, actually, back when I really was a mathematician -- a closed form for sums of certain categories of probability distributions. I called it the Dungeon Dice Theorem. Never published it.
BTW, the correctness of all my versions trivially derives from the correctness of your version -- each is a very simple transformation of the previous one. My mentor Lambert Meertens calls this process Algorithmics (and has developed a mathematical notation and theory for program transformations).
Web pointer? -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
François Pinard
I never saw `|' nor `&' in literature, except `|' which means "such that" in set comprehensions, as Pythoneers would be tempted to say! On the other hand, for programmers, `|' and `&' are rather natural and easy.
Now that I think of it, I've never seen a mathematician use this at all. But I agree that it's a good choice for programmers.
Eric has offered the idea of adding Cartesian product, and even though the usual notation is a tall thin `X', maybe it would be nice reserving `*' for that?
Mildly in favor, but I wouldn't cry if it didn't happen. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
[Guido]
... I also believe that kjbuckets maintains its data in a sorted order, which is unnecessary for sets -- a hash table is much faster.
It's Zope's BTrees that maintain sorted order. kjbuckets consists of 3 variations ("set", "dict", "graph") of a rather complicated hash table, driven by the need for the graph flavor to associate multiple values with a single key. The kj hash table slots each contain a small contiguous vector, and then it starts to get complicated <wink>.
After all we use a very fast hash table implementation to represent sets. (The only improvement would be that we could save maybe 4 bytes per hash table entry because we don't need a value pointer.)
The set flavor of kjbuckets does skip the value pointer. I suppose it could have supported multisets too ("bags" -- duplicate keys allowed), but it doesn't (they can be faked via a kjgraph using dummy values, though -- much like faking a set in Python via a dict with dummy values!).
... The sets module does not implement analogous operations directly in Python. Almost all the implementation work is done by the dict implementation.
kjbuckets has a rich set of operations coded in C, including intersection, graph composition, and transitive closure. If the sets module were to sprout those operations too, they would have to look like the sets intersection implementation (nests of Python loops and ifs), as Python dicts don't support primitives able to polish off large chunks of the necessary work at C speed. Aaron's claim that kjbuckets can do those kinds of things 10x faster than Python code seems quite safe <wink>.
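For reference, the pure-Python shape Tim is alluding to looks something like this (a sketch consistent with his description, not the actual sets.py code):

    def intersection(self, other):
        little, big = self, other
        if len(little) > len(big):      # iterate over the smaller set
            little, big = big, little
        result = Set()
        for element in little:          # Python-level loop and if ...
            if element in big:          # ... but each test is a dict lookup in C
                result.add(element)
        return result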
... kjbuckets may be nice, but adding it to the core would add a serious new maintenance burden for the core developers. I don't see anyone raising their hand to help out here.
Not me. It's about 3500 lines of hairy code, and you wouldn't like some of the interface decisions it made. For example, a_kjset[3] = 5 adds 3 to the set and ignores 5. Even Aaron blushes at that one <wink>, but some of the others would be much harder to sort out. It would be a major undertaking to do so.
[Guido, eventually arrives at ...]
    def powerset(base):
        pairs = list(enumerate(base))
        size = len(pairs)
        for n in xrange(2**size):
            subset = []
            for e, x in pairs:
                if n & 2**e:
                    subset.append(x)
            yield subset
Now let's rewrite that in modern Python <wink>:

    def powerset(base):
        pairs = [(2**i, x) for i, x in enumerate(base)]
        for i in xrange(2**len(pairs)):
            yield [x for (mask, x) in pairs if i & mask]
This is a generator that yields a series of lists whose values are the items of base. And again, like cartesian product, it's now more a generator thing than a set thing.
Generators are great for constructing families of combinatorial objects, BTW. They can be huge, and in algorithms that use them as inputs for searches, you can often expect not to need more than a minuscule fraction of the entire family, but can't predict how many you'll need. Generators are perfect then.
BTW, the correctness of all my versions trivially derives from the correctness of your version -- each is a very simple transformation of the previous one. My mentor Lambert Meertens calls this process Algorithmics (and has developed a mathematical notation and theory for program transformations).
Was he interested in that before his sabbatical with the SETL folks? The SETL project produced lots of great research in that area, largely driven by the desire to help ultra-high-level SETL programs finish in their lifetimes.
Now let's rewrite that in modern Python <wink>:
    def powerset(base):
        pairs = [(2**i, x) for i, x in enumerate(base)]
        for i in xrange(2**len(pairs)):
            yield [x for (mask, x) in pairs if i & mask]
Honest, I didn't see this before I posted mine. :-)
This is a generator that yields a series of lists whose values are the items of base. And again, like cartesian product, it's now more a generator thing than a set thing.
Generators are great for constructing families of combinatorial objects, BTW. They can be huge, and in algorithms that use them as inputs for searches, you can often expect not to need more than a minuscule fraction of the entire family, but can't predict how many you'll need. Generators are perfect then.
Yup. That's why I strove for these to be generators.
BTW, the correctness of all my versions trivially derives from the correctness of your version -- each is a very simple transformation of the previous one. My mentor Lambert Meertens calls this process Algorithmics (and has developed a mathematical notation and theory for program transformations).
Was he interested in that before his sabbatical with the SETL folks?
Dunno. Maybe it sparked his interest.
The SETL project produced lots of great research in that area, largely driven by the desire to help ultra-high-level SETL programs finish in their lifetimes.
--Guido van Rossum (home page: http://www.python.org/~guido/)
"Eric S. Raymond"
The | & one is distinctly less common than either, at least among mathematicians; I think EEs and suchlike may use it more than we do.
I'm surprised that mathematicians use | and & at all. I had always assumed that these were invented by the programming community, being available ASCII characters used in programming languages, and that mathematicians wouldn't ever use them if they had a choice. But maybe I'm wrong! Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Guido van Rossum
    def product(s1, s2):
        cp = Set()
        for x in s1:
            for y in s2:
                cp.add((x, y))
        return cp
Oh, no. Someone is bound to want set comprehensions, now... Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
pinard@iro.umontreal.ca (François Pinard):
Eric has offered the idea of adding Cartesian product, and even though the usual notation is a tall thin `X', maybe it would be nice reserving `*' for that?
Maybe '%'? It looks a bit X-ish. Or if Python ever gets a dedicated matrix-multiplication operator, maybe that could be reused. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
Greg Ewing
The | & one is distinctly less common than either, at least among mathematicians; I think EEs and suchlike may use it more than we do.
I'm surprised that mathematicians use | and & at all. I had always assumed that these were invented by the programming community, being available ASCII characters used in programming languages, and that mathematicians wouldn't ever use them if they had a choice. But maybe I'm wrong!
Your post crossed one of mine in which, on reflection, I said I'd never seen a mathematician use these. Not even me, not when I'm doing math anyway. I still think in Birkhoff's lattice-theory notation. Nevertheless I'm quite comfortable with | & when programming. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Greg Ewing
pinard@iro.umontreal.ca (François Pinard):
Eric has offered the idea of adding Cartesian product, and even though the usual notation is a tall thin `X', maybe it would be nice reserving `*' for that?
Maybe '%'? It looks a bit X-ish.
Ugh. *No.* -1 on that, I'd look at % and see a string format operator. Suddenly I like Francois's suggestion a lot better. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
On Wednesday 2002-08-21 at 03.05, Eric S. Raymond wrote:
For historical reasons, there are three different notations for Boolean algebra in common use. You're describing the one derived from set theory. I personally favor the one derived from lattice algebra; the distinctive feature of that one is the pointy and/or operators that look like /\ and \/. The third uses | and &.
Uhm, what about + and juxtaposition? They are quite common at least here in Sweden, for boolean algebra. Martin
Martin Sjögren
Uhm, what about + and juxtaposition? They are quite common at least here in Sweden, for boolean algebra.
Is it + for disjunction and juxtaposition for conjunction, or the other way around? Not that I've ever seen either variant. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
On Wednesday 2002-08-21 at 10.28, Eric S. Raymond wrote:
Martin Sjögren
Uhm, what about + and juxtaposition? They are quite common at least here in Sweden, for boolean algebra.
Is it + for disjunction and juxtaposition for conjunction, or the other way around? Not that I've ever seen either variant.
I've often seen it in the context of electronics. a+1 = 1, a0 = 0 and so on. That is, + is disjunction and juxtaposition (or a multiplication dot) is conjunction. Hmm, I just realized that I've also seen it in an American book on discrete maths, so it's not just us Swedes ;) Martin
Martin Sjögren
Is it + for disjunction and juxtaposition for conjunction, or the other way around? Not that I've ever seen either variant.
I've often seen it in the context of electronics. a+1 = 1, a0 = 0 and so on. That is, + is disjunction and juxtaposition (or a multiplication dot) is conjunction.
Makes sense. Hardware designers care a lot about reduction to disjunctive normal form. Much more than logicians do, actually.
Hmm, I just realized that I've also seen it in an American book on discrete maths, so it's not just us Swedes ;)
Odd that I haven't encountered it. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
Martin Sjögren
Uhm, what about + and juxtaposition? They are quite common at least here in Sweden, for boolean algebra.
They're not normally used for sets, though, in my experience (despite the fact that set theory is a boolean algebra:-). Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
"Eric S. Raymond"
Is it + for disjunction and juxtaposition for conjunction, or the other way around?
+ is 'or' and juxtaposition (or sometimes a dot) is 'and' (I prefer those words because they're shorter than 'disjunction' and 'conjunction', and I can remember which is which:-). These are probably where Wirth got the idea of using + and * from in Pascal -- 'or' is considered the 'addition' operator in boolean algebra, and 'and' the 'multiplication' operator. (Although the two are actually completely symmetrical in their algebraic properties, so it's an arbitrary choice.) Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+
[Guido]
... Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP).
FYI, kjbuckets uses '+' (union) and '&' (intersection). '*' is used for graph composition. It so happens that graph composition applied to sets views each set as a graph of self-loops {1, 2, 7} -> {(1, 1), (2, 2), (7, 7)} and the composition of two such self-loop graphs is the self-loop graph of the sets' intersection. So you can view '*' as being a set intersection operation there. It's more useful to compose a graph with a set, in which case you get the subgraph all of whose start-arc nodes are in the set (set * graph), or all of whose end-arc nodes are in the set (graph * set). This is all very handy if you do a lot of it, but getting comfortable with this higher-level view of things is at the other end of a learning curve.
[Guido]
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP).
[Tim]
FYI, kjbuckets uses '+' (union) and '&' (intersection). '*' is used for graph composition.
FYI, ISETL uses '+' and '*' as synonyms for the spelled-out 'inter' and 'union' operators.

Playing with a sample session for possible inclusion in the tutorial, I've found that '|' is not nearly as clear in its intention as '+'.

Raymond Hettinger
-------------------------------------------------------------------

    from sets import Set

    engineers = Set(['John', 'Jane', 'Jack', 'Janice'])
    programmers = Set(['Jack', 'Sam', 'Susan', 'Janice'])
    management = Set(['Jane', 'Jack', 'Susan', 'Zack'])

    employees = engineers | programmers | management  # more clear with '+'
    engineering_management = engineers & programmers
    fulltime_management = management - engineers - programmers

    engineers.add('Marvin')
    print engineers, 'Look, Marvin was added'
    print employees.issuperset(engineers), 'There is a problem'
    print employees, 'Hmm, employees needs an update'
    employees.update(engineers)
    print employees, 'Looks fine now'

    for group in [engineers, programmers, management, employees]:
        group.discard('Susan')
        print group
Playing with a sample session for possible inclusion in the tutorial, I've found that '|' is not nearly as clear in its intention as '+'.
It's way too early to say that. I actually like the fact that | and & are a new vocabulary (for containers at least) so they provide an additional hint that we're dealing with a different kind of container. For sequences, a+b != b+a (unless a==b). For sets, a|b == b|a. That's a useful distinction. + is already used for two distinct purposes: for numbers (where it is symmetric) and for sequences (where it is not). But numbers and sequences are unlikely to be confused because they are used so differently. Sets and lists are both containers, and I think it's useful that their vocabularies don't overlap much. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Raymond Hettinger]
FYI, ISETL uses '+' and '*' as synonyms for the spelled-out 'inter' and 'union' operators.
You realize that reads as if they used '+' for 'inter' and '*' for 'union', right?
Playing with a sample session for possible inclusion in the tutorial, I've found that '|' is not nearly as clear in its intention as '+'.
... engineers = Set(['John', 'Jane', 'Jack', 'Janice']) programmers = Set(['Jack', 'Sam', 'Susan', 'Janice']) management = Set(['Jane', 'Jack', 'Susan', 'Zack'])
employees = engineers | programmers | management # more clear with '+'
I haven't made time to play with the new sets module yet, but it was instantly clear to me just as it was. I think Guido makes a very good point about "+" making it much more confusable with a sequence or numeric operation too. OTOH, I'm rarely a fan of overloaded operators, and suspect I'll tend to use whatever .method() names the module supports (the set modules I've written for my own use never overloaded operators, btw).

One thing did strike me as odd later! If I were to ask this company's HR director what kinds of employees they had, I bet the answer I'd hear is

    well, mostly we have engineers and programmers and management

It seems far less likely I'd hear

    well, mostly we have engineers or programmers or management

and I read "|" as "or". If I heard

    well, mostly we have engineers vertical-bar programmers vertical-bar management

I'd beg to work there for free <wink>.
On Thursday 2002-08-22 at 01.20, Greg Ewing wrote:
Martin Sjögren
Uhm, what about + and juxtaposition? They are quite common at least here in Sweden, for boolean algebra.
They're not normally used for sets, though, in my experience (despite the fact that set theory is a boolean algebra:-).
Nope, not for sets. I have rarely seen anything but \cup and \cap for sets. But they are quite often used when working with boolean algebras in general. Martin
Guido van Rossum
[snip]
no hope that this will ever complete in finite time, but does that mean it shouldn't start? I could write 1L<< [...]
How about lazy sets? E.g. a CartesianProduct could delegate to its two underlying (concrete) sets when checking for membership, and a PowerSet could perform individual member checks for each element in a given subset... Etc. I guess this might be too specific for the library -- subclassing ImmutableSet and overriding the accessors shouldn't be too hard... (The nice thing about including it in the library is that you could produce these things as results from operations on Set and ImmutableSet, e.g. 2**some_set could give a power set or whatever...) -- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
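A sketch of the lazy idea (the class and method bodies here are hypothetical; membership and length delegate to the two concrete sets, and nothing is materialized up front):

    class CartesianProduct:
        """Lazy product of two concrete sets."""
        def __init__(self, s1, s2):
            self.s1, self.s2 = s1, s2
        def __contains__(self, pair):
            try:
                x, y = pair
            except (TypeError, ValueError):
                return False        # non-pairs can't be members
            return x in self.s1 and y in self.s2
        def __len__(self):
            return len(self.s1) * len(self.s2)
        def __iter__(self):         # generate members only on demand
            for x in self.s1:
                for y in self.s2:
                    yield (x, y)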
Greg Ewing
[snip]
Oh, no. Someone is bound to want set comprehensions, now...
That's in the PEP, isn't it? -- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
[snip]
no hope that this will ever complete in finite time, but does that mean it shouldn't start? I could write 1L<< [...]
How about lazy sets? E.g. a CartesianProduct could delegate to its two underlying (concrete) sets when checking for membership, and a PowerSet could perform individual member checks for each element in a given subset... Etc.
Have you got a use case for membership tests of a cartesian product?
I guess this might be too specific for the library -- subclassing ImmutableSet and overriding the accessors shouldn't be too hard...
(The nice thing about including it in the library is that you could produce these things as results from operations on Set and ImmutableSet, e.g. 2**some_set could give a power set or whatever...)
Use case? --Guido van Rossum (home page: http://www.python.org/~guido/)
Eric S. Raymond
[snip]
Makes sense. Hardware designers care a lot about reduction to disjunctive normal form. Much more than logicians do, actually.
Hmm, I just realized that I've also seen it in an American book on discrete maths, so it's not just us Swedes ;)
Odd that I haven't encountered it.
Indeed. I thought this was quite standard when working with digital circuits etc... And -- I don't quite see why we're talking about Boolean algebra in general here, when we're specifically looking for set operators... Oh, well. -- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
Eric S. Raymond
Guido van Rossum
Um, the notation is '|' and '&', not 'or' and 'and', and those are what I learned in school. Seems pretty conventional to me (Greg Wilson actually tried this out on unsuspecting newbies and found that while '+' worked okay, '*' did not -- read the PEP).
+1 on preferring | and & to `or' and `and'. To me, `or' and `and' say that what's being composed are predicates, not sets.
I concur completely. Using 'or' and 'and' seems close to overriding 'is' (although that's impossible, of course) to me. To me, the expression set1 and set2 should return the first set if it is empty, and the second set otherwise. Suddenly having their intersection would be very surprising, I think. For set1 & set2 to return their intersection, however, is very consistent with int1 & int2. -- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
Guido van Rossum
[snip]
Have you got a use case for membership tests of a cartesian product?
Not that I can think of at the moment, no :-) I guess the idea was to use lazy sets for some such operations. Then you could build complex expressions through cartesian products, unions, intersections, set differences, set comprehensions etc. without actually constructing the full set. Checking for membership or iterating over (or even constructing, after all the operations have been applied) such a set might be useful, I'm sure... You could implement joins with cartesian products without terrible performance penalties etc... But I guess this sort of thing might as well go into some other module somewhere (probably outside the libs). It was just a thought. -- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
[Magnus Lie Hetland]
I guess the idea was to use lazy sets for some such operations. Then you could build complex expressions through cartesian products, unions, intersections, set differences, set comprehensions etc. without actually constructing the full set.
Allow me some random thoughts. (Aren't they always random, anyway? :-)

When I saw some of the suggestions on this list for "generating" elements of a cartesian product, despite sometimes elegant, I thought "Too much done, too soon." But the truth is that I did not give the thing a serious try; I'm not sure I would be able to offer anything better.

One nice thing, with a dict or a set, is that we can quickly access how many entries there are in there. Is there some internal way to efficiently fetch the N'th element, from the order in which the keys would be naturally listed? If not, one could always pay some extra temporary memory to build a list of these keys first. If you have to "generate" a cartesian product for N sets, you could set up a compound counter as a list of N indices, the K'th meant to run from 0 up to the cardinality C[K] of the K'th set, and devise simple recipes to yield the element of the product represented by the counter, and to bump it. Moreover, it would be trivial to equip this generator with a `__len__' function able to predict the cardinality CCC of the whole result, and quite easy to transform any KKK between 0 and NNN into an equivalent compound counter, and from there, access any member of the cartesian product at constant speed, without generating it all.

All the above is pretty simple, and meant to introduce a few suggestions that might solve once and for all, if we could do it well enough, a re-occurring request on the Python list about how to produce permutations et al. We might try to rewrite the recipes behind a "generating" cartesian product of many sets, illustrated above, into a similar generating function able to produce all permutations of a single set. So let's say:

    Set([1, 2, 3]).permutations()

would lazily produce the equivalent of:

    Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])

That generator could offer a `__len__' function predicting the cardinality CCC of the result, and some trickery could be used to map integers from 0 to CCC into various permutations. Once on the road, it would be worth offering combinations and arrangements just as well, and maybe also offering the "power" set, I mean here the set of all subsets of a given set, all with predictable cardinality and constant speed access to any member by index.

Yet, many questions or objections may rise. Using tuples to represent each element of a cartesian product of many sets is pretty natural, but it is slightly less clear that a tuple is the best fit for representing an ordered set as in permutations and arrangements, as tuples may allow elements to be repeated, while an ordered set does not. I think that sets are to be preferred over tuples for returning combinations or subsets.

While it is natural to speak and think of subsets of a set, or permutations, arrangements and combinations of a set, some people might prefer to stay closer to an underlying implementation with lists (sublists of a list, permutations, arrangements or combinations of a list), and would feel that going through sets is an unwelcome detour for their applications. Indeed, what's the real use and justification for hashing keys and such things, when one wants nothing else than arrangements from a list?

Another aspect worth some thinking is that permutations, in particular, are mathematical objects in themselves: we can notably multiply permutations or take the inverse of a permutation. Arrangements are in fact permutations over the elements of combinations.
Some thought is surely needed for properly reflecting mathematical elegance into how the set API is extended for the above, and not merely burying that elegance under practical concerns.

Some people may think that these are all problems which are orthogonal to the design of a basic set feature, and which should be addressed in separate Python modules. On the other hand, I think that a better and nicer integration might result if all these things were thought out together, and sooner rather than later. Moreover, my intuition tells me that with some care and luck (both are needed), these extra set features could be small enough additions to the `sets' module not to be worth a separate one. Besides, if appropriate, such facilities would surely add a lot of zest and punch into the interest raised by the `sets' module when it gets published. -- François Pinard http://www.iro.umontreal.ca/~pinard
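For concreteness, the compound-counter idea can be sketched in a few lines; the names `product_length' and `product_member' below are illustrative only, not a proposed API. The index is read as a mixed-radix number whose K'th digit selects an element of the K'th sequence:

    def product_length(sequences):
        """Predict the cardinality of the cartesian product."""
        n = 1
        for seq in sequences:
            n = n * len(seq)
        return n

    def product_member(sequences, index):
        """Return the INDEX'th member of the cartesian product, in
        nested-loop order, without generating the whole product."""
        member = [None] * len(sequences)
        for k in range(len(sequences) - 1, -1, -1):
            index, digit = divmod(index, len(sequences[k]))
            member[k] = sequences[k][digit]
        return tuple(member)

    # product_member(([1, 2], 'ab'), 3) == (2, 'b'), the last of the
    # four members (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b').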
[François]
Allow me some random thoughts. (Aren't they always random, anyway? :-)
Maybe you could contribute some not-so-random code? Words we've got enough. :-)
When I saw some of the suggestions on this list for "generating" elements of a cartesian product, though sometimes elegant, I thought "Too much done, too soon." But the truth is that I did not give the thing a serious try; I'm not sure I would be able to offer anything better.
One nice thing, with a dict or a set, is that we can quickly access how many entries there are in there. Is there some internal way to efficiently fetch the N'th element, from the order in which the keys would be naturally listed? If not, one could always pay some extra temporary memory to build a list of these keys first. If you have to "generate" a cartesian product for N sets, you could set up a compound counter as a list of N indices, the K'th meant to run from 0 up to the cardinality C[K] of the K'th set, and devise simple recipes to yield the element of the product represented by the counter, and to bump it. Moreover, it would be trivial to equip this generator with a `__len__' function able to predict the cardinality CCC of the whole result, and quite easy to transform any KKK between 0 and CCC into an equivalent compound counter, and from there, access any member of the cartesian product at constant speed, without generating it all.
Since the user can easily multiply the lengths of the input sets together, what's the importance of the __len__? And what's the use case for randomly accessing the members of a cartesian product? IMO, the Cartesian product is mostly useful for abstract mathematical thought, not for solving actual programming problems.
All the above is pretty simple, and meant to introduce a few suggestions that might solve once and for all, if we could do it well enough, a recurring request on the Python list about how to produce permutations et al. We might try to rewrite the recipes behind a "generating" cartesian product of many sets, illustrated above, into a similar generating function able to produce all permutations of a single set. So let's say:
Set([1, 2, 3]).permutations()
would lazily produce the equivalent of:
Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])
That generator could offer a `__len__' function predicting the cardinality CCC of the result, and some trickery could be used to map integers from 0 to CCC into various permutations. Once on the road, it would be worth offering combinations and arrangements just as well, and maybe also offering the "power" set, I mean here the set of all subsets of a given set, all with predictable cardinality and constant speed access to any member by index.
Obviously you were inspired here by Eric Raymond's implementation of the powerset generator... But again I ask, what's the practical use of random access to permutations? (I know there are plenty of uses of permutations, but I doubt the need for random access.)
Yet, many questions or objections may arise. Using tuples to represent each element of a cartesian product of many sets is pretty natural, but it is slightly less clear that a tuple is the best fit for representing an ordered set as in permutations and arrangements, as tuples may allow elements to be repeated, while an ordered set does not. I think that sets are to be preferred over tuples for returning combinations or subsets.
You must have temporarily stopped thinking clearly there. Seen as sets, all permutations of a set's elements are the same! The proper output for a generator of permutations is a list; for generality, its input should be any iterable (which includes sets). If the input contains duplicates, well, that's the caller's problem. For combinations, sets are suitable as output, but again, I think it would be just as suitable to take a list and generate lists -- after all, the lists are trivially turned into sets.
While it is natural to speak and think of subsets of a set, or permutations, arrangements and combinations of a set, some people might prefer to stay closer to an underlying implementation with lists (sublists of a list, permutations, arrangements or combinations of a list), and would feel that going through sets is an unwelcome detour for their applications. Indeed, what's the real use and justification for hashing keys and such things, when one wants nothing else than arrangements from a list?
Right.
Another aspect worth some thinking is that permutations, in particular, are mathematical objects in themselves: we can notably multiply permutations or take the inverse of a permutation.
That would be a neat class indeed. How useful it would be in practice remains to be seen. Do you do much ad-hoc permutation calculations?
Arrangements are in fact permutations over the elements of combinations. Some thought is surely needed for properly reflecting mathematical elegance into how the set API is extended for the above, and not merely burying that elegance under practical concerns.
And, on the other hand, practicality beats purity.
Some people may think that these are all problems which are orthogonal to the design of a basic set feature, and which should be addressed in separate Python modules. On the other hand, I think that a better and nicer integration might result if all these things were thought out together, and sooner rather than later. Moreover, my intuition tells me that with some care and luck (both are needed), these extra set features could be small enough additions to the `sets' module not to be worth a separate one. Besides, if appropriate, such facilities would surely add a lot of zest and punch into the interest raised by the `sets' module when it gets published.
I'd rather see the zest added to Python as a whole -- sets are a tiny part, and if you read PEP 218, you'll see that the sets module is only a modest first step of that PEP's program. --Guido van Rossum (home page: http://www.python.org/~guido/)
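A generator along the lines sketched here -- any iterable in, lists out, duplicates being the caller's problem -- might look something like this (an illustration only, not the actual cogen code):

    from __future__ import generators   # needed in Python 2.2

    def permutations(iterable):
        # Lazily yield every permutation of the items, as lists.
        # Sorted input gives lexicographically sorted output.
        items = list(iterable)
        if len(items) <= 1:
            yield items
            return
        for i in range(len(items)):
            rest = items[:i] + items[i+1:]
            for tail in permutations(rest):
                yield [items[i]] + tail

    # list(permutations([1, 2])) == [[1, 2], [2, 1]]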
[Guido van Rossum]
Since the user can easily multiply the length of the input sets together, what's the importance of the __len__? And what's the use case for randomly accessing the members of a cartesian product?
For cartesian product, permutations, combinations, arrangements and power sets, I do not see real use to __len__ or random accessing of members besides explicit looping (or maybe saving an index for later resumption). In my own case, I guess iterators (or generators) would leave me happy enough that I do not really need more.
IMO, the Cartesian product is mostly useful for abstract mathematical thought, not for solving actual programming problems.
One practical application comes to mind. People might progressively use and abuse the paradigm of looping over the members of a cartesian product, instead of relying on nests of embedded loops, one per set.
So let's say: Set([1, 2, 3]).permutations() would lazily produce the equivalent of: Set([(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)])
That generator could offer a `__len__' function predicting the cardinality CCC of the result, and some trickery could be used to map integers from 0 to CCC into various permutations. Once on the road, it would be worth offering combinations and arrangements just as well, and maybe also offering the "power" set, I mean here the set of all subsets of a given set, all with predictable cardinality and constant speed access to any member by index.
Obviously you were inspired here by Eric Raymond's implementation of the powerset generator...
I do not think so. I was inspired by what I remember of the SSP package (something like Social Science Package -- a FORTRAN library; this was before SPSS). It was especially clever about enumerating permutations.
But again I ask, what's the practical use of random access to permutations?
The only one I see is for resuming an interrupted enumeration.
Using tuples to represent each element of a cartesian product of many sets is pretty natural, but it is slightly less clear that a tuple is the best fit for representing an ordered set as in permutations and arrangements, as tuples may allow elements to be repeated, while an ordered set does not. I think that sets are to be preferred over tuples for returning combinations or subsets.
You must have temporarily stopped thinking clearly there.
Maybe not, but I do not always express myself clearly. Sorry.
Seen as sets, all permutations of a set's elements are the same!
It's no use seeing each individual permutation as a set, of course. But the _set_ of all permutations, each of which is a tuple, is meaningful, at least from the fact that permutations are conceptually unordered. However, it might be (sometimes, maybe not that often) useful to enumerate permutations in some canonical order.
For combinations, sets are suitable as output, but again, I think it would be just a suitable to take a list and generate lists -- after all the lists are trivially turned into sets.
Quite agreed.
Another aspect worth some thinking is that permutations, in particular, are mathematical objects in themselves: we can notably multiply permutations or take the inverse of a permutation.
That would be a neat class indeed. How useful it would be in practice remains to be seen. Do you do much ad-hoc permutation calculations?
A few common algorithms also make use of permutations without naming them. `sort' applies a precise permutation, and it is often convenient to invert that permutation to recover the original order after having enriched the resulting structure, say. I quite often resorted to the above trick. Notice that `string.translate' applies a permutation. In another life, I wrote a program (unknown outside CDC Cyber space) that efficiently compared possibly big files, and I remember I had to work a lot with permutation arithmetic. This was a bit specialised, however. I once wrote a C application named `recode' that mainly does charset conversions. It does some arithmetic on permutations at a few places, either for optimisation or while seeking reversibility, and this is really nothing far-fetched or unnatural. I sometimes plan to parallel `recode' with a Python implementation, because from experience, prototyping ideas in C is rather painful, while I foresee it would be far easier in Python. Undoubtedly then, I would formalise permutations.
Some thought is surely needed for properly reflecting mathematical elegance into how the set API is extended for the above, and not merely burying that elegance under practical concerns.
And, on the other hand, practicality beats purity.
Only when both conflict without any hope of resolution. But when practicality and purity can coexist, that's even better. So much better in fact, that it's always worth very seriously trying to seek coexistence. -- François Pinard http://www.iro.umontreal.ca/~pinard
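The sort-then-invert trick mentioned above can be shown in a few lines (the helper names are illustrative only):

    def sort_permutation(values):
        # The permutation that `sort' applies: indices into values,
        # in sorted order.
        pairs = [(values[i], i) for i in range(len(values))]
        pairs.sort()
        return [i for (v, i) in pairs]

    def inverse(perm):
        # The inverse permutation.
        inv = [0] * len(perm)
        for i in range(len(perm)):
            inv[perm[i]] = i
        return inv

    values = [30, 10, 20]
    p = sort_permutation(values)          # [1, 2, 0]
    in_order = [values[i] for i in p]     # [10, 20, 30]
    # ... enrich in_order, then recover the original order:
    q = inverse(p)
    original = [in_order[i] for i in q]   # [30, 10, 20]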
[François Pinard]
Allow me some random thoughts. [...] Some people may think that these are all problems which are orthogonal to the design of a basic set feature, and which should be addressed in separate Python modules.
On Monday 26 August 2002 05:56 pm, François Pinard wrote:
The module could be called `cogen', abbreviation for COmbinatorial GENerators. Here is a first throw, to be criticised and improved.
With sets and possibly cogen being added to the standard library, is it likely that additional interesting math capabilities will creep into the standard Python libraries? Reducing the clutter of the top-level namespace is hard to do if code depends on it, so better to do it right from the start. Do these modules belong at the top level? Would it make sense to change the math module into a package and move the new module inside that package?
[Michael McLay]
On Monday 26 August 2002 05:56 pm, François Pinard wrote:
The module could be called `cogen', abbreviation for COmbinatorial GENerators. Here is a first throw, to be criticised and improved.
With [...] possibly cogen being added to the standard library [...] Would it make sense to change the math module into a package and move the new module inside that package?
For one, I do not feel that `cogen' is especially mathematical, or geared especially towards mathematical or numerical problems. Algorithmic, maybe... But a good part of the Python library already shares that property, doesn't it? :-) -- François Pinard http://www.iro.umontreal.ca/~pinard
    [f(x, y) for x in X for y in Y]

is equivalent to:

    [f(x, y) for x, y in cartesian(X, Y)]

I never found any real use for list comprehensions with more than one dimension. When I use nested loops they are usually for something that cannot be expressed as a list comprehension. Oren
From: "Oren Tirosh"
[f(x, y) for x in X for y in Y]
is equivalent to:
[f(x, y) for x, y in cartesian(X, Y)]
Is the order guaranteed to be the same? Will each work the same for a non-restartable iterator, say a file object (equivalently put, does the second one read Y once or many times)? Would Descartes object to his name being used thusly? Raymond Hettinger
I think of sets as being more closely affiliated with heapq, UserDict, and array
and being less affiliated with math, cmath, random, etc.
Raymond Hettinger
----- Original Message -----
From: "Michael McLay"
[f(x, y) for x in X for y in Y]
is equivalent to:
[f(x, y) for x, y in cartesian(X, Y)]
Hmmm, in other words, cartesian() is a lazy version of zip(). (Given its intended use, I always thought zip() should have been lazy from the beginning, but the BDFL thought otherwise.)

Greg Ewing, Computer Science Dept,   +--------------------------------------+
University of Canterbury,            | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand            | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz           +--------------------------------------+
On Tue, Aug 27, 2002 at 06:14:51PM +1200, Greg Ewing wrote:
[f(x, y) for x in X for y in Y]
is equivalent to:
[f(x, y) for x, y in cartesian(X, Y)]
Hmmm, in other words, cartesian() is a lazy version of zip().
Nope.
    >>> zip([1, 2], ['a', 'b'])
    [(1, 'a'), (2, 'b')]
    >>> list(cartesian([1, 2], ['a', 'b']))
    [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
Oren
On Tue, Aug 27, 2002 at 01:56:36AM -0400, Raymond Hettinger wrote:
From: "Oren Tirosh"
[f(x, y) for x in X for y in Y]
is equivalent to:
[f(x, y) for x, y in cartesian(X, Y)]
Is the order guaranteed to be the same?
Yes.

    """\
    Combinatorial generators.  All generators below have the property
    of yielding successive results in sorted order, given that input
    sequences were already sorted.
    """
Will each work the same for a non-restartable iterator, say a file object (equivalently put, does the second one read Y once or many times)?
They have exactly the same re-iterability wart as nested loops or list comprehensions - an exhausted iterator is indistinguishable from an empty container.
Would Descartes object to his name being used thusly?
The cartesian product is a set operation and therefore has no defined order. When generating it you need some specific order, and this one makes the most sense. If you use it with a 'nested loop' mindset instead of a 'set theory' mindset, René would have had some grounds for objection :-) Oren
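The sorted-order property is easy to check for the two-set case: nested-loop order over sorted inputs is itself sorted, because tuples compare lexicographically. A small illustration:

    pairs = [(x, y) for x in [1, 2] for y in [1, 2, 3]]
    copy = pairs[:]
    copy.sort()
    assert pairs == copy    # nested-loop order is already sorted order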
Oren Tirosh
Hmmm, in other words, cartesian() is a lazy version of zip().
Nope.
    >>> zip([1, 2], ['a', 'b'])
    [(1, 'a'), (2, 'b')]
    >>> list(cartesian([1, 2], ['a', 'b']))
    [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
Sorry, BrainError. In that case, it's probably faster to use the nested loops -- unless cartesian() were implemented in C.

Greg Ewing, Computer Science Dept,   +--------------------------------------+
University of Canterbury,            | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand            | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz           +--------------------------------------+
    def cartesian(*sequences):
        """\
        Generate the `cartesian product' of all SEQUENCES.  Each member
        of the product is a list containing an element taken from each
        original sequence.
        """
        if len(sequences) == 0:
            yield []
        else:
            first, remainder = sequences[0], sequences[1:]
            for element in first:
                for result in cartesian(*remainder):
                    result.insert(0, element)
                    yield result
It occurred to me that this is rather inefficient because it invokes itself recursively many times (once for each element in the first sequence). This version is much faster, because iterating over a built-in sequence (like a list) is much faster than iterating over a generator:

    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]

I also wonder if perhaps ``tail = list(tail)'' should be inserted just before the for loop, so that the arguments may be iterators as well. I would have more suggestions (I expect that Eric Raymond's powerset is much faster than your recursive subsets()) but my family is calling me... --Guido van Rossum (home page: http://www.python.org/~guido/)
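For instance, either version above should produce (a hypothetical session):

    >>> list(cartesian([1, 2], 'ab'))
    [[1, 'a'], [1, 'b'], [2, 'a'], [2, 'b']]

which matches the nested-loop order, except that the members come out as lists rather than tuples.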
On Wed, Aug 28, 2002 at 12:40:30PM +1200, Greg Ewing wrote:
Oren Tirosh
: Hmmm, in other words, cartesian() is a lazy version of zip().
Nope.
    >>> zip([1, 2], ['a', 'b'])
    [(1, 'a'), (2, 'b')]
    >>> list(cartesian([1, 2], ['a', 'b']))
    [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
Sorry, BrainError. In that case, it's probably faster to use the nested loops -- unless cartesian() were implemented in C.
Yes, but a nested loop cannot be easily passed as an argument to a function. Generator functions are pretty efficient, too - yield does not incur the relatively high overhead of Python function calls. Oren
On Tue, Aug 27, 2002 at 09:26:00PM -0400, Guido van Rossum wrote:
It occurred to me that this is rather inefficient because it invokes itself recursively many times (once for each element in the first sequence). This version is much faster, because iterating over a built-in sequence (like a list) is much faster than iterating over a generator:
    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]
My implementation from http://www.tothink.com/python/dataflow/xlazy.py:

    def xcartprod(arg1, *args):
        if not args:
            for x in arg1:
                yield (x,)
        elif len(args) == 1:
            arg2 = args[0]
            for x in arg1:
                for y in arg2:
                    yield x, y
        else:
            for x in arg1:
                for y in xcartprod(args[0], *args[1:]):
                    yield (x,) + y

Special-casing the 2 argument case helps a lot. It brings the performance within 50% of nested loops, which means that if you actually do something inside the loop the overhead is quite negligible. The 'x' prefix is shared with other functions in this module: a lazy xmap, xzip and xfilter.
I also wonder if perhaps ``tail = list(tail)'' should be inserted just before the for loop, so that the arguments may be iterators as well.
Ahh... re-iterability again... This is a good example of a function that *fails silently* for non re-iterable arguments. Slurping the tail into a list loses the lazy efficiency of this function. One of the ways I've used this function is to scan combinations until a condition is satisfied. The iteration is always terminated before reaching the end. Reading ahead may waste computation and memory. All I want is something that will raise an exception if any argument but the first is not re-iterable (e.g. my reiter() proposal). I'll add list() to the argument myself if I really want to. Don't try to guess what I meant. Oren
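Note one small interface difference, by the way: xcartprod yields tuples where the cartesian shown earlier yields lists. A hypothetical session:

    >>> list(xcartprod([1, 2], 'ab'))
    [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]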
Special-casing the 2 argument case helps a lot. It brings the performance within 50% of nested loops, which means that if you actually do something inside the loop the overhead is quite negligible.
Hm, I tried that and found no difference. Maybe I didn't benchmark right.
Ahh... re-iterability again...
This is a good example of a function that *fails silently* for non re-iterable arguments.
This failure is hardly silent IMO: the results are totally bogus, which is a pretty good clue that something's wrong.
Slurping the tail into a list loses the lazy efficiency of this function. One of the ways I've used this function is to scan combinations until a condition is satisfied. The iteration is always terminated before reaching the end. Reading ahead may waste computation and memory.
I don't understand. The Cartesian product calculation has to iterate over the second argument many times (unless you have it iterate over the first argument many times). So a lazy argument won't work. Am I missing something?
All I want is something that will raise an exception if any argument but the first is not re-iterable (e.g. my reiter() proposal). I'll add list() to the argument myself if I really want to. Don't try to guess what I meant.
Actually, I don't want to reiterate this debate. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Wed, Aug 28, 2002 at 10:27:30AM -0400, Guido van Rossum wrote:
Ahh... re-iterability again...
This is a good example of a function that *fails silently* for non re-iterable arguments.
This failure is hardly silent IMO: the results are totally bogus, which is a pretty good clue that something's wrong.
Sure, at the interactive prompt or in very shallow code it is obvious. Exceptions are noisy. Anything else is silent.
Slurping the tail into a list loses the lazy efficiency of this function. One of the ways I've used this function is to scan combinations until a condition is satisfied. The iteration is always terminated before reaching the end. Reading ahead may waste computation and memory.
I don't understand. The Cartesian product calculation has to iterate over the second argument many times (unless you have it iterate over the first argument many times). So a lazy argument won't work. Am I missing something?
Even if all the arguments are re-iterable containers the recursive call produces a lazy generator object - the cartesian product of the tail. I don't want to read it eagerly into a list. Oren
Even if all the arguments are re-iterable containers the recursive call produces a lazy generator object - the cartesian product of the tail. I don't want to read it eagerly into a list.
And I wasn't proposing that.

    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            tail = list(tail)   # <--- This is what I was proposing
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Oren Tirosh]
My implementation from http://www.tothink.com/python/dataflow/xlazy.py: [...] Special-casing the 2 argument case helps a lot.
Good idea! I added such speed-up code for cartesians, subsets and permutations, and am now seeking how to do it (nicely) for combinations and arrangements as well.
Slurping the tail into a list loses the lazy efficiency of this function.
Generators are an elegant way to be lazy. I agree that we are likely to lose something if we attempt to do too much, too soon. -- François Pinard http://www.iro.umontreal.ca/~pinard
On Wed, Aug 28, 2002 at 11:02:25AM -0400, Guido van Rossum wrote:
Even if all the arguments are re-iterable containers the recursive call produces a lazy generator object - the cartesian product of the tail. I don't want to read it eagerly into a list.
And I wasn't proposing that.
    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            tail = list(tail)   # <--- This is what I was proposing
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]
Silly me. Too much LISPthink made me automatically see "head" as the first item and "tail" as the rest. Now I see that the head is all but the last item and the tail is the last. It was really funny - like seeing the cup change into two faces right in front of your eyes... Oren
[Guido van Rossum]
def cartesian(*sequences): [...]
It occurred to me that this is rather inefficient because it invokes itself recursively many times (once for each element in the first sequence). This version is much faster, because iterating over a built-in sequence (like a list) is much faster than iterating over a generator:
Granted, thanks! I postponed optimisations for the first draft of `cogen', but if it looks acceptable overall, we can try to get some speed out of it now.
    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]
I also wonder if perhaps ``tail = list(tail)'' should be inserted just before the for loop, so that the arguments may be iterators as well.
`cogen' does not make any special effort for protecting iterators given as input sequences. `cartesian' is surely not the only place where iterators would create a problem. Solving it as a special case for `cartesian' only is not very nice. Of course, we might transform all sequences into lists everywhere in `cogen', but `list'-ing a list copies it, and I'm not sure this would be such a good idea in the end. Best may be to let the user explicitly transform input iterators into lists by calling `list' on those arguments. That might be an acceptable compromise.
I expect that Eric Raymond's powerset is much faster than your recursive subsets() [...]
Very likely, yes. I started `cogen' with algorithms looking a bit all alike, and did not look at speed. OK, I'll switch. I do not have Eric's algorithm handy, but if I remember well, it merely mapped successive integers to subsets by associating each bit with an element. -- François Pinard http://www.iro.umontreal.ca/~pinard
Granted, thanks! I postponed optimisations for the first draft of `cogen', but if it looks acceptable overall, we can try to get some speed of it now.
I'm not saying that it looks good overall -- I'd like to defer to Tim, who has used and written this kind of utilities for real, and who probably has a lot of useful feedback. Right now, he's felled by some kind of illness though.
    def cartesian(*sequences):
        if len(sequences) == 0:
            yield []
        else:
            head, tail = sequences[:-1], sequences[-1]
            for x in cartesian(*head):
                for y in tail:
                    yield x + [y]
I also wonder if perhaps ``tail = list(tail)'' should be inserted just before the for loop, so that the arguments may be iterators as well.
`cogen' does not make any special effort for protecting iterators given as input sequences. `cartesian' is surely not the only place where iterators would create a problem. Solving it as a special case for `cartesian' only is not very nice. Of course, we might transform all sequences into lists everywhere in `cogen', but `list'-ing a list copies it, and I'm not sure this would be such a good idea in the end.
Hm. All but the last will be iterated over many times. In practice the inputs will be relatively small (I can't imagine using this for sequences with 100s of 1000s of elements). Or you might sniff the type and avoid the copy if you know it's a list or tuple. Or you might use Oren's favorite rule of thumb and listify it when iter(x) is iter(x) (or iter(x) is x).
Best may be to let the user explicitly transform input iterators into lists by calling `list' on those arguments. That might be an acceptable compromise.
Maybe.
I expect that Eric Raymond's powerset is much faster than your recursive subsets() [...]
Very likely, yes. I started `cogen' with algorithms looking a bit all alike, and did not look at speed. OK, I'll switch. I do not have Eric's algorithm handy, but if I remember well, it merely mapped successive integers to subsets by associating each bit with an element.
    def powerset(base):
        """Powerset of an iterable, yielding lists."""
        pairs = [(2**i, x) for i, x in enumerate(base)]
        for n in xrange(2**len(pairs)):
            yield [x for m, x in pairs if m&n]

--Guido van Rossum (home page: http://www.python.org/~guido/)
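A hypothetical session, to show the order this produces -- subsets come out by counting through the bit masks, not in the lexicographic order discussed below (note that ['b'] precedes ['a', 'b']):

    >>> list(powerset('ab'))
    [[], ['a'], ['b'], ['a', 'b']]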
[Guido van Rossum]
Right now, [Tim] is felled by some kind of illness though.
For the mere crumbs falling off of his project table, he seems to be working like hell, so no surprise his body asks him to slow down once in a while. I hope he will soon be healthy and happy again! Life is not the same when he is away...
All but the last will be iterated over many times. In practice the inputs will be relatively small (I can't imagine using this for sequences with 100s of 1000s elements).
Do not under-estimate statisticians! They are well known for burning oodles and oodles of computer cycles, contemplating or combining data sets which are not always small, in all ways imaginable. Of course, even statisticians cannot afford all subsets of sets having 100 elements, or will not scan permutations of 1000 elements. But 100s or 1000s of elements are well within the bounds of cartesian products.
Or you might use Oren's favorite rule of thumb and listify it when iter(x) is iter(x) (or iter(x) is x).
I'm a bit annoyed by the idea that `iter(x)' might require some computation for producing an iterator, and that we immediately throw away the result. Granted that `__iter__(self): return self' is efficient when an object is an iterator, but nowhere it is said that `__iter__' has to be efficient when the object is a container, and it does not shock me that some complex containers require time to produce their iterator. I much prefer limiting the use of `__iter__' to when one intends to use the iterator...
    def powerset(base):
        """Powerset of an iterable, yielding lists."""
        pairs = [(2**i, x) for i, x in enumerate(base)]
        for n in xrange(2**len(pairs)):
            yield [x for m, x in pairs if m&n]
Thanks! Hmph! This does not yield the subsets in "sorted order", like the other `cogen' methods do, and I would prefer to keep that promise. Hopefully, the optimisation I added this morning will make both algorithms more comparable, speed-wise. I should benchmark them to see. -- François Pinard http://www.iro.umontreal.ca/~pinard
[Guido]
I'm not saying that it looks good overall -- I'd like to defer to Tim, who has used and written this kind of utilities for real, and who probably has a lot of useful feedback. Right now, he's felled by some kind of illness though.
I think that's over! I'm very tired, though (couldn't get to sleep until 11, and woke up at 2 with the last, umm, episode <wink>).

This is a Big Project if done right. I volunteered time for it a few years ago, but there wasn't enough interest then to keep it going. I'll attach the last publicly-distributed module I had then, solely devoted to combinations. It was meant to be the first in a series, all following some basic design decisions:

+ A Basic class that doesn't compromise on speed, typically by working on canonical representatives in Python list-of-int form.

+ A more general class that deals with arbitrary sequences, perhaps at great loss of efficiency.

+ Multiple iterators are important: lex order is needed sometimes; Gray code order is an enormous help sometimes; random generation is vital sometimes.

+ State-of-the-art algorithms. That's a burden for anything that goes into the core -- if it's a toy algorithm, users can do just as well on their own, and then people submit patch after patch that the original author isn't really qualified to judge (else they would have done a state-of-the-art thing to begin with).

+ The ability to override the random number generator. Python's default WH generator is showing its age as machines get faster; it's simply not adequate anymore for long-running programs making heavy use of it on a fast box. Combinatorial algorithms in particular do tend to make heavy use of it. (Speaking of which, "someone" should look into grabbing one of the Mersenne Twister extensions for Python -- that's the current state of *that* art.)

Ideas not worth taking:

+ Leave the chi-square algorithm out of it. A better implementation would be nice to have in a statistics package, but it doesn't belong here regardless.

me-i'm-going-back-to-sleep-ly y'rs - tim
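As an aside, the Gray code order mentioned above is easy to sketch for subsets (a minimal illustration, not the attached module; `gray_subsets' is an invented name). Consecutive subsets differ by exactly one element, because i ^ (i >> 1) is the standard binary-reflected Gray code:

    from __future__ import generators   # needed in Python 2.2

    def gray_subsets(items):
        # Yield every subset of items as a list, in Gray code order:
        # each subset differs from the previous one by one element.
        n = len(items)
        for i in xrange(2 ** n):
            code = i ^ (i >> 1)
            yield [items[k] for k in range(n) if code & (1 << k)]

    # list(gray_subsets('ab')) == [[], ['a'], ['a', 'b'], ['b']]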
Of course, even statisticians cannot afford all subsets of sets having 100 elements, or will not scan permutations of 1000 elements. But 100s or 1000s elements are well within the bounds of cartesian products.
Yes, and there the cost of an extra list() copy is negligible (allocate and copy 4K bytes).
Or you might use Oren's favorite rule of thumb and listify it when iter(x) is iter(x) (or iter(x) is x).
I'm a bit annoyed by the idea that `iter(x)' might require some computation for producing an iterator, and that we immediately throw away the result. Granted that `__iter__(self): return self' is efficient when an object is an iterator, but nowhere it is said that `__iter__' has to be efficient when the object is a container, and it does not shock me that some complex containers require time to produce their iterator. I much prefer limiting the use of `__iter__' to when one intends to use the iterator...
Yes, that's why I prefer to just make a list() copy.
    def powerset(base):
        """Powerset of an iterable, yielding lists."""
        pairs = [(2**i, x) for i, x in enumerate(base)]
        for n in xrange(2**len(pairs)):
            yield [x for m, x in pairs if m&n]
Thanks! Hmph! This does not yield the subsets in "sorted order", like the other `cogen' methods do, and I would prefer to keep that promise.
That may be a matter of permuting the bits?
Hopefully, the optimisation I added this morning will make both algorithms more comparable, speed-wise. I should benchmark them to see.
Yes! Nothing beats a benchmark as an eye-opener. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Wed, Aug 28, 2002 at 03:41:00PM -0400, Guido van Rossum wrote:
Or you might use Oren's favorite rule of thumb and listify it when iter(x) is iter(x) (or iter(x) is x).
I'm a bit annoyed by the idea that `iter(x)' might require some computation for producing an iterator, and that we immediately throw away the result. Granted that `__iter__(self): return self' is efficient when an object is an iterator, but nowhere it is said that `__iter__' has to be efficient when the object is a container, and it does not shock me that some complex containers require time to produce their iterator. I much prefer limiting the use of `__iter__' to when one intends to use the iterator...
Yes, that's why I prefer to just make a list() copy.
Oh, come on you two... stop beating up the poor strawman. The reiter() function doesn't make a single redundant call to __iter__. It's just like iter() but ensures in passing that the result is really a fresh iterator. Oren
Oh, come on you two... stop beating up the poor strawman. The reiter() function doesn't make a single redundant call to __iter__. It's just like iter() but ensures in passing that the result is really a fresh iterator.
I'm really sorry. I had forgotten what exactly your proposal was. It is actually very reasonable:

Proposal: new built-in function reiter()

    def reiter(obj):
        """reiter(obj) -> iterator

        Get an iterator from an object.  If the object is already an
        iterator a TypeError exception will be raised.  For all Python
        built-in types it is guaranteed that if this function succeeds
        the next call to reiter() will return a new iterator that
        produces the same items unless the object is modified.
        Non-builtin iterable objects which are not iterators SHOULD
        support multiple iteration returning the same items."""
        it = iter(obj)
        if it is obj:
            raise TypeError('Object is not re-iterable')
        return it

--Guido van Rossum (home page: http://www.python.org/~guido/)
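If the proposal were adopted, the intended behavior would look like this (a sketch using the reiter() defined above):

    it = reiter([1, 2, 3])   # fine: a list hands out fresh iterators
    try:
        reiter(it)           # an iterator is its own iterator
    except TypeError:
        print 'not re-iterable'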
TP> + The ability to override the random number generator. Python's
TP> default WH generator is showing its age as machines get
TP> faster; it's simply not adequate anymore for long-running
TP> programs making heavy use of it on a fast box. Combinatorial
TP> algorithms in particular do tend to make heavy use of it.
TP> (Speaking of which, "someone" should look into grabbing one of
TP> the Mersenne Twister extensions for Python -- that's the
TP> current state of *that* art).

The last time we talked about random number generation, I remember finding a tiny algorithm by Pierre L'Ecuyer based on a recommendation from Luc Devroye. (That's a good pedigree!) Here's an almost equally tiny C extension that wraps up the algorithm. We should do a real test of it. Last time I checked, it wasn't obvious how to actually run the DIEHARD tests. Jeremy
[Guido van Rossum]
def powerset(base):
This does not yield the subsets in "sorted order", like the other `cogen' methods do, and I would prefer to keep that promise.
That may be a matter of permuting the bits?
I looked at them: nothing evident pops up. I tried inverting, reversing, and both, in the hope of spotting some magic property, to no avail. -- François Pinard http://www.iro.umontreal.ca/~pinard
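For reference, the straightforward recursive shape that does yield subsets in lexicographic order -- presumably close to what the cogen subsets() already does, at the usual recursive-generator cost -- is:

    from __future__ import generators   # needed in Python 2.2

    def subsets(items):
        # Yield every subset of the sorted items as a list,
        # in lexicographic order.
        yield []
        for i in range(len(items)):
            for rest in subsets(items[i+1:]):
                yield [items[i]] + rest

    # list(subsets('ab')) == [[], ['a'], ['a', 'b'], ['b']]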
[Jeremy Hylton]
The last time we talked about random number generation, I remember finding a tiny algorithm by Pierre L'Ecuyer based on a recommendation from Luc Devroye. (That's a good pedigree!) Here's an almost equally tiny C extension that wraps up the algorithm.
We should do a real test of it. Last time I checked, it wasn't obvious how to actually run the DIEHARD tests.
It still isn't, but DIEHARD is likely obsolete now. Testing for randomness has become a full-blown science in its own right. Your government is happy to give you a bunch of more modern randomness tests developed on a Sun, complete with a multi-hundred page testing manual every word of which is vitally important <0.8 wink>: http://csrc.nist.gov/rng/ Note that the Mersenne Twister is likely substantially faster than the little C program (e.g, it doesn't need division, and on some platforms is reported to be faster than the uselessly simple-minded C rand()), is provably equi-distributed through 623 dimensions (linear congruential generators are damned lucky to get 6), has a period of nearly 2**20000, and is probably the most widely tested generator in existence now. Knuth was reported as saying "well, I guess that about wraps it up for random number generation!", although I'd be more likely to believe L'Ecuyer or Marsaglia on this particular topic <wink>.
Ideas for the day:

1. Optimize BaseSet._update(iterable) by checking for two special cases where a C-speed update method is already available and the entries are known in advance to be immutable:

    . . .
    if isinstance(iterable, BaseSet):
        self._data.update(iterable._data)
        return
    if isinstance(iterable, dict):
        self._data.update(iterable)
        return
    . . .

2. Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using it, directly or by coercing to a set, as an iterable:
    >>> Set('abracadabra') | 'alacazam'
    Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
This improves usability because the second argument did not have to be pre-wrapped with Set. It improves speed, for some operations, by using the iterable directly and not having to build an equivalent dictionary. 3. Have ImmutableSet keep a reference to the original iterable. Add an ImmutableSet.refresh() method that rebuilds ._data from the iterable. Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible. The goal is to improve the usability of sets of sets where the inner sets have been updated after the outer set was created.
    >>> inner = Set('abracadabra')
    >>> outer = Set([inner])
    >>> inner.add('z')      # now the outer set is out-of-date
    >>> outer.refresh()     # now it is current
    >>> outer
    Set(['a', 'c', 'r', 'z', 'b', 'd'])
This would only work for restartable iterables -- a file object would not be so easily refreshed. Raymond Hettinger
3. Have ImmutableSet keep a reference to the original iterable. Add an ImmutableSet.refresh() method that rebuilds ._data from the iterable. Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible. The goal is to improve the usability of sets of sets where the inner sets have been updated after the outer set was created.
    >>> inner = Set('abracadabra')
    >>> outer = Set([inner])
    >>> inner.add('z')      # now the outer set is out-of-date
    >>> outer.refresh()     # now it is current
    >>> outer
    Set(['a', 'c', 'r', 'z', 'b', 'd'])
Make that: Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
TP> + The ability to override the random number generator. Python's
TP> default WH generator is showing its age as machines get
TP> faster; it's simply not adequate anymore for long-running
TP> programs making heavy use of it on a fast box. Combinatorial
TP> algorithms in particular do tend to make heavy use of it.
TP> (Speaking of which, "someone" should look into grabbing one of
TP> the Mersenne Twister extensions for Python -- that's the
TP> current state of *that* art).
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python. ----------------------------------------------------------- David Abrahams * Boost Consulting dave@boost-consulting.com * http://www.boost-consulting.com
From: "David Abrahams"
TP> + The ability to override the random number generator. Python's
TP> default WH generator is showing its age as machines get
TP> faster; it's simply not adequate anymore for long-running
TP> programs making heavy use of it on a fast box. Combinatorial
TP> algorithms in particular do tend to make heavy use of it.
TP> (Speaking of which, "someone" should look into grabbing one of
TP> the Mersenne Twister extensions for Python -- that's the
TP> current state of *that* art).
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python.
I'm willing to implement this one. Raymond
1. Optimize BaseSet._update(iterable) by checking for two special cases where a C-speed update method is already available and the entries are known in advance to be immutable:
    . . .
    if isinstance(iterable, BaseSet):
        self._data.update(iterable._data)
        return
    if isinstance(iterable, dict):
        self._data.update(iterable)
        return
    . . .
Yes.
2. Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using it, directly or by coercing to a set, as an iterable:
    >>> Set('abracadabra') | 'alacazam'
    Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
This improves usability because the second argument did not have to be pre-wrapped with Set. It improves speed, for some operations, by using the iterable directly and not having to build an equivalent dictionary.
No. This has been proposed before. I think it's a bad idea, just as [1,2,3] + "abc" is a bad idea. If you want this, it's easy enough to do

    s = Set('abracadabra')
    s.update('alacazam')
3. Have ImmutableSet keep a reference to the original iterable. Add an ImmutableSet.refresh() method that rebuilds ._data from the iterable. Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible. The goal is to improve the usability of sets of sets where the inner sets have been updated after the outer set was created.
    >>> inner = Set('abracadabra')
    >>> outer = Set([inner])
    >>> inner.add('z')      # now the outer set is out-of-date
    >>> outer.refresh()     # now it is current
    >>> outer
    Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
This would only work for restartable iterables -- a file object would not be so easily refreshed.
This *appears* to be messing with the immutability. If I wrote:

    a = range(3)
    s1 = ImmutableSet(a)
    s2 = Set([s1])
    a.append(4)
    s2.refresh()

What would the value of s1 be? I think I understand your use case (the example in the docs, where an employee is added), but I think we should think harder about what to do about that. Possibly it's not a good example of how sets are used (even if it's a good example of how sets work). --Guido van Rossum (home page: http://www.python.org/~guido/)
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python.
I'm willing to implement this one.
Please do! (Have you got much experience with random number generation?) --Guido van Rossum (home page: http://www.python.org/~guido/)
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python.
I'm willing to implement this one.
Please do! (Have you got much experience with random number generation?)
Yes, but my experience is out-of-date. I've read Knuth (esp the part on testing generators), done numerical analysis, written simulations and high-end crypto, etc. The Mersenne Twister algorithm is new to me -- studying it is part of my motivation to volunteer to implement it.
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python.
I'm willing to implement this one.
Please do! (Have you got much experience with random number generation?)
Yes, but my experience is out-of-date. I've read Knuth (esp the part on testing generators), done numerical analysis, written simulations and high-end crypto, etc. The Mersenne Twister algorithm is new to me -- studying it is part of my motivation to volunteer to implement it.
Cool! You & Tim will have something to talk about. --Guido van Rossum (home page: http://www.python.org/~guido/)
2. Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using it, directly or by coercing to a set, as an iterable:
    >>> Set('abracadabra') | 'alacazam'
    Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
This improves usability because the second argument did not have to be pre-wrapped with Set. It improves speed, for some operations, by using the iterable directly and not having to build an equivalent dictionary.
No. This has been proposed before. I think it's a bad idea, just as
[1,2,3] + "abc"
is a bad idea.
I see the wisdom in preventing weirdness. The real motivation was to get sets.py to play nicely with other set implementations. Right now, it can only interact with instances of BaseSet. And, even if someone subclasses BaseSet, they currently *must* have a self._data attribute that is a dictionary. This prevents non-dictionary based extensions.
3. Have ImmutableSet keep a reference to the original iterable. Add an ImmutableSet.refresh() method that rebuilds ._data from the iterable. Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible. The goal is to improve the usability of sets of sets where the inner sets have been updated after the outer set was created.
    >>> inner = Set('abracadabra')
    >>> outer = Set([inner])
    >>> inner.add('z')      # now the outer set is out-of-date
    >>> outer.refresh()     # now it is current
    >>> outer
    Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
This would only work for restartable iterables -- a file object would not be so easily refreshed.
This *appears* to be messing with the immutability. If I wrote:
    a = range(3)
    s1 = ImmutableSet(a)
    s2 = Set([s1])
    a.append(4)
    s2.refresh()
What would the value of s1 be?
Hmm, I intended to have s1.refresh() return a new object for use in s2 while leaving s1 alone (being immutable and all). Now, I wonder if that was the right thing to do. The answer lies in use cases for algorithms that need sets of sets. If anyone knows off the top of their head that would be great; otherwise, I seem to remember that some of that business was found in compiler algorithms and graph packages.
(How about limiting our lines to 72 characters?)
Eliminate the binary sanity checks which verify for operators that 'other' is a BaseSet. If 'other' isn't a BaseSet, try using it, directly or by coercing to a set, as an iterable:
    >>> Set('abracadabra') | 'alacazam'
    Set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
This improves usability because the second argument did not have to be pre-wrapped with Set. It improves speed, for some operations, by using the iterable directly and not having to build an equivalent dictionary.
No. This has been proposed before. I think it's a bad idea, just as
[1,2,3] + "abc"
is a bad idea.
I see the wisdom in preventing weirdness. The real motivation was to get sets.py to play nicely with other set implementations. Right now, it can only interact with instances of BaseSet. And, even if someone subclasses BaseSet, they currently *must* have a self._data attribute that is a dictionary. This prevents non-dictionary based extensions.
I've thought of (and I think I even posted) a different way to accomplish the latter, *if* *and* *when* it becomes necessary. Here it is again:

- BaseSet becomes a true abstract class. I don't care if it has dummy methods that raise NotImplementedError, but the set of operations it stands for should be documented. I propose that it should stand for only the published operations of ImmutableSet. Other set implementations can then derive from BaseSet.

- The implementation currently in BaseSet is moved to a new internal class, e.g. _CoreSet, which derives from BaseSet.

- Set and ImmutableSet derive from _CoreSet.

- The binary operators (and sundry other places as needed) make *two* checks for the 'other' argument:

  - If it is a _CoreSet instance, do what's currently done, taking a shortcut knowing the implementation.

  - Otherwise, if it is a BaseSet instance, implement the operation using only the published set API.

Example:

    def __or__(self, other):
        if isinstance(other, _CoreSet):
            result = self.__class__(self._data)
            result._data.update(other._data)
            return result
        if isinstance(other, BaseSet):
            result = self.__class__(self._data)
            result.update(other)
            return result
        return NotImplemented

This effectively makes BaseSet a protocol. I realize that there is some resistance to using inheritance from a designated abstract base class as a way to indicate that a class implements a given protocol; but since we don't have other solutions in place, I think this is a reasonable solution. Trying to sniff whether the other argument implements a set protocol by testing the presence of specific APIs seems awkward, especially since most set APIs (__or__ etc.) are heavily overloaded by types that aren't sets at all.
Have ImmutableSet keep a reference to the original iterable. Add an ImmutableSet.refresh() method that rebuilds ._data from the iterable. Add a Set.refresh() method that triggers ImmutableSet.refresh() where possible. The goal is to improve the usability of sets of sets where the inner sets have been updated after the outer set was created.
    >>> inner = Set('abracadabra')
    >>> outer = Set([inner])
    >>> inner.add('z')      # now the outer set is out-of-date
    >>> outer.refresh()     # now it is current
    >>> outer
    Set([ImmutableSet(['a', 'c', 'r', 'z', 'b', 'd'])])
This would only work for restartable iterables -- a file object would not be so easily refreshed.
This *appears* to be messing with the immutability. If I wrote:
    a = range(3)
    s1 = ImmutableSet(a)
    s2 = Set([s1])
    a.append(4)
    s2.refresh()
What would the value of s1 be?
Hmm, I intended to have s1.refresh() return a new object for use in s2 while leaving s1 alone (being immutable and all). Now, I wonder if that was the right thing to do. The answer lies in use cases for algorithms that need sets of sets. If anyone knows off the top of their head that would be great; otherwise, I seem to remember that some of that business was found in compiler algorithms and graph packages.
Let's call YAGNI on this one. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Raymond Hettinger]
... Hmm, I intended to have s1.refresh() return a new object for use in s2 while leaving s1 alone (being immutable and all). Now, I wonder if that was the right thing to do. The answer lies in use cases for algorithms that need sets of sets. If anyone knows off the top of their head that would be great; otherwise, I seem to remember that some of that business was found in compiler algorithms and graph packages.
There's no real use case I know of for having a mutation of a set element propagate to the set containing it. Sets in Python are collections of values, not collections of object ids (sets in Icon are collections of object ids, and, e.g., Set([[], []]) in Icon is a set with two elements). Value semantics darned near require copying, or fancier copy on write, under the covers, and value semantics are most useful for sets of sets. Once the value has been established, you want to guarantee it never changes, not make it easy to change it by accident <wink>.
I'm sketching out an approach to the Mersenne Twister and
wanted to make sure it is in line with what you want.
-- Write it in pure python as a drop-in replacement for Wichman-Hill.
-- Add a version number argument to Random() which defaults to two.
If set to one, use the old generator so that it is possible to recreate
sequences from earlier versions of Python. Note, the code is much
shorter if we drop this requirement. On the plus side, it gives more
than backward compatibility, it gives the ability to re-run a
simulation with another generator to assure that the result isn't
a fluke related to a generator design flaw.
-- Document David Abrahams's link to
http://www.boost.org/boost/random/mersenne_twister.hpp as the
reference implementation and
http://www.math.keio.ac.jp/matumoto/emt.html as a place for
more information. Key off of the MT19937 version as the most
recent stable evolution.
-- Move the existing in-module test-suite into a unittest. Add a new,
separate unittest suite with tests specific to MT (recreating a few
sequences produced by reference implementations) and with a battery
of Knuth style tests. The validation results are at:
http://www.math.keio.ac.jp/matumoto/ver991029.html
-- When we're done, have a python link put on the Mersenne Twister
Home Page (the second link above).
-- Write, test and document the generator first. Afterwards, explore
techniques for creating multiple independent streams:
http://www.math.h.kyoto-u.ac.jp/~matumoto/RAND/DC/dc.html
Raymond
----- Original Message -----
From: "Guido van Rossum"
FWIW, in case "someone" cares: http://www.boost.org/libs/random/index.html It's a nice library architecture, designed and implemented by people who know the domain, and I think it should be applicable to Python.
I'm willing to implement this one.
Please do! (Have you got much experience with random number generation?)
Yes, but my experience is out-of-date. I've read Knuth (esp the part on testing generators), done numerical analysis, written simulations and high-end crypto, etc. The Mersenne Twister algorithm is new to me -- studying it is part of my motivation to volunteer to implement it.
Cool! You & Tim will have something to talk about.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Why not wrap the existing C implementation? I think a wrapper has two advantages. We get to reuse the existing implementation, without worry for transliteration errors. We also get better performance. Jeremy
-- Write it in pure python as a drop-in replacement for Wichman-Hill.
Yup. I think the seed arguments are different though -- MT takes a single int, while whrandom takes three ints in range(256).
-- Add a version number argument to Random() which defaults to two. If set to one, use the old generator so that it is possible to recreate sequences from earlier versions of Python. Note, the code is much shorter if we drop this requirement. On the plus side, it gives more than backward compatibility, it gives the ability to re-run a simulation with another generator to assure that the result isn't a fluke related to a generator design flaw.
I think this is useful. But I'd like to hear what Tim has to say.
-- Document David Abrahams's link to http://www.boost.org/boost/random/mersenne_twister.hpp as the reference implementation and
Hm. What part of that file contains the actual algorithm? I guess the function void mersenne_twister
http://www.math.keio.ac.jp/matumoto/emt.html as a place for more information. Key off of the MT19937 version as the most recent stable evolution.
Sure. It would be nice to have at least *some* documentation in-line in case those links disappear. Maybe you can quote the relevant C++ code from the Boost version (with attribution) in a comment.
-- Move the existing in-module test suite into a unittest. Add a new, separate unittest suite with tests specific to MT (recreating a few sequences produced by reference implementations) and with a battery of Knuth-style tests. The validation results are at: http://www.math.keio.ac.jp/matumoto/ver991029.html
It might be fun to have some heavy duty tests (which take hours or days to run) checked in but not run by default. We usually do this by not naming the test file test_foo.py; it can then be run manually.
-- When we're done, have a Python link put on the Mersenne Twister Home Page (the second link above).
Sounds like they would be only too eager to comply. :-)
-- Write, test and document the generator first. Afterwards, explore techniques for creating multiple independent streams: http://www.math.h.kyoto-u.ac.jp/~matumoto/RAND/DC/dc.html
Isn't that trivial if you follow the WH implementation strategy which stores all the state in a class instance? --Guido van Rossum (home page: http://www.python.org/~guido/)
[Raymond Hettinger]
... but my experience is out-of-date. I've read Knuth (esp the part on testing generators),
Knuth is behind the times here. Better: http://csrc.nist.gov/rng/ But if you're folding in the Twister, you don't have to test ab initio -- you just have to make sure the test vector they supply produces the results they say it should.
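As a concrete sketch of that kind of check (written against the modern random module, which does ship the Twister): the reference values are deliberately left blank below, since they must be copied from the authors' published output rather than invented, and the seeding path must match the reference run for the comparison to mean anything.

    import unittest
    import random

    class TwisterValidation(unittest.TestCase):
        def test_reproducibility(self):
            # Two generators seeded identically must produce
            # identical streams.
            a = random.Random(42)
            b = random.Random(42)
            self.assertEqual([a.random() for _ in range(1000)],
                             [b.random() for _ in range(1000)])

        def test_reference_vector(self):
            # Skeleton only: fill in outputs published for the
            # reference implementation, generated through the same
            # seeding routine as the line below.
            expected = []  # e.g. the first few 32-bit outputs
            if not expected:
                self.skipTest("reference vector not filled in")
            gen = random.Random(12345)  # seed must match the reference run
            self.assertEqual([gen.getrandbits(32) for _ in expected],
                             expected)

    if __name__ == "__main__":
        unittest.main()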
done numerical analysis, written simulations and high-end crypto, etc. The Mersenne Twister algorithm is new to me -- studying it is part of my motivation to volunteer to implement it.
It's been implemented for Python several times already (in Python code, and as a C extension). This is more of an integration and API task than a write-code-from-scratch task. Do visit the authors' home page: http://www.math.keio.ac.jp/~matumoto/emt.html Indeed, the authors have been reduced <wink> to announcing "yet another Python module ...".

Note that a subtle weakness was discovered in the seed initialization early this year, so that implementations older than that are suspect on this count. BTW, Knuth's lagged Fibonacci generator turned out to have the same kind of initialization weakness, and was corrected in the ninth printing of Vol 2: http://www-cs-faculty.stanford.edu/~knuth/news.html

If this stuff interests you <wink>, Ivan Frohne (a statistician) wrote a wonderful pure-Python random-number package several years ago, including a pure-Python implementation of the Twister, and several other stronger-than-WH (0, 1) base generators. It's hard to keep track of that package -- and of Ivan. This may be the most recent version: http://www.frohne.westhost.com/
[Jeremy]
Why not wrap the existing C implementation? I think a wrapper has two advantages. We get to reuse the existing implementation without worrying about transliteration errors. We also get better performance.
On the plus side, it gives a chance to write a pure C helper function for creating many random numbers at a time. On the minus side, random number generation is a much disputed topic, occasionally requiring full disclosure of seeds and source. Having the code in random.py makes it more visible than burying it in the C code. The C code I saw is covered by a BSD license -- I don't know if that's an issue or not. As for implementation difficulty or accuracy, the code is so short and clear that there isn't a savings from re-using the C code. Raymond
On Thu, 29 Aug 2002, Raymond Hettinger wrote:
-- Add a version number argument to Random() which defaults to two.
Why not have a WHRandom() and a MersenneRandom() instance inside module random? That way you can even give a future behavior warning that Random() is about to change, and people can either choose the particular generator they want or accept the default.

To my mind, this is a case of explicit (actually naming the generator types) is better than implicit (version number? Where's my documentation? Which generator is which version?). Maybe this isn't a big deal now, but I can believe that we might accumulate another RNG or two (there are some good reasons to want *weaker* or correlated RNGs), and having a weaker generator with a *later* version number is just bound to cause havoc.

-a
-- Add a version number argument to Random() which defaults to two.
Why not have a WHRandom() and a MersenneRandom() instance inside module random? That way you can even give a future behavior warning that Random() is about to change and people can either choose the particular generator they want or accept the default.
To my mind, this is a case of explicit (actually naming the generator types) is better than implicit (version number? Where's my documentation? Which generator is which version?). Maybe this isn't a big deal now, but I can believe that we might accumulate another RNG or two (there are some good reasons to want *weaker* or correlated RNGs), and having a weaker generator with a *later* version number is just bound to cause havoc.
Hm, I hadn't realized that the random.Random class doesn't import the whrandom module but simply reimplements it. Here's an idea.

class BaseRandom: implements the end-user methods: randrange(), choice(), normalvariate(), etc., except random(), which is an abstract method, raising NotImplementedError.

class WHRandom and class MersenneRandom: add the specific random number generator implementation, as random().

Random is an alias for the random generator class of the day, currently MersenneRandom.

Details:

- Can MersenneRandom support jumpahead()?
- Should it support whseed(), which is provided only for backwards compatibility?
- If someone pickles a Random instance with Python 2.2 and tries to unpickle it with Python 2.3, this will fail, because (presumably) the state for MersenneRandom is different from the state for WHRandom. Perhaps there should be a module-level call to make Random an alias for WHRandom rather than for MersenneRandom.

--Guido van Rossum (home page: http://www.python.org/~guido/)
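For concreteness, a minimal sketch of that layout. The class names and the Random alias follow the message above; the Wichmann-Hill constants are the standard published ones; the simplified method bodies and the stubbed-out Twister core are illustrative assumptions only.

    class BaseRandom:
        """End-user methods built on an abstract core random()."""

        def random(self):
            # Subclasses supply the core generator.
            raise NotImplementedError

        def randrange(self, start, stop):
            # Simplified: every derived method funnels through random().
            return start + int(self.random() * (stop - start))

        def choice(self, seq):
            return seq[int(self.random() * len(seq))]

    class WHRandom(BaseRandom):
        """Wichmann-Hill core: three small multiplicative generators."""

        def __init__(self, x=3, y=5, z=7):
            self._seed = (x, y, z)  # three ints, as in whrandom

        def random(self):
            x, y, z = self._seed
            x = (171 * x) % 30269
            y = (172 * y) % 30307
            z = (170 * z) % 30323
            self._seed = (x, y, z)
            return (x / 30269.0 + y / 30307.0 + z / 30323.0) % 1.0

    class MersenneRandom(BaseRandom):
        def random(self):
            # The MT19937 core step would go here.
            raise NotImplementedError

    # The generator-of-the-day alias:
    Random = MersenneRandom

Swapping the default generator in a later release is then a one-line change to the alias, while code that explicitly instantiates WHRandom keeps its old behavior.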
[Raymond Hettinger]
I'm sketching out an approach to the Mersenne Twister and wanted to make sure it is in line with what you want.
-- Write it in pure Python as a drop-in replacement for Wichmann-Hill.
I'd rather the Twister were in C -- it's low-level bit-fiddling, and Python isn't well-suited to high-throughput bit fiddling. IIRC, Ivan Frohne eventually got his pure-Python Twister implementation (see earlier msg) to within 15% of whrandom's speed, but it *could* be 10x faster w/o even trying.
-- Add a version number argument to Random() which defaults to two. If set to one, use the old generator so that it is possible to recreate sequences from earlier versions of Python. Note, the code is much shorter if we drop this requirement. On the plus side, it gives more than backwards compatibility:
Backwards compatibility is essential, at least in the sense that there's *some* way to exactly reproduce results from earlier Python releases. The seed function in whrandom.py was very weak (for reasons explained in random.py), and I did make that incompatible when improving it, but also introduced whseed so that there was *some* way to reproduce older results bit-for-bit. I've gotten exactly one complaint about the incompatibility since then. Note that new generators are intended to be introduced via subclassing of Random:

    # Specific to Wichmann-Hill generator.  Subclasses wishing to use a
    # different core generator should override the seed(), random(),
    # getstate(), setstate() and jumpahead() methods.

I don't believe the jumpahead() method can be usefully implemented for the Twister.
it gives the ability to re-run a simulation with another generator to ensure that the result isn't a fluke related to a generator design flaw.
That's very important in real life. Ivan Frohne's design (and alternative generators) should be considered here.
-- Document David Abrahams's link to http://www.boost.org/boost/random/mersenne_twister.hpp as the reference implementation and http://www.math.keio.ac.jp/matumoto/emt.html as a place for more information. Key off of the MT19937 version as the most recent stable evolution.
I'd simply use the authors' source code. You don't get bonus points for ripping out layers of C++ templates someone else gilded the lily under <wink>.
-- Move the existing in-module test-suite into a unittest.
Easier said than done. The test suite doesn't do anything except print results -- it has no intelligence about whether the results are good or bad. It was expected that a bona fide expert would stare at the output.
Add a new, separate unittest suite with tests specific to MT (recreating a few sequences produced by reference implementations)
I doubt you need a distinct test file for that.
and with a battery of Knuth-style tests.
They're far behind the current state of the art, and, as Knuth mentions in an exercise, it's "term project" level effort to implement them even so.
The validation results are at: http://www.math.keio.ac.jp/matumoto/ver991029.html
-- When we're done, have a python link put on the Mersenne Twister Home Page (the second link above).
Yes! Cool idea.
-- Write, test and document the generator first.
As opposed to what, eating dinner first <wink>?
Afterwards, explore techniques for creating multiple independent streams: http://www.math.h.kyoto-u.ac.jp/~matumoto/RAND/DC/dc.html
Agreed that should be delayed.
[Guido]
Here's an idea.
class BaseRandom: implements the end user methods: randrange(), choice(), normalvariate(), etc., except random(), which is an abstract method, raising NotImplementedError.
That's fine, and close to the intended <wink> way to extend this. BaseRandom should also leave as abstract seed(), getstate(), setstate() (but should implement __getstate__ and __setstate__ -- Random already does this correctly), and jumpahead().
class WHRandom and class MersenneRandom: add the specific random number generator implementation, as random().
seed, getstate, setstate and jumpahead are also intimately connected to the details of the base generator, so they also need to be supplied by subclasses.
Random is an alias for the random generator class of the day, currently MersenneRandom.
I like that better than what we've got now -- I would like to say that Random may vary across releases, as the state of the art advances, so that naive users are far less likely to fool themselves. But users also need to force use of a specific generator at times, and this scheme caters to both.
Details: can MersenneRandom support jumpahead()?
Yes, but I don't think *usefully*. That is, I doubt it could do it faster than calling the base random() N times. jumpahead() is easy to implement efficiently for linear congruential generators, and possible to implement efficiently for pure lagged Fibonacci generators, but that's about it.
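To make the contrast concrete: for a multiplicative congruential generator (each of Wichmann-Hill's three components is one), n steps of x -> a*x (mod m) collapse into a single multiplication by a**n (mod m), which Python's three-argument pow() computes by repeated squaring. A sketch, using the first WH component's constants:

    def mlcg_jumpahead(x, a, m, n):
        """Advance the state of  x' = (a * x) % m  by n steps at once."""
        # pow(a, n, m) is modular exponentiation: O(log n) multiplies.
        return (x * pow(a, n, m)) % m

    # Cross-check against stepping one at a time:
    slow = 12345
    for _ in range(1000):
        slow = (171 * slow) % 30269
    assert mlcg_jumpahead(12345, 171, 30269, 1000) == slow

No such one-multiply shortcut falls out of the Twister's 624-word state.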
Should it support whseed(), which is provided only for backwards compatibility?
It should not. whseed() shouldn't even be used by WHRandom users! It's solely for bit-for-bit reproducibility of an older and weaker scheme.
If someone pickles a Random instance with Python 2.2 and tries to unpickle it with Python 2.3, this will fail, because (presumably) the state for MersenneRandom is different from the state for WHRandom.
That's in part why the concrete classes have to implement getstate() and setstate() appropriately. Note that every 2.2 Random pickle starts with the integer 1 (the then -- and now -- value of Random.VERSION). That's enough of a clue that all future Python versions can know which flavor of random pickle they're looking at. A Twister pickle should start with some non-1 integer.
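A sketch of that tagging; the attribute names and the version number 2 for a Twister pickle are hypothetical, not taken from any shipped random.py:

    class MersenneRandom:
        VERSION = 2  # hypothetical; 2.2's WH pickles lead with 1

        def getstate(self):
            # Lead with the version tag so any future unpickler can
            # tell which flavor of state it's looking at.
            return (self.VERSION, self._state)

        def setstate(self, state):
            version = state[0]
            if version != self.VERSION:
                raise ValueError("state is from an incompatible "
                                 "generator (version %r)" % (version,))
            self._state = state[1]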
Perhaps there should be a module-level call to make Random an alias for WHRandom rather than for MersenneRandom.
I suppose it would also have to replace all the other module globals appropriately. I'm thinking of the

    _inst = Random()
    seed = _inst.seed
    random = _inst.random
    uniform = _inst.uniform
    randint = _inst.randint
    choice = _inst.choice
    randrange = _inst.randrange
    shuffle = _inst.shuffle
    normalvariate = _inst.normalvariate
    lognormvariate = _inst.lognormvariate
    cunifvariate = _inst.cunifvariate
    expovariate = _inst.expovariate
    vonmisesvariate = _inst.vonmisesvariate
    gammavariate = _inst.gammavariate
    stdgamma = _inst.stdgamma
    gauss = _inst.gauss
    betavariate = _inst.betavariate
    paretovariate = _inst.paretovariate
    weibullvariate = _inst.weibullvariate
    getstate = _inst.getstate
    setstate = _inst.setstate
    jumpahead = _inst.jumpahead
    whseed = _inst.whseed

cruft at the end here.
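Such a switch could look like the following hypothetical helper, reusing the classes from the earlier sketch (only a few of the globals are shown):

    # Default setup, mirroring the tail of random.py:
    _inst = Random()
    random = _inst.random
    choice = _inst.choice
    randrange = _inst.randrange

    def use_wichmann_hill():
        # Hypothetical module-level call: rebind the convenience
        # globals to a Wichmann-Hill instance.
        global _inst, random, choice, randrange
        _inst = WHRandom()
        random = _inst.random
        choice = _inst.choice
        randrange = _inst.randrange
        # ...and likewise for seed, uniform, shuffle, and the rest.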
[Raymond]
-- Write, test and document the generator first. Afterwards, explore techniques for creating multiple independent streams: http://www.math.h.kyoto-u.ac.jp/~matumoto/RAND/DC/dc.html
[Guido]
Isn't that trivial if you follow the WH implementation strategy which stores all the state in a class instance?
"Independent" has more than one meaning. The implementation meanings you have in mind (two instances of the generator don't share state, and nothing one does affects what the other does) are indeed trivially achieved via attaching all state to an instance. A different meaning of "independent" is "statistically uncorrelated", and that's more what the link is aiming at. It's never easy to get multiple, statistically independent streams. For example, using WH's jumpahead, it's possible that you pick a large number N, jumpahead(N) in one instance of a WH generator, and then the two streams turn out to be perfectly correlated. That's trivially so if you pick N to be a multiple of WH's period, but there are smaller values of N that also suffer. One reason to run a simulation multiple times with distinct generators is that it's pragmatically impossible to outguess all this stuff. Two generators are a sanity check; three can break the impasse when the first two deliver significantly different results; four is clearly excessive <wink>.
[Raymond Hettinger]
... The C code I saw is covered by a BSD license -- I don't know if that's an issue or not.
That's fine, provided it doesn't have the dreaded "advertising clause". I personally don't care whether it does -- it's the FSF that has a bug up their butt about that one. I expect we'd have to reproduce their copyright notice in the docs somewhere; yup:

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in
       the documentation and/or other materials provided with the
       distribution.

I think we *ought* to perform a similar courtesy for, e.g., the Tcl/Tk and zlib components shipped with the Python Windows installer too.
As for implementation difficulty or accuracy, the code is so short and clear that there isn't a savings from re-using the C code.
That isn't the point here. If you use Nishimura and Matsumoto's code as close to verbatim as possible, then that's the perfect answer to your earlier point:
On the minus side, random number generation is a much disputed topic, occasionally requiring full disclosure of seeds and source.
Nothing *could* be more fully disclosed than their source code: it's extremely well known to every worker in the field, and has gotten critical review from the smartest eyeballs in the world.
On Thu, Aug 29, 2002 at 10:43:28PM -0400, Tim Peters wrote:
[Raymond Hettinger]
... The C code I saw is covered by a BSD license -- I don't know if that's an issue or not.
That's fine, provided it doesn't have the dreaded "advertising clause". I personally don't care whether it does -- it's the FSF that has bug up their butt about that one. I expect we'd have to reproduce their copyright notice
The Mersenne Twister distribution is now under the Artistic License. http://www.math.keio.ac.jp/~matumoto/eartistic.html Oren
[Oren Tirosh]
The Mersenne Twister distribution is now under the Artistic License.
That's out of date. When they updated the code earlier this year to fix the initialization weakness, they also adopted a new license:

    MT19937 with initialization improved 2002/1/26

    The initialization scheme of older versions of MT has a (small)
    problem, that MSBs are not well reflected to the state vector.
    Here is the latest version of initialization scheme, which we
    consider the newest standard. An initialization routine using an
    array of seeds is also available. We adopted BSD-license which we
    think most flexible, so this code is freely usable.

That's much less problematic than the Artistic License (which Stallman holds in contempt; his objection to BSD+advertising_clause is so technical it's hard to give a damn).