I think my set module is ready for prime time; comments?

\section{\module{set} --- Basic set algebra for Python} \declaremodule{standard}{set} \modulesynopsis{Basic set algebra operations on sequences.} \moduleauthor{Eric S. Raymond}{esr@thyrsus.com} \sectionauthor{Eric S. Raymond}{esr@thyrsus.com} The \module{set} module defines functions for treating lists and other sequences as mathematical sets, and defines a set class that uses these operations natively and overloads Python's standard operator set. The \module{set} functions work on any sequence type and return lists. The set methods can take a set or any sequence type as an argument. Set or sequence elements may be of any type and may be mutable. Comparisons and membership tests of elements against sequence objects are done using \keyword{in}, and so can be customized by supplying a suitable \method{__getattr__} method for the sequence type. The running time of these functions is O(n**2) in the worst case unless otherwise noted. For cases that can be short-circuited by cardinality comparisons, this has been done. \begin{funcdesc}{setify}{list1} Returns a list of the argument sequence's elements with duplicates removed. \end{funcdesc} \begin{funcdesc}{union}{list1, list2} Set union. All elements of both sets or sequences are returned. \end{funcdesc} \begin{funcdesc}{intersection}{list1, list2} Set intersection. All elements common to both sets or sequences are returned. \end{funcdesc} \begin{funcdesc}{difference}{list1, list2} Set difference. All elements of the first set or sequence not present in the second are returned. \end{funcdesc} \begin{funcdesc}{symmetric_difference}{list1, list2} Set symmetric difference. All elements present in one sequence or the other but not in both are returned. \end{funcdesc} \begin{funcdesc}{cartesian}{list1, list2} Returns a list of tuples consisting of all possible pairs of elements from the first and second sequences or sets. \end{funcdesc} \begin{funcdesc}{equality}{list1, list2} Set comparison. Return 1 if the two sets or sequences contain exactly the same elements, 0 or otherwise. \end{funcdesc} \begin{funcdesc}{subset}{list1, list2} Set subset test. Return 1 if all elements of the fiorst set or sequence are members of the second, 0 otherwise. \end{funcdesc} \begin{funcdesc}{proper_subset}{list1, list2} Set subset test, excluding equality. Return 1 if the arguments fail a set equality test, and all elements of the fiorst set or sequence are members of the second, 0 otherwise. \end{funcdesc} \begin{funcdesc}{powerset}{list1} Return the set of all subsets of the argument set or sequence. Warning: this produces huge results from small arguments and is O(2**n) in both running time and space requirements; you can readily run yourself out of memory using it. \end{funcdesc} \subsection{set Objects \label{set-objects}} A \class{set} instance uses the \module{set} module functions to implement set semantics on the list it contains, and to support a full set of Python list methods and operaors. Thus, the set methods can take a set or any sequence type as an argument. A set object contains a single data member: \begin{memberdesc}{elements} List containing the elements of the set. \end{memberdesc} Set objects can be treated as mutable sequences; they support the special methods \method{__len__}, \method{__getattr__}, \method{__setattr__}, and \method{__delattr__}. Through \method{__getattr__}, they support the memebership test via \keyword{in}. All the standard mutable-sequence methods \method{list}, \method{append}, \method{extend}, \method{count}, \method{index}, \method{insert} (the index argument is ignored), \method{pop}, \method{remove}, \method{reverse}, and \method{sort} are also supported. After method calls that add elements (\method{setattr}, \method{append}, \method{extend}, \method{insert}), the elements of the data member are re-setified, so it is not possible to introduce duplicates. Calling \function{repr()} on a set returns the result of calling \function{repr} on its element list. Calling \function{str()} returns a representation resembling mathematical notation for the set; an open set bracket, followed by a comma-separated list of \function{str()} representations of the elements, followed by a close set brackets. Set objects support the following Python operators: \begin {tableiii}{l|l|l}{code}{Operator}{Function}{Description} \lineiii{|,+}{union}{Union} \lineiii{&}{intersection}{Intersection} \lineiii{-}{difference}{Difference} \lineiii{^}{symmetric_difference}{Symmetric differe} \lineiii{*}{cartesian}{Cartesian product} \lineiii{==}{equality}{Equality test} \lineiii{!=,<>}{}{Inequality test} \lineiii{<}{proper_subset}{Proper-subset test} \lineiii{<=}{subset}{Subset test} \lineiii{>}{}{Proper superset test} \lineiii{>=}{}{Superset test} \end {tableiii} -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a> Government is actually the worst failure of civilized man. There has never been a really good one, and even those that are most tolerable are arbitrary, cruel, grasping and unintelligent. -- H. L. Mencken

[LaTeX file] Eric, we are all hackers, but plain LaTeX is not really the right format for a posting to a mailing list... at least not if you really expect feedback ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

Eric, There's already a PEP on a set object type, and everybody and their aunt has already implemented a set datatype. If *your* set module is ready for prime time, why not publish it in the Vaults of Parnassus? --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@digicool.com>:
There's already a PEP on a set object type, and everybody and their aunt has already implemented a set datatype.
I've just read the PEP. Greg's proposal has a couple of problems. The biggest one is that the interface design isn't very Pythonic -- it's formally adequate, but doesn't exploit the extent to which sets naturally have common semantics with existing Python sequence types. This is bad; it means that a lot of code that could otherwise ignore the difference between lists and sets would have to be specialized one way or the other for no good reason. The only other set module I can find in the Vaults or anywhere else is kjBuckets (which I knew about before). Looks like a good design, but complicated -- and requires installation of an extension.
If *your* set module is ready for prime time, why not publish it in the Vaults of Parnassus?
I suppose that's what I'll do if you don't bless it for the standard library. But here are the reasons I suggest you should do so: 1. It supports a set of operations that are both often useful and fiddly to get right, thus enhancing the "batteries are included" effect. (I used its ancestor for representing seen-message numbers in a specialized mailreader, for example.) 2. It's simple for application programmers to use. No extension module to integrate. 3. It's unsurprising. My set objects behave almost exactly like other mutable sequences, with all the same built-in methods working, except for the fact that you can't introduce duplicates with the mutators. 4. It's already completely documented in a form suitable for the library. 5. It's simple enough not to cause you maintainance hassles down the road, and even if it did the maintainer is unlikely to disappear :-). -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a> The United States is in no way founded upon the Christian religion -- George Washington & John Adams, in a diplomatic message to Malta.

"Eric S. Raymond" wrote:
Guido van Rossum <guido@digicool.com>:
There's already a PEP on a set object type, and everybody and their aunt has already implemented a set datatype.
I've just read the PEP. Greg's proposal has a couple of problems. The biggest one is that the interface design isn't very Pythonic -- it's formally adequate, but doesn't exploit the extent to which sets naturally have common semantics with existing Python sequence types. This is bad; it means that a lot of code that could otherwise ignore the difference between lists and sets would have to be specialized one way or the other for no good reason.
The only other set module I can find in the Vaults or anywhere else is kjBuckets (which I knew about before). Looks like a good design, but complicated -- and requires installation of an extension.
There's also a kjSet.py available at Aaron's site: http://www.chordate.com/kwParsing/index.html which is a pure Python version of the C extenion's kjSet type.
If *your* set module is ready for prime time, why not publish it in the Vaults of Parnassus?
I suppose that's what I'll do if you don't bless it for the standard library. But here are the reasons I suggest you should do so:
1. It supports a set of operations that are both often useful and fiddly to get right, thus enhancing the "batteries are included" effect. (I used its ancestor for representing seen-message numbers in a specialized mailreader, for example.)
2. It's simple for application programmers to use. No extension module to integrate.
3. It's unsurprising. My set objects behave almost exactly like other mutable sequences, with all the same built-in methods working, except for the fact that you can't introduce duplicates with the mutators.
4. It's already completely documented in a form suitable for the library.
5. It's simple enough not to cause you maintainance hassles down the road, and even if it did the maintainer is unlikely to disappear :-).
All very well, but are sets really that essential to every day Python programming ? If we include sets then we ought to also include graphs, tries, btrees and all those other goodies we have in computer science. All of these types are available out there, but I believe the audience who really cares for these types is also capable of downloading the extensions and installing them. It would be nice if all of these extension could go into a SUMO edition of Python though... together with your set module. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg <mal@lemburg.com>:
All very well, but are sets really that essential to every day Python programming ? If we include sets then we ought to also include graphs, tries, btrees and all those other goodies we have in computer science.
I use sets a lot. And there was enough demand to generate a PEP. But the wider question here is how seriously we take "batteries are included" as a design principle. Does a facility have to be useful *every day* to be worth being in the standard library? And if so, what are things like the POP3 and IMAP libraries (or, for that matter, my own shlex and netrc modules) doing there? I don't think so. I think there are at least four different possible reasons for something to be in the standard library: 1. It's useful every day. 2. It's useful less frequently than every day, but is a stable cross-platform implementation of a wheel that would otherwise have to be reinvented frequently. That is, you can solve it *once* and have a zero-maintainance increment to the power of the language. 3. It's a technique that's not often used, and not necessarily stable in the face of platform variations, but nothing else will do when you need it and it's notably difficult to get right. (popen2 and BaseHTTPServer would be good examples of this.) 4. It's a developer checklist feature that improves Python's competitive position against Perl, Tcl, and other contenders for the same ecological niche. IMO a lightweight set facility, like POP3 and IMAP, qualifies under 2 and 4 even if not under 1 and 3. This question keeps coming up in different guises. I'm often the one to raise it, because I favor an aggressive interpretation of "batteries are included" that would pull in a lot of stuff. Yes, this makes more work for us -- but I think it's work we should be doing. While minimalism is an excellent design heuristic for the core language, I think it's a bad one for the libraries. Python is a high-level language and programmers using it both expect and deserve high-level libraries -- yes, including graphs/tries/btrees and all that computer science stuff. Just as much to the point, Python competing against languages like Perl that frequently get design wins against it because of the richness of the environment *they* are willing to carry around. Guido and Tim and others are more conservative than I, which would be OK -- but it seems to me that the conservatives do not have consistent or well-thought-out criteria for what to include, which is *not* OK. We need to solve this problem. Some time back I initiated a library guidelines PEP, then dropped it due to press of overwork. But the general question is going to keep coming up and we ought to have policy guidelines that potential library developers can understand. Should I pick this up again? -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a> I do not find in orthodox Christianity one redeeming feature. -- Thomas Jefferson

"Eric S. Raymond" wrote:
M.-A. Lemburg <mal@lemburg.com>:
All very well, but are sets really that essential to every day Python programming ? If we include sets then we ought to also include graphs, tries, btrees and all those other goodies we have in computer science.
I use sets a lot. And there was enough demand to generate a PEP.
Sure, but sets are fairly easy to implement using Python dictionaries -- at least at the level normally needed by Python programs. Sets, queues and graphs are examples of data types which can have many different faces; it is hard to design APIs for these which meet everyones needs.
But the wider question here is how seriously we take "batteries are included" as a design principle. Does a facility have to be useful *every day* to be worth being in the standard library? And if so, what are things like the POP3 and IMAP libraries (or, for that matter, my own shlex and netrc modules) doing there?
You can argue the same way for all kinds of extensions and packages you find in the Vaults. That's why there's demand for a different packaging of Python and this is what Moshe's PEP 206 addresses: http://python.sourceforge.net/peps/pep-0206.html
I don't think so. I think there are at least four different possible reasons for something to be in the standard library:
1. It's useful every day.
2. It's useful less frequently than every day, but is a stable cross-platform implementation of a wheel that would otherwise have to be reinvented frequently. That is, you can solve it *once* and have a zero-maintainance increment to the power of the language.
3. It's a technique that's not often used, and not necessarily stable in the face of platform variations, but nothing else will do when you need it and it's notably difficult to get right. (popen2 and BaseHTTPServer would be good examples of this.)
4. It's a developer checklist feature that improves Python's competitive position against Perl, Tcl, and other contenders for the same ecological niche.
IMO a lightweight set facility, like POP3 and IMAP, qualifies under 2 and 4 even if not under 1 and 3.
This question keeps coming up in different guises. I'm often the one to raise it, because I favor an aggressive interpretation of "batteries are included" that would pull in a lot of stuff. Yes, this makes more work for us -- but I think it's work we should be doing.
While minimalism is an excellent design heuristic for the core language, I think it's a bad one for the libraries. Python is a high-level language and programmers using it both expect and deserve high-level libraries -- yes, including graphs/tries/btrees and all that computer science stuff.
Just as much to the point, Python competing against languages like Perl that frequently get design wins against it because of the richness of the environment *they* are willing to carry around.
Guido and Tim and others are more conservative than I, which would be OK -- but it seems to me that the conservatives do not have consistent or well-thought-out criteria for what to include, which is *not* OK. We need to solve this problem.
Some time back I initiated a library guidelines PEP, then dropped it due to press of overwork. But the general question is going to keep coming up and we ought to have policy guidelines that potential library developers can understand.
Should I pick this up again?
Hmm, we already have the PEP 206 which focusses on the topic. Perhaps you could work with Moshe to sort out the "which batteries do we need" sub-topic ?! -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

M.-A. Lemburg <mal@lemburg.com>:
But the wider question here is how seriously we take "batteries are included" as a design principle. Does a facility have to be useful *every day* to be worth being in the standard library? And if so, what are things like the POP3 and IMAP libraries (or, for that matter, my own shlex and netrc modules) doing there?
You can argue the same way for all kinds of extensions and packages you find in the Vaults. That's why there's demand for a different packaging of Python and this is what Moshe's PEP 206 addresses:
Muttering "PEP 206" evades the fundamental problem rather than solving it. Not that I'm saying Moshe hasn't made a valiant effort, within the political constraint that the BDFL and others seem unwilling to confront the deeper issue. But PEP 206 is not enough. Here is why: 1. If the "Sumo" packaging ever happens, the vanilla non-Sumo version that Guido issues will quickly become of mostly theoretical interest -- because Red Hat and everybody else will move to Sumo instantly, figuring they have nothing to lose by including more features. 2. If by some change I'm wrong about 1, the outcome will be worse; we'll in effect have fragmented the language, because there won't be consistency in what library stuff is available between Sumo and non-Sumo builds on the same platform. 3. There are documentation issues as well. It's already a blot on Python that the standard documentation set doesn't cover Tkinter. In the Sumo distribution, the gap between what's installed and what's documented is likely to widen further. Developers will see this as pointlessly irritating -- and they'll be right. The stock distribution should *be* the Sumo distribution. If we're really so terrified of the extra maintainence load, then the right fix is to mark some modules and documentation as "externally maintained" with prominent pointers back to the responsible people. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a> The day will come when the mystical generation of Jesus by the Supreme Being as his father, in the womb of a virgin, will be classed with the fable of the generation of Minerva in the brain of Jupiter. -- Thomas Jefferson, 1823

"Eric S. Raymond" wrote:
M.-A. Lemburg <mal@lemburg.com>:
But the wider question here is how seriously we take "batteries are included" as a design principle. Does a facility have to be useful *every day* to be worth being in the standard library? And if so, what are things like the POP3 and IMAP libraries (or, for that matter, my own shlex and netrc modules) doing there?
You can argue the same way for all kinds of extensions and packages you find in the Vaults. That's why there's demand for a different packaging of Python and this is what Moshe's PEP 206 addresses:
Muttering "PEP 206" evades the fundamental problem rather than solving it.
Not that I'm saying Moshe hasn't made a valiant effort, within the political constraint that the BDFL and others seem unwilling to confront the deeper issue. But PEP 206 is not enough. Here is why:
1. If the "Sumo" packaging ever happens, the vanilla non-Sumo version that Guido issues will quickly become of mostly theoretical interest -- because Red Hat and everybody else will move to Sumo instantly, figuring they have nothing to lose by including more features.
2. If by some change I'm wrong about 1, the outcome will be worse; we'll in effect have fragmented the language, because there won't be consistency in what library stuff is available between Sumo and non-Sumo builds on the same platform.
3. There are documentation issues as well. It's already a blot on Python that the standard documentation set doesn't cover Tkinter. In the Sumo distribution, the gap between what's installed and what's documented is likely to widen further. Developers will see this as pointlessly irritating -- and they'll be right.
The stock distribution should *be* the Sumo distribution. If we're really so terrified of the extra maintainence load, then the right fix is to mark some modules and documentation as "externally maintained" with prominent pointers back to the responsible people.
That's your POV, others think different and since this is not a democracy, the Sumo distribution is a feasable way of satisfying both needs. There are a few other issues to consider as well: * licensing is a problem (and this is also mentioned in the PEP 206) since some of the nicer additions are GPLed and thus not in the spirit of Python's closed-source friendliness which has provided it with a large user base in the commercial field * packages authors are not all the same and some may not want to split their distribution due to the integration of their package in a Sumo-distribution * the packages mentioned in PEP 206 are very complex and usually largish; maintaining them will cause much more effort compared to the standard lib modules and extensions * the build process varies widely between packages; even though we have distutils, some of the packages extend it to fit their specific needs (which is OK, but causes extra efforts in getting the build process combined) I'm not objecting to the Sumo-distribution project; to the contrary -- I tried a similar project a few years ago: the Python PowerTools distribution which you can download from: http://www.lemburg.com/python/PowerTools-0.2.zip The project died quickly though, as I wasn't able to keep up with the maintenance effort. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

On Tue, Jan 23, 2001 at 01:48:09PM +0100, M.-A. Lemburg wrote:
There are a few other issues to consider as well: <good list deleted>
To add a few: * The larger the amount of code in the distribution, the more effort it is maintain it all. * Minor fixes aren't available until the next Python release. For example, to drag out the XML code again: there have been two PyXML releases since Python 2.0 fixing various bugs, but someone who sticks to installing just Python will not be able to get at those bugfixes until April (when 2.1 is supposed to get finalized). If there were a core Python distribution and a sumo distribution, and the sumo distribution was the one that most people downloaded and used, that would be perfectly OK. Practically no one assembles their own Linux distribution, and that's not considered a problem. To some degree, if you're using a well-packaged Linux distribution such as Debian, you also have Python distribution mechanism with intermodule dependencies; we just have to reinvent the wheel for people on other platforms.
The project died quickly though, as I wasn't able to keep up with the maintenance effort.
Interesting. Did you get much feedback indicating that people used it much? Perhaps when you were doing that effort the Python community was composed more of self-reliant early adopter types; there are probably more newbies around now. --amk

Andrew Kuchling wrote:
On Tue, Jan 23, 2001 at 01:48:09PM +0100, M.-A. Lemburg wrote:
There are a few other issues to consider as well: <good list deleted>
To add a few:
* The larger the amount of code in the distribution, the more effort it is maintain it all.
* Minor fixes aren't available until the next Python release. For example, to drag out the XML code again: there have been two PyXML releases since Python 2.0 fixing various bugs, but someone who sticks to installing just Python will not be able to get at those bugfixes until April (when 2.1 is supposed to get finalized).
If there were a core Python distribution and a sumo distribution, and the sumo distribution was the one that most people downloaded and used, that would be perfectly OK. Practically no one assembles their own Linux distribution, and that's not considered a problem. To some degree, if you're using a well-packaged Linux distribution such as Debian, you also have Python distribution mechanism with intermodule dependencies; we just have to reinvent the wheel for people on other platforms.
The project died quickly though, as I wasn't able to keep up with the maintenance effort.
Interesting. Did you get much feedback indicating that people used it much?
Not much -- the interested parties were mostly Python experts (the lib started out as a project called expert-lib).
Perhaps when you were doing that effort the Python community was composed more of self-reliant early adopter types; there are probably more newbies around now.
True. The included packages are dated 1997-1998 -- at that time Starship was just starting to get off the ground (this are moving at a much faster pace now). The PowerTools package still uses the Makefile.pre.in mechanism (with much success though) as distutils wasn't even considered at the time. Perhaps Moshe could pick this up to have a head start for Sumo-Python ?! Some of the included packages are not available elsewhere, AFAIK, so it may well be worthwhile having a look (e.g. the LGPLed trie and btree implementations donated by John W. M. Stevens). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

I've just read the PEP. Greg's proposal has a couple of problems. The biggest one is that the interface design isn't very Pythonic -- it's formally adequate, but doesn't exploit the extent to which sets naturally have common semantics with existing Python sequence types. This is bad; it means that a lot of code that could otherwise ignore the difference between lists and sets would have to be specialized one way or the other for no good reason.
Actually, I thought that Greg's proposal has some charm: it seems to be using a natural extension of the existing dictionary syntax, where a set is a dictionary without the values. I haven't thought about this deeply enough, but I see a lot of potential here. I understand that you have probably given this more thought than I have recently, so I'd like to see your more detailed analysis of what you do and don't like about Greg's proposal!
The only other set module I can find in the Vaults or anywhere else is kjBuckets (which I knew about before). Looks like a good design, but complicated -- and requires installation of an extension.
If *your* set module is ready for prime time, why not publish it in the Vaults of Parnassus?
I suppose that's what I'll do if you don't bless it for the standard library. But here are the reasons I suggest you should do so:
1. It supports a set of operations that are both often useful and fiddly to get right, thus enhancing the "batteries are included" effect. (I used its ancestor for representing seen-message numbers in a specialized mailreader, for example.)
I haven't read your docs yet (and no time because Digital Creations is requiring my attention all of today), but I expect that designing a universal set type, one that is good enough to be used in all sorts of applications, is very difficult.
2. It's simple for application programmers to use. No extension module to integrate.
This is a silly argument for wanting something to be added to the core. If it's part of the core, the need for an extension is immaterial because that extension will always be available. So I conclude that your module is set up perfectly for a popular module in the Vaults. :-)
3. It's unsurprising. My set objects behave almost exactly like other mutable sequences, with all the same built-in methods working, except for the fact that you can't introduce duplicates with the mutators.
Ah, so you see a set as an extension of a sequence. That may be the big rift between your version and Greg's PEP: are sets more like sequences or more like dictionaries?
4. It's already completely documented in a form suitable for the library.
Much appreciated.
5. It's simple enough not to cause you maintainance hassles down the road, and even if it did the maintainer is unlikely to disappear :-).
I'll be the judge of that, and since you prefer not to show your source code (why is that?), I can't tell yet. [...time flows...] Having just skimmed your docs, I'm disappointed that you choose lists as your fundamental representation type -- this makes it slow to test for membership and hence makes intersection and union slow. I suppose that you have evidence from using this that those operations aren't used much, or not for large sets? This is one of the problems with coming up with a set type for the core: it has to work for (nearly) everybody. It's no big deal if the Vaults contain three or more set modules -- perfect even, people can choose the best one for their purpose. But in the core, there's only room for one set type or module. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@digicool.com>:
I understand that you have probably given this more thought than I have recently, so I'd like to see your more detailed analysis of what you do and don't like about Greg's proposal!
I've already covered my big objection, the fact that it doesn't support the degree of polymorphic crossover one might expect with sequence types (and Greg has agreed that I have a point there). Another problem is the lack of support for mutable elements (and yes, I'm quite aware of the problems with this.) One thing I do like is the proposal for an actual set input syntax. Of course this would require that the set type become one of the builtins, with compiler support.
I haven't read your docs yet (and no time because Digital Creations is requiring my attention all of today), but I expect that designing a universal set type, one that is good enough to be used in all sorts of applications, is very difficult.
For "difficult" read "can't be done". This is one of those cases where no matter what implementation you choose, some of the operations you want to be cheap will be worst-case quadratic. Life is like that. So I chose a dead-simple representation and accepted quadratic times for union/intersection.
2. It's simple for application programmers to use. No extension module to integrate.
This is a silly argument for wanting something to be added to the core. If it's part of the core, the need for an extension is immaterial because that extension will always be available. So I conclude that your module is set up perfectly for a popular module in the Vaults. :-)
Reasonable point.
3. It's unsurprising. My set objects behave almost exactly like other mutable sequences, with all the same built-in methods working, except for the fact that you can't introduce duplicates with the mutators.
Ah, so you see a set as an extension of a sequence. That may be the big rift between your version and Greg's PEP: are sets more like sequences or more like dictionaries?
Indeed it is.
5. It's simple enough not to cause you maintainance hassles down the road, and even if it did the maintainer is unlikely to disappear :-).
I'll be the judge of that, and since you prefer not to show your source code (why is that?), I can't tell yet.
No nefarious concealment going on here here :-), I've sent versions of the code to Greg and Ping already. I'll shoot you a copy too.
Having just skimmed your docs, I'm disappointed that you choose lists as your fundamental representation type -- this makes it slow to test for membership and hence makes intersection and union slow.
Not quite. Membership test is still linear-time; so is adding and deleting elements. It's true that union and intersection are quadratic, but see below.
I suppose that you have evidence from using this that those operations aren't used much, or not for large sets?
Exactly! In my experience the usage pattern of a class like this runs heavily to small sets (usually < 64 elements); membership tests dominate usage, with addition and deletion of elements running second and the "classical" boolean operations like union and intersection being uncommon. What you get by going with a dictionary representation is that membership test becomes close to constant-time, while insertion and deletion become sometimes cheap and sometimes quite expensive (depending of course on whether you have to allocate a new hash bucket). Given the usage pattern I described, the overall difference in performance is marginal.
This is one of the problems with coming up with a set type for the core: it has to work for (nearly) everybody.
As I pointed out above (and someone else on the list had made the same point earlier), "works for everbody" isn't really possible here. So my solution does the next best thing -- pick a choice of tradeoffs that isn't obviously worse than the alternatives and keeps things bog-simple. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a> Alcohol still kills more people every year than all `illegal' drugs put together, and Prohibition only made it worse. Oppose the War On Some Drugs!

[Eric S. Raymond]
... What you get by going with a dictionary representation is that membership test becomes close to constant-time, while insertion and deletion become sometimes cheap and sometimes quite expensive (depending of course on whether you have to allocate a new hash bucket).
Note that Python's dicts aren't vulnerable to that: they use open addressing in a contiguous, preallocated vector. There are no mallocs() or free()s going on for lookups, deletes, or inserts, unless an insert happens to hit a "time to double the size of the vector" boundary. Deletes never cost more than a lookup; inserts never more unless the table-size boundary is hit (one in 2**N unique inserts, at which point N goes up too).
... "works for everbody" isn't really possible here. So my solution does the next best thing -- pick a choice of tradeoffs that isn't obviously worse than the alternatives and keeps things bog-simple.
I agree that this shouldn't be an either/or choice, but if it's going to be forced into that mold I have to protest that the performance of unordered lists would kill most of the set applications I've ever had. I typically have a small number of very large sets (and I'm talking not 100s, but often 100s of 1000s of elements). The relatively large memory burden of a dict representation wouldn't bother me unless I instead had 100s of 1000s of very small sets. which-we-may-happen-in-my-next-life-but-not-in-this-one-ly y'rs - tim

[Guido]
... It's no big deal if the Vaults contain three or more set modules -- perfect even, people can choose the best one for their purpose.
They really can't, not realistically, unless all the modules in question conform to the same interface (which users can't control), and users restrict themselves to methods defined only in the interface (which users can control). The problem is that "their purpose" changes over time, and in some cases the effects of representation on performance simply can't be out-guessed in advance of actual measurement. If people need to change any more than just the import statement, *then* a single implementation has to be all things to all people. I hate to say this (bet <wink>?), but I suspect the fact that Python's basic types are all builtin and not classes has kept us from fully appreciating the class-based "1 interface, N implementations" approach that C++ and Java hackers are having so much fun with. They're not all that easy to find, but people who have climbed the steep STL learning curve often end up in the same ecstatic trance I used to see only among fellow Pythoneers.
But in the core, there's only room for one set type or module.
I don't like the conclusion: it implies there's no room in the core for more than one implementation of anything, yet one-size-fits-all doesn't. I have no problem with the idea that there's only room for one Set *interface* in the core. Then you only need Pronounce on a reasonable set of abstract operations, and leave the implementation tradeoffs to be made by different people in different ways (I've really got no use for Eric's list-based sets; he's really got no use for my sets-of-sets). That said, if there can be at most one, and must be at least one, a hashtable based set is the best compromise there is, and mutable objects as elements should not be supported (they add great implementation complexity for the benefit of relatively few applications). jeremy's-set-class-couldn't-be-accused-of-overkill<wink>-ly y'rs - tim
participants (6)
-
Andrew Kuchling
-
Eric S. Raymond
-
Guido van Rossum
-
Ka-Ping Yee
-
M.-A. Lemburg
-
Tim Peters