[Python-Dev] I think my set module is ready for prime time; comments?

Eric S. Raymond esr@thyrsus.com
Tue, 23 Jan 2001 11:30:50 -0500


--tKW2IUtsqtDRztdT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Guido van Rossum <guido@digicool.com>: 
> I understand that you have probably given this more thought than I
> have recently, so I'd like to see your more detailed analysis of what
> you do and don't like about Greg's proposal!

I've already covered my big objection, the fact that it doesn't
support the degree of polymorphic crossover one might expect with
sequence types (and Greg has agreed that I have a point there).
Another problem is the lack of support for mutable elements (and yes,
I'm quite aware of the problems with this.)

One thing I do like is the proposal for an actual set input syntax.  Of course
this would require that the set type become one of the builtins, with 
compiler support.

> I haven't read your docs yet (and no time because Digital Creations is
> requiring my attention all of today), but I expect that designing a
> universal set type, one that is good enough to be used in all sorts of
> applications, is very difficult.  

For "difficult" read "can't be done".  This is one of those cases where
no matter what implementation you choose, some of the operations you want
to be cheap will be worst-case quadratic.  Life is like that.  So I chose
a dead-simple representation and accepted quadratic times for 
union/intersection.

> > 2. It's simple for application programmers to use.  No extension module
> > to integrate.
> 
> This is a silly argument for wanting something to be added to the
> core.  If it's part of the core, the need for an extension is
> immaterial because that extension will always be available.  So
> I conclude that your module is set up perfectly for a popular module
> in the Vaults. :-)

Reasonable point.
 
> > 3. It's unsurprising.  My set objects behave almost exactly like other
> > mutable sequences, with all the same built-in methods working, except for 
> > the fact that you can't introduce duplicates with the mutators.
> 
> Ah, so you see a set as an extension of a sequence.  That may be the
> big rift between your version and Greg's PEP: are sets more like
> sequences or more like dictionaries?

Indeed it is.  

> > 5. It's simple enough not to cause you maintainance hassles down the
> > road, and even if it did the maintainer is unlikely to disappear :-).
> 
> I'll be the judge of that, and since you prefer not to show your
> source code (why is that?), I can't tell yet.

No nefarious concealment going on here here :-), I've sent versions of
the code to Greg and Ping already.  I'll shoot you a copy too.
 
> Having just skimmed your docs, I'm disappointed that you choose lists
> as your fundamental representation type -- this makes it slow to test
> for membership and hence makes intersection and union slow.

Not quite.  Membership test is still linear-time; so is adding and deleting
elements.  It's true that union and intersection are quadratic, but see below.

>                                                      I suppose
> that you have evidence from using this that those operations aren't
> used much, or not for large sets?

Exactly!  In my experience the usage pattern of a class like this runs
heavily to small sets (usually < 64 elements); membership tests
dominate usage, with addition and deletion of elements running second
and the "classical" boolean operations like union and intersection
being uncommon.

What you get by going with a dictionary representation is that
membership test becomes close to constant-time, while insertion and
deletion become sometimes cheap and sometimes quite expensive
(depending of course on whether you have to allocate a new 
hash bucket).  Given the usage pattern I described, the overall
difference in performance is marginal.

>                              This is one of the problems with
> coming up with a set type for the core: it has to work for (nearly)
> everybody.

As I pointed out above (and someone else on the list had made the same point
earlier), "works for everbody" isn't really possible here.  So my solution
does the next best thing -- pick a choice of tradeoffs that isn't obviously
worse than the alternatives and keeps things bog-simple.
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Alcohol still kills more people every year than all `illegal' drugs put
together, and Prohibition only made it worse.  Oppose the War On Some Drugs!

--tKW2IUtsqtDRztdT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="set.py"

"""
A set-algebra module for Python.

The functions work on any sequence type and return lists.
The set methods can take a set or any sequence type as an argument.
They are insensitive to the types of the elements.

Lists are used rather than dictionaries so the elements can be mutable.

"""
# Design and implementation by ESR, January 2001.

def setify(list1):		# Used by set constructor
    "Remove duplicates in sequence."
    res = []
    for i in range(len(list1)):
	duplicate = 0
        for j in range(i):
	    if list1[i] == list1[j]:
		duplicate = 1
		break
	if not duplicate:
	    res.append(list1[i])
    return res

def union(list1, list2):		# Used for |
    "Compute set intersection of sequences."
    res = list1[:]
    for x in list2:
	if not x in list1:
	    res.append(x)
    return res

def intersection(list1, list2):		# Used for &
    "Compute set intersection of sequences."
    res = []
    for x in list1:
	if x in list2:
	    res.append(x)
    return res

def difference(list1, list2):		# Used for -
    "Compute set difference of sequences."
    res = []
    for x in list1:
	if not x in list2:
	    res.append(x)
    return res

def symmetric_difference(list1, list2):	# Used for ^
    "Compute set symmetric-difference of sequences."
    res = []
    for x in list1:
	if not x in list2:
	    res.append(x)
    for x in list2:
	if not x in list1:
	    res.append(x)
    return res

def cartesian(list1, list2):		# Used for *
    "Cartesian product of sequences considered as sets."
    res = []
    for x in list1:
	for y in list2:
	    res.append((x,y))
    return res

def equality(list1, list2):
    "Test sequences considered as sets for equality."
    if len(list1) != len(list2):
        return 0
    for x in list1:
        if not x in list2:
            return 0
    for x in list2:
        if not x in list1:
            return 0
    return 1

def proper_subset(list1, list2):
    "Return 1 if first argument is a proper subset of second, 0 otherwise."
    if not len(list1) < len(list2):
        return 0
    for x in list1:
        if not x in list2:
            return 0
    return 1

def subset(list1, list2):
    "Return 1 if first argument is a subset of second, 0 otherwise."
    if not len(list1) <= len(list2):
        return 0
    for x in list1:
        if not x in list2:
            return 0
    return 1

def powerset(base):
    "Compute the set of all subsets of a set."
    powerset = []
    for n in xrange(2 ** len(base)):
	subset = []
	for e in xrange(len(base)):
	     if n & 2 ** e:
		subset.append(base[e])
	powerset.append(subset)
    return powerset

class set:
    "Lists with set-theoretic operations."

    def __init__(self, value):
        self.elements = setify(value)

    def __len__(self):
	return len(self.elements)

    def __getitem__(self, ind):
	return self.elements[ind]

    def __setitem__(self, ind, val):
        if val not in self.elements:
            self.elements[ind] = val

    def __delitem__(self, ind):
	del self.elements[ind]

    def list(self):
        return self.elements

    def append(self, new):
        if new not in self.elements:
            self.elements.append(new)

    def extend(self, new):
	self.elements.extend(new)
        self.elements = setify(self.elements)

    def count(self, x):
	self.elements.count(x)

    def index(self, x):
	self.elements.index(x)

    def insert(self, i, x):
        if x not in self.elements:
            self.elements.index(i, x)

    def pop(self, i=None):
	self.elements.pop(i)

    def remove(self, x):
	self.elements.remove(x)

    def reverse(self):
	self.elements.reverse()

    def sort(self, cmp=None):
	self.elements.sort(cmp)

    def __or__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return set(union(self.elements, other))

    __add__ = __or__

    def __and__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return set(intersection(self.elements, other))

    def __sub__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return set(difference(self.elements, other))

    def __xor__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return set(symmetric_difference(self.elements, other))

    def __mul__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return set(cartesian(self.elements, other))

    def __eq__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return self.elements == other

    def __ne__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return self.elements != other

    def __lt__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return proper_subset(self.elements, other)

    def __le__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return subset(self.elements, other)

    def __gt__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return proper_subset(other, self.elements)

    def __ge__(self, other):
	if type(other) == type(self):
	    other = other.elements
        return subset(other, self.elements)

    def __str__(self):
        res = "{"
        for x in self.elements:
            res = res + str(x) + ", "
        res = res[0:-2] + "}"
        return res

    def __repr__(self):
        return repr(self.elements)

if __name__ == '__main__':
    a = set([1, 2, 3, 4])
    b = set([1, 4])
    c = set([5, 6])
    d = [1, 1, 2, 1]
    print `d`, "setifies to", set(d)
    print `a`, "|", `b`, "is", `a | b`
    print `a`, "^", `b`, "is", `a ^ b`
    print `a`, "&", `b`, "is", `a & b`
    print `b`, "*", `c`, "is", `b * c`
    print `a`, '<', `b`, "is", `a < b`
    print `a`, '>', `b`, "is", `a > b`
    print `b`, '<', `c`, "is", `b < c`
    print `b`, '>', `c`, "is", `b > c`
    print "Power set of", `c`, "is", powerset(c)

# end

--tKW2IUtsqtDRztdT--