list of unique non-subset sets

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Thu Mar 17 20:09:40 EST 2005


s1 = set(['a','b','c'])
s2 = set(['a','c'])
s3 = set(['a','d','e','f'])
s4 = set(['r','k','l'])
s5 = set(['r','k','l'])
ls = [s1,s2,s3,s4,s5]
result1 = [s1, s3, s5]

A result can contain s4 or s5 at random. This problem looks like the
one of searching the correct hyperedges for a hypergraph. s1-5 are the
hyperedges, and "a", "b",... are the nodes. Probably this problem is
already solved somewhere.

. # You can start removing the duplicated sets (slow operation):
. ls2 = list(set(map(frozenset, ls)))
. # Then I've found a solution similar to the first Raymond Hettinger
one:
. result = []
. for si in ls2:
.     for sj in result:
.         if si <= sj:
.             break
.     else:
.         result.append(si)
. print result

This can be slow, but it probably can be made a bit faster version, but
it's not easy. You can put the nodes in a undirected graph that
represents the hypergraph, with other "extra nodes" that represent the
hyperedges already in result).

. print list(enumerate(map(list,ls2)))
. # Output:
[(0,['k','r','l']),(1,['a','c','b']),(2,['a','e','d','f']),(3,['a','c'])]
. # Probably you can work on ls too, avoiding the slow creation of ls2,
but you have to cheek
. # more things later.
. import graph
. g = graph.Graph()
. for n1,s in enumerate(ls2):
.     for n2 in s:
.         g.addBiArc(n1, n2) # add edges and nodes as needed
.
. # then you can find the connected components of the graph
. print g # Output: 0-k,l,r  1-a,b,c  2-a,d,e,f  3-a,c
. cc = g.connectedComponents()
. # now cc = [[0, 'k', 'r', 'l'], [1, 'a', 'c', 'b', 2, 3, 'e', 'd',
'f']]
. # and you can filter the hyperedges. This can be improved a lot:
. ccf = [ [n for n in c if type(n)==int] for c in cc ]
. # now ccf = [[0], [1, 2, 3]]

Eppstein unionFind data structure can also be used for a similar
purpose. 0-th element of ls2 is set(['r','k','l']). It's totally
disjoint from the others. Now you don't have to look every sj in
result, but only the sj of result that are inside its connected ones.
For example, you don't have to test set(['a','c']) against
set(['r','k','l']). I don't know how much faster this can be compared
to the simple O(len(ls)^2) program seen before... But it can be useful
for situations with lots of sets.

Bye,
Bearophile




More information about the Python-list mailing list