Grouping pairs - suggested tools

Alf P. Steinbach /Usenet alf.p.steinbach+usenet at gmail.com
Tue Sep 21 01:09:09 CEST 2010


* Astley Le Jasper, on 20.09.2010 23:42:
> I have a list of tuples that indicate a relationship, ie a is related
> to b, b is related to c etc etc. What I want to do is cluster these
> relationships into groups.  An item will only be associated with a
> single cluster.
>
> Before I started, I wondered if there was any particular tool within
> Python I should be looking at. I don't expect anyone to code this for
> me, just say ... "you need to look at using x". I was going to use
> populate a dictionary and
>
> Sorry for being so vague.
>
> Example Data:
> [(a,b)
> (a,c)
> (a,d)
> (b,c)
> (b,d)
> (c,d)
> (e,f)
> (e,g)
> (f,g)
> (h,i)]
>
> Output (grouping id, item ref)
> (1,a),
> (1,b),
> (1,c),
> (1,d),
> (2,e),
> (2,f),
> (2,g),
> (3,h),
> (3,i)

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct 
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm 
but I can't recall the name. However, googling it I found no mention of that 
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

   Approach 1:
   Multi-pass. Originally you assign a distinct set number to each symbol.
   In each pass through the symbols you replace one number with another as
   per one of the equivalence specs. Stop when no replacements are done.

   Approach 2:
   Merging. For each new equivalence A=B you check whether A has been assigned
   to a set, if not you create a new one, call that S1, and ditto for B, S2.
   Merge S1 and S2, e.g. move all elements of S2 to S1.

   Approach 3:
   In a first pass convert the data to more explicit form, linking each symbol
   to the symbols it's directly equivalent to. Then in second pass simply drag
   out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols, 
but I think approach 1 may be a practical option for a small number of symbols.


Cheers & hth.,

- Alf

-- 
blog at <url: http://alfps.wordpress.com>



More information about the Python-list mailing list