Grouping pairs - suggested tools

Tue Sep 21 05:41:46 EDT 2010

* Arnaud Delobelle, on 21.09.2010 11:13:
> On Sep 21, 7:19 am, "Alf P. Steinbach /Usenet"<alf.p.steinbach
> +use... at gmail.com>  wrote:
>> * Alf P. Steinbach /Usenet, on 21.09.2010 01:09:
>>
>>
>>
>>
>>
>>> * Astley Le Jasper, on 20.09.2010 23:42:
>>>> I have a list of tuples that indicate a relationship, ie a is related
>>>> to b, b is related to c etc etc. What I want to do is cluster these
>>>> relationships into groups. An item will only be associated with a
>>>> single cluster.
>>
>>>> Before I started, I wondered if there was any particular tool within
>>>> Python I should be looking at. I don't expect anyone to code this for
>>>> me, just say ... "you need to look at using x". I was going to use
>>>> populate a dictionary and
>>
>>>> Sorry for being so vague.
>>
>>>> Example Data:
>>>> [(a,b)
>>>> (a,c)
>>>> (a,d)
>>>> (b,c)
>>>> (b,d)
>>>> (c,d)
>>>> (e,f)
>>>> (e,g)
>>>> (f,g)
>>>> (h,i)]
>>
>>>> Output (grouping id, item ref)
>>>> (1,a),
>>>> (1,b),
>>>> (1,c),
>>>> (1,d),
>>>> (2,e),
>>>> (2,f),
>>>> (2,g),
>>>> (3,h),
>>>> (3,i)
>>
>>> It seems to be the same problem as "equivalence sets".
>>
>>> This problem was solved early on because e.g. Fortran compilers had to construct
>>> such sets (equivalence partitions of a set).
>>
>>> I though I'd just say "Google it!", because I know there's a standard algorithm
>>> but I can't recall the name. However, googling it I found no mention of that
>>> algorithm. Not even in the Wikipedia articles on equivalence sets.
>>
>>> A number of approaches spring to mind:
>>
>>> Approach 1:
>>> Multi-pass. Originally you assign a distinct set number to each symbol.
>>> In each pass through the symbols you replace one number with another as
>>> per one of the equivalence specs. Stop when no replacements are done.
>>
>>> Approach 2:
>>> Merging. For each new equivalence A=B you check whether A has been assigned
>>> to a set, if not you create a new one, call that S1, and ditto for B, S2.
>>> Merge S1 and S2, e.g. move all elements of S2 to S1.
>>
>>> Approach 3:
>>> In a first pass convert the data to more explicit form, linking each symbol
>>> to the symbols it's directly equivalent to. Then in second pass simply drag
>>> out each equivalence group (recurse on each symbol's list of equivalences).
>>
>>> Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
>>> but I think approach 1 may be a practical option for a small number of symbols.
>>
>> Uhm, thinking about it (it must have been my unconscious mind doing the work, it
>> just popped into my head), if you first sort each individual equivalence
>> relation so that you never have e.g. C=A but only A=C and so on, and then sort
>> the list of equivalences, then it should reduce to walking the list and just
>> starting a new set whenever a symbol is encountered that isn't yet in a set.
>>
>> <code language="Py3">
>> class Attributes: pass
>>
>> sym = Attributes()
>> for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
>>       setattr( sym, name, name )
>>
>> eq_specs = [
>>       (sym.a, sym.d),
>>       (sym.b, sym.a),
>>       (sym.b, sym.c),
>>       (sym.b, sym.d),
>>       (sym.c, sym.d),
>>       (sym.c, sym.a),
>>       (sym.e, sym.f),
>>       (sym.e, sym.g),
>>       (sym.f, sym.g),
>>       (sym.h, sym.i),
>>       ]
>>
>> equalities = []
>> for eq in eq_specs:
>>       sorted_eq = eq if eq[0]<= eq[1] else (eq[1], eq[0])
>>       equalities.append( sorted_eq )
>> equalities.sort()
>>
>> eq_sets = {}
>> eq_set_ids = {}
>> current_eq_set_id = 0
>> for eq in equalities:
>
> This would make the body of the loop easier to read:
>
>    for x, y in equalities:
>
>>       if eq_set_ids.get( eq[0] ) is None:
>
> Why not simply:
>
>         if eq[0] in eq_set_ids:
>
>>           current_eq_set_id += 1
>>           eq_set_ids[eq[0]] = current_eq_set_id
>>           eq_sets[current_eq_set_id] = set( eq[0] )
>>       eq_set_id = eq_set_ids[eq[0]]
>>       eq_set_ids[eq[1]] = eq_set_id
>>       eq_sets[eq_set_id].add( eq[1] )
>>
>> for eq_set_id in eq_sets.keys():
>>       for sym_name in eq_sets[eq_set_id]:
>>           print( "{}, {}".format( eq_set_id, sym_name ) )
>> </code>
>>
>> <output>
>> 1, a
>> 1, c
>> 1, b
>> 1, d
>> 2, e
>> 2, g
>> 2, f
>> 3, i
>> 3, h
>> </output>
>>
>> Disclaimer: for me it's pretty late in day/night.
>>
>> Cheers&  hth.,
>>
>> - Alf
>>
>> --
>> blog at<url:http://alfps.wordpress.com>
>
> I think this won't work with the following graph:
>
> eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]
>
> Note that it is already sorted according to your sorting method.  I
> don't have a Python machine to check this but I believe it will
> output:
>
> 1, a
> 1, c
> 1, d
> 2, b
>
> The flaw is that you fail to consider that the two vertices in the
> current pair may already be part of a (partial) connected component.

Yeah, thanks.

I think the three general methods I listed will nicely work, though.

Moral: don't post insights from dream-world... ;-)

Cheers,

- Alf

-- 
blog at <url: http://alfps.wordpress.com>