Clustering technique

Tue Dec 22 06:59:52 EST 2009

On Dec 22, 11:12 am, Luca <nioski... at yahoo.it> wrote:
> Dear all, excuse me if I post a simple question. I am trying to find
> software or an algorithm that can "cluster" simple data on an Excel sheet
>
> Example:
>                 Variable a   Variable b   Variable c
> Case 1        1                   0              0
> Case 2        0                   1              1
> Case 3        1                   0              0
> Case 4        1                   1              0
> Case 5        0                   1              1
>
> The system recognizes that there are 3 possible clusters:
>
> the first with cases that have Variable a as true,
> the second has Variables b and c
> the third is "all the rest"
>
>         Variable a     Variable b    Variable c
>
> Case 1     1               0            0
> Case 3     1               0            0
>
> Case 2     0               1            1
> Case 5     0               1            1
>
> Case 4     1               1            0
>
> Thank you in advance

If you haven't already, download and install xlrd from http://www.python-excel.org;
it's a library that can read Excel workbooks (though not the 2007 format yet).

Or, export as CSV...

Then using either the csv module/xlrd (both well documented) or any
other way of reading the data, you effectively want to end up with
something like this:

rows = [
    # A        B  C  D
    ['Case 1', 1, 0, 0],
    ['Case 2', 0, 1, 1],
    ['Case 3', 1, 0, 0],
    ['Case 4', 1, 1, 0],
    ['Case 5', 0, 1, 1]
]
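
If you go the CSV route, something along these lines should get you to that
list. It's only a sketch: I'm assuming the export has one header line, the
case label in the first column and 0/1 values after it. The sample text and
the load_rows name are made up for illustration.

```python
import csv
import io

# Stand-in for the exported file. With a real export you would pass
# open("cases.csv", newline="") instead (that file name is hypothetical).
CSV_TEXT = """Case,Variable a,Variable b,Variable c
Case 1,1,0,0
Case 2,0,1,1
Case 3,1,0,0
Case 4,1,1,0
Case 5,0,1,1
"""

def load_rows(fileobj):
    """Return [label, int, int, ...] lists, skipping the header line."""
    reader = csv.reader(fileobj)
    next(reader)  # drop the header line
    # Keep the label as text, convert the remaining cells to integers,
    # and ignore any completely empty lines.
    return [[row[0]] + [int(cell) for cell in row[1:]] for row in reader if row]

rows = load_rows(io.StringIO(CSV_TEXT))
```

The csv module hands every cell back as a string, hence the int() conversion.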

One approach is to sort 'rows' by B, C & D. This brings rows with
identical values next to each other in the list. Then you need
an iterator to group them... take a look at itertools.groupby.
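
A rough sketch of the sort-then-group idea, assuming the 'rows' layout shown
earlier (label first, then the three values). The names key and clusters are
just mine.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ['Case 1', 1, 0, 0],
    ['Case 2', 0, 1, 1],
    ['Case 3', 1, 0, 0],
    ['Case 4', 1, 1, 0],
    ['Case 5', 0, 1, 1],
]

# Columns B, C and D together form the grouping key.
key = itemgetter(1, 2, 3)

# groupby only merges *adjacent* items with equal keys, so sort first
# using the same key function.
clusters = [
    (pattern, [row[0] for row in group])
    for pattern, group in groupby(sorted(rows, key=key), key=key)
]

for pattern, cases in clusters:
    print(pattern, cases)
```

Each entry in clusters pairs a (B, C, D) pattern with the labels of the cases
that share it. Note this only groups rows whose values match exactly.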

Another is to use a defaultdict(list) from collections and just
loop over the rows, again with B, C & D as the key and A appended
to the list.
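
Roughly like this, again assuming the 'rows' layout from above (the name
groups is mine). The key has to be a tuple, since lists can't be used as
dictionary keys. Written for Python 3.

```python
from collections import defaultdict

rows = [
    ['Case 1', 1, 0, 0],
    ['Case 2', 0, 1, 1],
    ['Case 3', 1, 0, 0],
    ['Case 4', 1, 1, 0],
    ['Case 5', 0, 1, 1],
]

# Maps each (B, C, D) pattern to the list of case labels that have it.
groups = defaultdict(list)
for label, *values in rows:
    groups[tuple(values)].append(label)

for pattern, cases in groups.items():
    print(pattern, cases)
```

No sorting is needed here, and the cases within each group stay in their
original order.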

hth
Jon.