# help in algorithm

Paolino paolo_veronelli at libero.it
Thu Aug 11 22:23:13 CEST 2005

```
Bengt Richter wrote:
> On Wed, 10 Aug 2005 16:51:55 +0200, Paolino <paolo_veronelli at tiscali.it> wrote:
>
>
>>I have a self-organizing net whose aim is clustering words.
>>Let's say the clustering is based on their sets of 2-grams.
>>Words are then instances of this class.
>>
>>class clusterable(str):
>>  def __abs__(self):# the set of q-grams (to be calculated only once)
>>    return set([(self+self[0])[n:n+2] for n in range(len(self))])
>>  def __sub__(self,other): # the q-grams distance between 2 words
>>    set1=abs(self)
>>    set2=abs(other)
>>    return len(set1|set2)-len(set1&set2)
>>
>>I'm looking for the medium of a set of words, i.e. the word which
>>minimizes the sum of the distances from those words.
>>
>>Aka: sum([medium-word for word in words])
>>
>>
>>Thanks for ideas, Paolino
>>
>
> Just wondering if this is a desired result:
>
>  >>> clusterable('banana')-clusterable('bananana')
>  0

Yes, the clustering is the main filter; it's good (I hope) for cutting
the space of words down by one or two orders of magnitude.
The final choice must be made with the expensive Levenshtein distance,
or another edit-type distance.

Now I'm using an empirical solution where I suppose the best set has
length L equal to the average of the lengths. Then I choose the L most
frequent 2-grams from the frequency distribution of 2-grams.

I have no idea whether this is the right set, and I'm sure that the set
is not a word, as there is no way to chain those 2-grams to form a word.
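
# A minimal sketch of the empirical solution described above, not the
# actual code. Assumptions: Python 3; the helper names grams, distance,
# heuristic_set and total_distance are hypothetical; "average of the
# lengths" is taken to mean the rounded mean size of the 2-gram sets;
# ties in the frequency count are broken arbitrarily.
from collections import Counter

def grams(word):
    # circular 2-grams of a word, as in clusterable.__abs__
    return {(word + word[0])[n:n + 2] for n in range(len(word))}

def distance(set1, set2):
    # same value as clusterable.__sub__
    return len(set1 | set2) - len(set1 & set2)

def heuristic_set(words):
    # L = mean size of the 2-gram sets, then the L most frequent 2-grams
    sets = [grams(w) for w in words]
    size = round(sum(len(s) for s in sets) / len(sets))
    freq = Counter(g for s in sets for g in s)
    return {g for g, _ in freq.most_common(size)}

def total_distance(candidate, words):
    # the quantity to minimize: sum of distances from all the words
    return sum(distance(candidate, grams(w)) for w in words)

# Usage: total_distance(heuristic_set(words), words) gives the score of
# the chosen set, to compare against the score of any real word.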