Clustering text-documents in bundles

Paul Hankin paul.hankin at gmail.com
Tue Sep 25 11:39:00 EDT 2007


On Sep 25, 4:11 pm, "exhuma.twn" <exh... at gmail.com> wrote:
> Is it possible to calculate a distance between two chunks of text? I
> suppose one could simply do a simple word-count on the chunks
> (removing common noise words of course). And then go from there. Maybe
> even assigning different weighting to words. But maybe there is a well-
> tested and useful algorithm already available?

A good distance between two chunks of text is the number of changes
you have to make to one to transform it into the other. Take a look
at 'difflib', with which you should be able to code up this sort of
distance (although the details will depend on what your text looks
like, e.g. whether you compare it by character, word, or line).
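
As a rough sketch of that idea, something like the following might do
as a starting point. It assumes the chunks are compared word by word
and case-insensitively; the names text_distance and edit_count are made
up for illustration and are not part of difflib. Note that difflib
looks for long matching blocks rather than a provably minimal set of
edits, so the count is an approximation and not a true Levenshtein
distance.

```python
import difflib


def _words(text):
    # Assumption: lower-cased, whitespace-separated words are the unit of comparison.
    return text.lower().split()


def text_distance(a, b):
    """Return a distance in [0, 1]: 0.0 for identical word sequences,
    1.0 when the two chunks share no words at all."""
    matcher = difflib.SequenceMatcher(None, _words(a), _words(b), autojunk=False)
    return 1.0 - matcher.ratio()


def edit_count(a, b):
    """Approximate how many words must be inserted, deleted or replaced
    to turn chunk a into chunk b, based on difflib's opcodes."""
    matcher = difflib.SequenceMatcher(None, _words(a), _words(b), autojunk=False)
    changes = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A 'replace' of unequal lengths costs the longer side.
            changes += max(i2 - i1, j2 - j1)
    return changes


if __name__ == "__main__":
    first = "the cat sat on the mat"
    second = "the dog sat on the mat"
    print(text_distance(first, second))
    print(edit_count(first, second))
```

Removing noise words before splitting, or comparing line by line
instead of word by word, would slot into _words without changing the
rest.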

--
Paul Hankin



