Clustering text-documents in bundles

Paul Rubin
Tue Sep 25 19:52:57 CEST 2007

"exhuma.twn" <exhuma at> writes:
> Is it possible to calculate a distance between two chunks of text? I
> suppose one could simply do a simple word-count on the chunks
> (removing common noise words of course). And then go from there. Maybe
> even assigning different weighting to words. But maybe there is a well-
> tested and useful algorithm already available?

There's a huge field of text mining that attempts to do things like
this.  See <> for some info about one approach.  Manning & Schütze's
book "Foundations of Statistical Natural Language Processing" (<>) is
a standard reference about text processing.  They also have a
new one about information retrieval (downloadable as a pdf) that
looks very good: <>.
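
For the simple word-count idea in the question, here is a minimal sketch
(not taken from either book): count the words in each chunk, drop noise
words, and take one minus the cosine of the angle between the two count
vectors.  The names NOISE_WORDS, word_counts and cosine_distance are made
up for this example, the stop-word list is a tiny placeholder, and
treating empty chunks as maximally distant is just a convention chosen
here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
NOISE_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "it"}


def word_counts(text):
    """Lowercase the text, split it into words, and count the non-noise ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in NOISE_WORDS)


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two word-count vectors.

    0.0 means the chunks use the same mix of words; 1.0 means they share
    no (non-noise) words at all.
    """
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words present in both chunks contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here: an empty chunk is maximally distant
    # Clamp at 0.0 so floating-point rounding never yields a negative distance.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


if __name__ == "__main__":
    print(cosine_distance("the cat sat on the mat", "a cat sat on a hat"))
    print(cosine_distance("python clustering", "spam and eggs"))
```

The "different weighting" part of the question would slot in where the raw
counts are used: multiply each count by a per-word weight before taking
the dot product and norms.  tf-idf is one common choice of weight, though
that is a suggestion here, not something the post above spells out.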
