Clustering text-documents in bundles

Paul Rubin
Tue Sep 25 19:52:57 CEST 2007


"exhuma.twn" <exhuma at gmail.com> writes:
> Is it possible to calculate a distance between two chunks of text? I
> suppose one could simply do a simple word-count on the chunks
> (removing common noise words of course). And then go from there. Maybe
> even assigning different weighting to words. But maybe there is a well-
> tested and useful algorithm already available?

There's a huge field of text mining that attempts to do things like
this.  See http://en.wikipedia.org/wiki/Latent_semantic_analysis for some
info about one approach.  Manning & Schütze's book "Foundations of Statistical
Natural Language Processing" (http://nlp.stanford.edu/fsnlp/) is
a standard reference about text processing.  They also have a
new one about information retrieval (downloadable as a pdf) that
looks very good: <http://informationretrieval.org>.
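
As a rough illustration of the word-count idea from the question, here
is a minimal sketch that computes a cosine distance between two chunks
of text.  It is not taken from either book.  The names STOP_WORDS,
word_counts and cosine_distance are made up for this example, and the
stop-word list is a tiny placeholder, not a real one.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"})

def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two word-count vectors."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words present in both chunks contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no usable words in one chunk: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

The result is near 0 for chunks with the same word distribution and 1
for chunks with no words in common.  Weighting words differently, as
the question suggests, would amount to replacing the raw counts with
weighted values (tf-idf is one common choice) before taking the cosine.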


