Clustering text-documents in bundles
Tue Sep 25 19:52:57 CEST 2007
"exhuma.twn" <exhuma at gmail.com> writes:
> Is it possible to calculate a distance between two chunks of text? I
> suppose one could simply do a simple word-count on the chunks
> (removing common noise words of course). And then go from there. Maybe
> even assigning different weighting to words. But maybe there is a well-
> tested and useful algorithm already available?
There's a huge field of text mining that attempts to do things like
this. See http://en.wikipedia.org/wiki/Latent_semantic_analysis for some
info about one approach. Manning & Schütze's book "Foundations of Statistical
Natural Language Processing" (http://nlp.stanford.edu/fsnlp/) is
a standard reference about text processing. They also have a
new one about information retrieval (downloadable as a pdf) that
looks very good: <http://informationretrieval.org>.
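
For the simple word-count idea from the question, here is a minimal
sketch of one common choice: cosine distance between word-count
vectors. It is not taken from either book, just an illustration. The
stop-word list and the names word_counts and cosine_distance are
made up for this example; a real system would use a proper stop-word
list and probably some term weighting such as tf-idf.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def word_counts(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_distance(text_a, text_b):
    """Return 1 minus the cosine similarity of the two word-count vectors."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no usable words: treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

A distance near 0 means the two chunks use the same words in similar
proportions; a distance of 1 means they share no (non-noise) words.
Pairwise distances like this can then be fed to whatever clustering
method you pick.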