[spambayes-dev] An unrelated idea: categorization / cluster analysis of text files for FAQ generating

Tim Peters tim.one at comcast.net
Thu Sep 25 10:15:41 EDT 2003


[Anssi Porttikivi]
> Sorry to bother you, but I would like to know, if anyone here has any
> knowledge of technologies like the following idea:
>
> Could you automatically categorize a set of messages into an optimum
> number of cluster subsets, where messages inside a subset would be
> similar to each other, in bayesian filtering terms. If this could be
> done without a priori manually selecting the categories that the
> cluster subsets represent, this could be used for automated "frequently
> asked questions" list maintenance. Automatic categorization of
> incoming mail without manually choosing any criteria beforehand would
> also be interesting.

There is (of course) a large literature on cluster analysis.  Here's a very
readable intro:

    http://www.statsoftinc.com/textbook/stcluan.html

Code up one of the 50 known methods, and see which works best <wink>.
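
To make that concrete, here is a minimal sketch of just one such method:
average-linkage agglomerative clustering over bag-of-words vectors compared
by cosine similarity. It is an illustration under stated assumptions, not
anything spambayes does. The tokenizer, the function names, and the 0.3
similarity threshold are all made up for the example; the threshold plays the
role of "let the data pick the number of clusters" and would need tuning on
real mail.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude bag-of-words: lowercase word counts (an assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(messages, threshold=0.3):
    """Merge the two most similar clusters until none reach the threshold.

    Returns a list of clusters, each a list of indices into messages.
    The number of clusters is not chosen in advance; it falls out of the
    (hypothetical, untuned) similarity threshold.
    """
    vectors = [tokenize(m) for m in messages]
    clusters = [[i] for i in range(len(messages))]

    def avg_sim(c1, c2):
        # Average linkage: mean pairwise similarity across the two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (x, y) = max(
            (avg_sim(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


messages = [
    "how do I train the filter on spam",
    "how do I train the filter on ham",
    "outlook plugin crashes on startup",
    "the outlook plugin crashes at startup",
]
print(cluster(messages))  # [[0, 1], [2, 3]]
```

This brute-force version recomputes every pairwise similarity on each pass,
so it is only sensible for a small pile of messages. For FAQ-style grouping
you would likely also want to drop very common words before comparing.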



