[Spambayes] Mail classifiers, training sets and technical docs

Niek Bergboer n.bergboer at cs.unimaas.nl
Sat Dec 28 19:13:02 EST 2002


Hello,

Like many others, I suffer from spam, and while surfing the web I came
across your Bayesian mail classifier. Since I am also doing my PhD
research in the field of machine learning, I am especially interested.
Specifically, I apply machine learning techniques (including
classifiers) to images, but the application to email seems interesting
as well.

First off, I tried to find some in-depth technical documentation about
your system, but I was unable to find any. Could you direct me to any
literature references or papers on which the work is based?

Being involved in machine learning, there are of course a number of
"standard" questions that immediately pop up:

Does the SpamBayes framework undergo any training before it gets shipped
to the user? That is, does the user start out with a completely "dumb"
system for which he has to provide _every_ single spam/ham example, or
does the system come with a "basic training set" so that it has some
classification capabilities even before the user has specified any
examples?

If it does come with one, what kind of training set do you use? How
large is it, and what is the dimension of your feature space? And do you
plan to make a large training set available?

In addition, I was wondering about the kind of classifiers that could
be used. It seems to me that SpamBayes basically is a binary classifier:
mail has to be classified as either ham ("1") or spam ("0") (or vice
versa, if you like that better). Besides (naive) Bayesian classifiers,
there are of course other options. For example, a classifier that has
been around for a while, but has only recently become viable (due to
new training techniques), is the Support Vector Machine (SVM). In its
basic form, this is a binary classifier (though multi-class problems can
be handled as well nowadays). Theoretically, one could of course also
use a very simple and crude (k-) Nearest Neighbor classifier, though one
would need a large training set for this to work well.
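
To make the kind of binary classifier I have in mind concrete, here is a
minimal sketch of a naive Bayes spam/ham classifier in Python. It rests
on my own assumptions (whitespace tokenization, add-one smoothing, a toy
training set, and the hypothetical names train and classify) and is not
meant to describe how SpamBayes itself works.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def train(messages):
    """Count tokens per class from (text, label) pairs (illustrative helper)."""
    counts = {label: Counter() for label in LABELS}
    totals = Counter()
    for text, label in messages:
        totals[label] += 1
        counts[label].update(text.lower().split())
    return counts, totals

def classify(text, counts, totals):
    """Return the label with the highest log-posterior under naive Bayes."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    n_messages = sum(totals.values())
    scores = {}
    for label in LABELS:
        n_tokens = sum(counts[label].values())
        # Log prior: fraction of training messages carrying this label.
        score = math.log(totals[label] / n_messages)
        for tok in text.lower().split():
            # Add-one (Laplace) smoothing so unseen tokens do not zero out a class.
            score += math.log((counts[label][tok] + 1) / (n_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set, invented purely for illustration.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday ?", "ham"),
]
counts, totals = train(training)
print(classify("buy cheap pills", counts, totals))
print(classify("monday meeting", counts, totals))
```

In this sketch the dimension of the feature space is simply the
vocabulary size, which is one reason I am curious about the size of a
realistic training set.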

On which criteria was the choice of a Bayesian classifier based?

I wish you the best of luck and success with the project. Good work!

TIA,

Niek

-- 
N.H. Bergboer                               - n.bergboer at cs.unimaas.nl
University of Maastricht                    - +31-43-3883901
Institute of Knowledge and Agent Technology 

"I have been asked, 'Pray, Mr. Babbage, if you put into the machine 
 wrong figures, will the right answers come out?' I am not able to 
 rightly apprehend the kind of confusion of ideas that could provoke 
 such a question."                                 Charles Babbage
  


