[Spambayes] Mail classifiers, training sets and technical docs
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Sat Dec 28 13:24:36 EST 2002
There are people who are much more qualified to answer your questions than I
am, but almost everyone on this project is MIA during the holidays, so I'll
give it a go. Undoubtedly someone will add to my remarks below.
- TimS
12/28/2002 12:13:02 PM, Niek Bergboer <n.bergboer at cs.unimaas.nl> wrote:
>Hello,
>
>Like many others, I suffer from Spam, and while surfing the web I came
>across your Bayesian mail classifier. However, since I am also doing my
>PhD research in the field of machine learning, I am especially
>interested. Specifically, I am using machine learning techniques
>(including classifiers) on images, but the application to email seems
>interesting as well.
>
>First off, I tried to find some in-depth technical documentation about
>your system, but I was unable to find it. Could you direct me to any
>literature references or papers on which the work is based?
There's not much documentation yet. You can see some in-code commentary in
classifier.py and tokenizer.py. Other than that, there are a few how-to and
readme type documents in the project. That's about all at this point, unless
there's some other doc that isn't checked in. This is all based on Paul
Graham's spam article, "A Plan for Spam", which can be found by searching on
the three words "Paul Graham spam".
>
>Being involved in machine learning, there are of course a number of
>"standard" questions that immediately pop up:
>
>Does the SpamBayes framework use any training before it gets shipped to
>the user? That is, does the user start out with a completely
>"dumb" system for which he has to provide _every_ single spam/ham
>example, or does the system come with a "basic training set" so that the
>system has some classification capabilities even before the user has
>specified any examples.
This is under discussion, but the general feeling is that there's no
definition of spam that works for everyone. One man's spam is another man's
highly desirable mail...
>
>If so, what kind of training set do you use? How large is it, and what
>is the dimension of your feature space? And do you plan to make a large
>training set available?
There are training corpora that are used basically for testing the effects of
changes made in the tokenization and classification algorithms. These are not
generally available. However, it's not difficult to accumulate your own set
of spam and ham for training <wink>.
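
A minimal sketch of what training on your own ham and spam could look like,
in the spirit of Paul Graham's article. This is not the actual SpamBayes
code: the whitespace tokenizer, the function names, the 0.5 score for unseen
tokens and the 0.01/0.99 clamps are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, dedupe."""
    return set(text.lower().split())


def train(ham_messages, spam_messages):
    """Count, per token, how many ham and spam messages contain it."""
    ham_counts, spam_counts = Counter(), Counter()
    for msg in ham_messages:
        ham_counts.update(tokenize(msg))
    for msg in spam_messages:
        spam_counts.update(tokenize(msg))
    return ham_counts, spam_counts


def token_spam_prob(token, ham_counts, spam_counts, n_ham, n_spam):
    """Estimate how spammy one token is, given the trained counts.

    n_ham and n_spam are the numbers of training messages per class
    and must both be greater than zero.
    """
    ham_ratio = ham_counts[token] / n_ham
    spam_ratio = spam_counts[token] / n_spam
    if ham_ratio + spam_ratio == 0:
        return 0.5  # assumed neutral score for a token never seen
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Clamp so a single token can never be treated as absolute proof.
    return min(0.99, max(0.01, p))


ham = ["meeting at noon", "lunch at noon"]
spam = ["cheap pills now", "cheap loans now"]
ham_counts, spam_counts = train(ham, spam)
cheap = token_spam_prob("cheap", ham_counts, spam_counts, len(ham), len(spam))
noon = token_spam_prob("noon", ham_counts, spam_counts, len(ham), len(spam))
unseen = token_spam_prob("python", ham_counts, spam_counts, len(ham), len(spam))
```

With this toy data, "cheap" only ever shows up in spam and "noon" only in
ham, so they land at the two clamps, while a token seen in neither set stays
neutral.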
>
>In addition, I was wondering about the kind of classifiers that could
>be used. It seems to me that SpamBayes basically is a binary classifier:
>mail has to be classified as either ham ("1") or spam ("0") (or vice
>versa, if you like that better).
This is not quite true. Incoming mail is given a spam probability based upon
your own training of the database and the tokens that are in the incoming
mail. There are default probability thresholds for spam and ham, which you
can configure to be tighter or looser as you wish.
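
A rough sketch of scoring against two thresholds. The combining formula is
the one from Paul Graham's article; the cutoff values, the function names and
the "unsure" label for the middle band are assumptions for illustration, not
taken from the SpamBayes source.

```python
import math


def combine(token_probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p)).

    Assumes each probability lies strictly between 0 and 1.
    """
    spam_evidence = math.prod(token_probs)
    ham_evidence = math.prod(1.0 - p for p in token_probs)
    return spam_evidence / (spam_evidence + ham_evidence)


def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability to a label using two configurable cutoffs."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"  # assumed name for scores between the two cutoffs


score = combine([0.99, 0.99, 0.01])
label = classify(score)
```

Raising spam_cutoff or lowering ham_cutoff widens the middle band, which is
why the result is not strictly a two-way decision.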
> In addition to (naive) Bayesian
>classifiers, there of course exist more. For example, a classifier that
>has been around for a while, but has only just begun to be viable (due to
>new training techniques) is the Support Vector Machine (SVM). In its
>basic form, this is a binary classifier (though multi-class problems can
>be handled as well nowadays). Theoretically, one could of course also
>use a very simple and crude (k-) Nearest Neighbor classifier, though one
>would need a large training set for this to work well.
No comment here, due to my limited qual... ;)
>
>Based on which criteria was the choice for using a Bayesian classifier
>made?
This started out as a research project to test the validity of Paul's
assertions. As such, it is highly successful. Paul proposed a Bayesian
classification. His rationale for that choice was not a subject of this
research.
>
>I wish you the best of luck and success with the project. Good work!
Thanks! I have Spambayes running on my Windoze system, with a standard off
the shelf mailer, and it works beautifully. I'm loving it... now if I could
just get my employer to use this technology...
Thanks for your questions, and please drop in often! - TimS
>
>TIA,
>
>Niek
>
>--
>N.H. Bergboer - n.bergboer at cs.unimaas.nl
>University of Maastricht - +31-43-3883901
>Institute of Knowledge and Agent Technology
>
>"I have been asked, 'Pray, Mr. Babbage, if you put into the machine
> wrong figures, will the right answers come out?' I am not able to
> rightly apprehend the kind of confusion of ideas that could provoke
> such a question." Charles Babbage
c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org