[Spambayes] Mail classifiers, training sets and technical docs

Tim Peters tim.one at comcast.net
Sat Dec 28 14:42:50 EST 2002


[Niek Bergboer]
> Like many others, I suffer from Spam, and while surfing the web I came
> across your Bayesian mail classifier. However, since I am also doing my
> PhD research in the field of machine learning, I am especially
> interested. Specifically, I am using machine learning techniques
> (including classifiers) on images, but the application to email seems
> interesting as well.
>
> First off, I tried to find some in-depth technical documentation about
> your system, but I was unable to find it. Could you direct me to any
> literature references or papers on which the work is based?

There are extensive comments in the source code, and some articles about
this project will appear in Linux Journal "soon" (anyone know exactly
when?).  In the meantime, there are links to follow at

    http://spambayes.sourceforge.net/background.html

The links to Paul Graham's and Gary Robinson's articles are essential
reading.

> Being involved in machine learning, there are of course a number of
> "standard" questions that immediately pop up:
>
> Does the SpamBayes framework use any training before it gets shipped to
> the user?

This project hasn't had an alpha release yet, and initial training remains a
bit of a mystery.  Note that this approach isn't trying to "find spam" --
it's trying to separate spam from ham, based on samples of both.  The great
strength of the system is that what constitutes ham varies by individual,
and it isn't generally possible to guess that (e.g., I've got no use for
frequent-flyer solicitations, but my boss does -- they're all spam to me).

> That is, does the user start out with a completely "dumb" system for
> which he has to provide _every_ single spam/ham example,

Doubt it.  Any way at all of training has appeared to work very well, be
that feeding it every email you get, or just feeding it mistakes, or even
letting it train on its own decisions, correcting only the most egregious
errors.  That's for an individual's email.  Attempts to train a single
classifier for use by more than one user don't work as well, unless the user
group has a lot in common.  For example, I've gotten superb results on
training a classifier for comp.lang.python postings, with error rates (of
both kinds) so close to 0 that the difference can't be measured reliably
across my 34,000 c.l.py test messages (20K ham and 14K spam).
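If it helps to see the shape of the "train only on mistakes" regimen in
code, here's a minimal sketch.  classify() and train() are made-up
stand-ins for illustration, not the project's actual API:

    # Minimal sketch of the "train only on mistakes" regimen.
    # classify() and train() are hypothetical stand-ins, not the
    # project's actual API.
    def train_on_mistakes(messages, classify, train):
        """messages is an iterable of (text, is_spam) pairs, where
        is_spam is the human judgment; classify() returns the
        classifier's current guess; train() updates the classifier.
        """
        mistakes = 0
        for text, is_spam in messages:
            if classify(text) != is_spam:   # only correct the errors
                train(text, is_spam)
                mistakes += 1
        return mistakes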

> or does the system come with a "basic training set" so that the
> system has some classification capabilities even before the user has
> specified any examples.

The system doesn't come with anything yet.  If you install it and try it, I
predict you'll see good results within 24 hours of starting, and excellent
results within a week (on your own email, and provided you take just a
little care in training).

> If so, what kind of training set do you use?

Most people use their personal email.  As above, I started the project with
a random sampling of newsgroup postings, and a spam archive available over
the web.

> How large is it,

People have tried training sets ranging from 1 message to over 50,000.

> and what is the dimension of your feature space?

Essentially unbounded.  The raw text of the message body is broken by
whitespace, and each resulting piece is "a feature".  Many other kinds of
tokens are generated for header lines, embedded URLs, etc.
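Roughly, in code -- this is only a sketch of the idea; the real tokenizer
in the source does much more, and the header tagging below is invented for
illustration:

    # Sketch of the tokenization idea:  every whitespace-separated
    # piece of the body is a feature, plus synthetic tokens for a
    # few headers.  The header tagging here is invented for
    # illustration; the real tokenizer does much more.
    import email

    def tokenize(raw_message):
        msg = email.message_from_string(raw_message)
        for header in ("subject", "from", "to"):
            for word in (msg.get(header) or "").split():
                # Tag header words so they can't collide with body words.
                yield "%s:%s" % (header, word.lower())
        body = msg.get_payload()
        if isinstance(body, str):           # skip multipart for brevity
            for word in body.split():
                yield word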

> And do you plan to make a large training set available?

There are many public spam archives available, so no on that count.  It
works better if people use their own spam anyway (for example, that's the
only way to pick up header clues unique to their ISP).  We can't supply a
large training set of ham because what constitutes ham is specific to the
user.  It "would be nice" to seed a database with some set of msgs everyone
would agree are ham, but that's surprisingly difficult to arrive at.

> In addition, I was wondering about the kind of classifiers that could
> be used. It seems to me that SpamBayes basically is a binary classifier:
> mail has to be classified as either ham ("1") or spam ("0") (or vice
> versa, if you like that better).

We generate a score from 0.0 (ham) to 1.0 (spam).  One thing we've found is
that it's important to have an Unsure category too:  some messages are
highly ambiguous, scoring high for both haminess and spaminess, or scoring
low for both.  This system is lost then, and all evidence to date suggests
that all other systems are lost on such msgs too (indeed, we've often
argued about such examples on this mailing list, sometimes doing a
ridiculous amount of research to figure out whether a msg in question was
ham or spam).  The lovely thing is that this system is very good about
*knowing* when it's lost, and such msgs really need human judgment.
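In rough Python terms, the three-way decision looks like this -- the
cutoff values below are only illustrative; in the real system they're
options the user can tune:

    # Three-way decision on the combined score.  The cutoffs here
    # are only illustrative; in the real system they're tunable.
    HAM_CUTOFF = 0.20
    SPAM_CUTOFF = 0.90

    def categorize(score):
        """Map a score in [0.0, 1.0] to 'Ham', 'Unsure', or 'Spam'."""
        if score < HAM_CUTOFF:
            return "Ham"
        if score >= SPAM_CUTOFF:
            return "Spam"
        return "Unsure"     # too ambiguous -- needs human judgment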

> In addition to (naive) Bayesian classifiers,

Note that this system really has nothing in common with Bayesian
classifiers.  It got that name from Paul Graham's original essay, and if you
follow the links you'll figure out why it got that name, why that name was
dubious, and why this variation moved ever farther away from it.

> there of course exist more. For example, a classifier that has been
> around for a while, but has only just begun to be viable (due to
> new training techniques) is the Support Vector Machine (SVM). In its
> basic form, this is a binary classifier (though multi-class problems
> can be handled as well nowadays). Theoretically, one could of course
> also use a very simple and crude (k-) Nearest Neighbor classifier,
> though one would need a large training set for this to work well.

AFAIK, nobody on this mailing list has pursued those alternatives.

> Based on which criteria was the choice for using a Bayesian classifier
> made?

Purely on results (although, again, this isn't a Bayesian classifier).  The
results on my initial c.l.py test data, and on my own email, are so good
that I see no way they can be improved.  It does a better job than I can do,
and in the very rare cases it makes a mistake, I haven't been able to
conceive of a way that any system could do better:  such msgs seem
intractable.  For example, one of the three false positives (out of 20K ham)
in my c.l.py test is a quote of an entire Nigerian scam spam, prefaced by a
one-line comment essentially saying "hey, this is spam".  That it was a
comment added by a real person makes the msg formally ham, and I realize
that because I've got "real world knowledge" about what the words mean.
Statistically, though, it's indistinguishable from Nigerian scam spam.  If
the comment had been made by a frequent c.l.py poster, that would have been
enough to knock the msg into the Unsure category.  But the poster is unique
in the c.l.py data, so the msg had almost no redeeming features (it *did*
have a few mild ham clues in the headers, but that's all).

OTOH, the system doesn't work so well for other kinds of "many users" apps.
Tech mailing lists have some kind of focus, and commercial advertising on
them that isn't specific to the list topic is *always* spam.  Individuals'
own email contains many kinds of solicited commercial email, and if a common
classifier has to be trained to accept Expedia email for me, it's going to
have a hard time blocking Hotel Discount Card spam for you.  Such seemingly
fine distinctions don't appear to be a problem for a one-user classifier,
although the first time or two I get marketing collateral from a company I
do business with, it usually scores as Unsure.

> I wish you the best of luck and success with the project. Good work!

Thanks, Niek.  You're welcome to use it too, you know <wink>.
