[Spambayes] an alternative use of filters

Wed Dec 18 17:04:39 EST 2002

[Eric S. Johansson]
> I'm working on another antispam project (camram) which converts
> e-mail from a receiver pays (traditional with or without filters) to a
> sender pays system in which proof of work postage stamps are the
> go/nogo test.
>
> Unstamped mail that isn't from a white listed address generates a
> postage due notice.  Needless to say, that can generate a lot of postage
> due notices.  In an attempt to reduce the number of postage due notices,
> I'm interested in using Spam filters to categorize the mail into three
> buckets; clearly Spam, clearly not Spam, and can't tell.  Only the can't
> tell messages will get postage due notices.
>
> So, will your filter give me this discrimination capability?

Three-way classification is the intended use of the spambayes classifier.  A
msg gets a score from 0.0 (ham) to 1.0 (spam) and there are two configurable
cutoffs:  msgs with a score below ham_cutoff are called Ham, above
spam_cutoff Spam, and any score between those Unsure.
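
The three-way split described above can be sketched in a few lines of
Python.  This is only an illustration, not the spambayes API: the function
name `classify` and the default cutoff values are assumptions chosen for the
example, and treating a score exactly equal to a cutoff as Unsure is also an
assumption, since the description above only says "below" and "above".

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0.0, 1.0] to "ham", "unsure" or "spam".

    Hypothetical helper for illustration; the default cutoffs are
    example values, not settings taken from the message above.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff must not exceed spam_cutoff")
    if score < ham_cutoff:
        return "ham"      # below ham_cutoff
    if score > spam_cutoff:
        return "spam"     # above spam_cutoff
    return "unsure"       # anything in between (cutoffs included, by assumption)


for s in (0.01, 0.50, 0.99):
    print(s, classify(s))
```

Moving the two cutoffs apart widens the Unsure band; moving them together
shrinks it.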

Results vary with the test set and with how carefully the classifier is
trained, but in my experience Unsures are, over time, about half spam and
half ham.  A curious and semi-encouraging thing is that they're
overwhelmingly msgs *I* can't judge at a glance either, and sometimes it's
so hard to tell I just throw the msg away as unintelligible.  I call that
"semi-"encouraging because, in
conjunction with camram, I don't believe I'd want Unsures stopped from
reaching me.  For example, a common class of Unsures is commercial HTML
email from companies I do business with; e.g., last week I got an Unsure
that was an auto-generated order receipt for an online order of a software
program.  I wanted to get the receipt, but the email was very spammish, full
of ads and links for follow-on offers, and other marketing collateral.  I
doubt reply email would be seen by a human, so a postage-due scheme probably
would have dropped it into the bit bucket on both ends.

The outstanding feature of the kind of classifier we're using is that it
adjusts to an individual's notions of what constitutes ham and spam, so this
kind of mistake is less frequent here than under other systems (for example,
the order receipt mentioned above wasn't called spam, because the system
had learned that I ordered similar software in the past; but the email
*would* have been called spam if most other people had received it).  But
the error rates are, while very low for individual use, still non-zero, and
I expect they always will be.

So, if you try this, I suggest setting ham_cutoff very low (below 0.05), and
spam_cutoff very high (over 0.95).  The median ham score is essentially 0,
and the median spam score is essentially 1.0, so, while aggressive, this
isn't quite as extreme as it may sound at first.  The problem I expect
remains, though:  solicited commercial email, and especially the first few
times a user gets one from a given vendor, will end up Unsure, and there may
not be anyone on the other end to respond to a postage nag.
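
As a rough sketch of how these cutoffs might sit in front of a postage-due
scheme, the routing could look something like the following.  Everything
here is hypothetical: the `Message` fields, the `route` function, the
specific cutoffs 0.02 and 0.98 (picked only because they fall below 0.05 and
above 0.95), and the choice to discard clear spam are all assumptions for
illustration, not camram's or spambayes' actual behavior.

```python
from dataclasses import dataclass

# Illustrative cutoffs only: "very low" and "very high" as suggested above.
HAM_CUTOFF = 0.02
SPAM_CUTOFF = 0.98


@dataclass
class Message:
    sender: str    # envelope sender address
    stamped: bool  # True if the msg carries a valid proof-of-work stamp
    score: float   # classifier score, 0.0 (ham) to 1.0 (spam)


def route(msg, whitelist):
    """Return "deliver", "discard" or "postage_due" for one message."""
    # Stamped or whitelisted mail never needs a notice.
    if msg.stamped or msg.sender in whitelist:
        return "deliver"
    if msg.score < HAM_CUTOFF:
        return "deliver"       # clearly ham
    if msg.score > SPAM_CUTOFF:
        return "discard"       # clearly spam (assumed handling)
    return "postage_due"       # Unsure: the only case that sends a notice


whitelist = {"friend@example.com"}
print(route(Message("vendor@example.com", False, 0.60), whitelist))
```

Note that a first-time order receipt from a new vendor would most likely
take the last branch, which is exactly the case where nobody may be reading
the reply address.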