[Spambayes] spambayes fronting a mailing list?
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Thu Jan 16 17:44:02 EST 2003
This is a great discussion, one I think we should include on the main site.
This is obviously an advantage that we (this algorithm) have... you can hardly
go wrong! In my installation, I have 156 spam and 223 ham, and it almost
never makes a classification mistake. Unsures are almost always ham, spam is
DOA. It hasn't improved (there is precious little *room* for improvement)
since a few days after I started this database. In fact, I'm a bit reluctant
to reup, it's working so well <wink>
Now my mail is a bit unique in that I get mostly machine-driven event
notification mails, which are VERY similar... There are probably 5 different
email content patterns/sources that make up 90% of my mail (e.g. "Order
Received", "Mail List Opt-In", "Spambayes", etc.). But even the unique stuff
is nailed as ham almost all the time.
Perhaps we can document a few training patterns: mistake-driven,
classification-driven, random-sample-driven, <more?>, and allow users to
select which type of training pattern they want to use. The user interface,
then, might present only the messages that are pertinent to that type of
training regimen. For example, the pop3proxy right now presents every message
it receives in buckets by classification. If I'm doing classification-driven
training, I wouldn't need to look at every spam that comes in... Oh, I don't
know, I'm rambling now... - TimS
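To make the idea of selectable training regimens a little more concrete, here
is a rough sketch of how a front end might decide which messages to train on.
Everything in it is hypothetical: the function name, the regimen names, the
(msg_id, predicted, actual) tuple layout, and the reading of "classification
driven" as "train on the classifier's own confident decisions" are assumptions
for illustration, not anything pop3proxy does today.

```python
import random


def select_for_training(messages, regimen, sample_rate=0.1, seed=0):
    """Pick messages to train on under a given (hypothetical) regimen.

    messages: list of (msg_id, predicted, actual) tuples, where predicted is
    "ham", "spam" or "unsure" and actual is the user's verdict, "ham" or "spam".
    Returns a list of (msg_id, label_to_train_as) pairs.
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    chosen = []
    for msg_id, predicted, actual in messages:
        if regimen == "mistake":
            # Train only where the classifier was wrong or unsure.
            if predicted != actual:
                chosen.append((msg_id, actual))
        elif regimen == "classification":
            # Self-training: trust the classifier's own confident decisions.
            if predicted in ("ham", "spam"):
                chosen.append((msg_id, predicted))
        elif regimen == "random":
            # Train on a random sample, using the user's verdict as the label.
            if rng.random() < sample_rate:
                chosen.append((msg_id, actual))
        else:
            raise ValueError("unknown regimen: %r" % (regimen,))
    return chosen


inbox = [
    (1, "spam", "spam"),
    (2, "unsure", "ham"),
    (3, "ham", "spam"),
    (4, "ham", "ham"),
]
mistakes = select_for_training(inbox, "mistake")
self_trained = select_for_training(inbox, "classification")
```

Under the "mistake" regimen only messages 2 and 3 would need the user's
attention, which is the kind of reduced review load described above. Note that
the "classification" branch would happily train message 3 as ham, so a real
interface would still need some way to catch such errors.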
1/16/2003 3:41:37 PM, Rob Hooft <rob at hooft.net> wrote:
>Tim Stone - Four Stones Expressions wrote:
>> I think I'm hearing something on this thread that doesn't make much sense to
>> me. If we always train as spam stuff that's been classified as spam, and
>> train as ham stuff that's been classified as ham, then we're kinda
>> the obvious, and increasing the spaminess of words in that spam... isn't it
>> more realistic (and ultimately actually better) to train on a random sample
>> rather than always? - TimS
>Testing results failed to find any way of training that didn't work well,
>ranging from purely mistake-based training, to letting a classifier
>self-train on its own decisions. My real-life experience on my own email is
>that pure mistake-based training is unsatisfactory in practice because it
>keeps the Unsure rate higher longer than need be (also shown in formal
>tests), and especially because the *kinds* of spam that remained Unsure were
>maddeningly "obvious" spam (something I don't know how to test formally).
>OTOH, in real life now I started with a few hundred random msgs, and since
>then have done *almost* purely mistake-based training. This may not be
>optimal (and I believe it is not), but leaves so little manual
>classification for me to do that I don't care. When error rates get below
>1%, the difference between, say, 0.5% and 0.2% is more than a factor of two,
>but isn't actually noticeable unless you've got many thousands of msgs to
>dig thru. This *is* the case for the mailing list run via
>comp.lang.python's news<->mail gateway, and more-careful training there may
>more than repay the cost. But most Mailman lists have much lower volume,
>and "excellent" results with little training effort may be more attractive
>to list admins than "superb" results requiring substantially more training
>Nope, the mathematics say this isn't true. Say by the word "Sex" you
>recognize a new message as being spam. This message may be the first
>that contains the word "oral", so training on this makes it a spammy
>word. The word "sex" becomes more spammy. And the word "ink-cartridge"
>that does not appear in this message becomes a little less spammy.
>In other words: training on a new spam doesn't only make the tokens in
>it more spammy, but also makes the spammy tokens that do not occur in
>there less spammy.
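The effect Rob describes above can be checked with a toy calculation. The
sketch below assumes a simplified per-token estimate, the fraction of spams
containing the token divided by the sum of that fraction and the corresponding
ham fraction. This is an assumption for illustration and not necessarily the
exact formula Spambayes uses; the class name and the training data are made up.

```python
from collections import Counter


class TinyCounts:
    """Toy token statistics; a simplified stand-in, not the Spambayes classifier."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam = Counter()  # number of spams containing each token
        self.ham = Counter()   # number of hams containing each token

    def train(self, tokens, is_spam):
        # Count each token at most once per message.
        if is_spam:
            self.nspam += 1
            self.spam.update(set(tokens))
        else:
            self.nham += 1
            self.ham.update(set(tokens))

    def spamprob(self, token):
        spam_ratio = self.spam[token] / self.nspam if self.nspam else 0.0
        ham_ratio = self.ham[token] / self.nham if self.nham else 0.0
        if spam_ratio + ham_ratio == 0:
            return 0.5  # never-seen token: treated as neutral (an assumption)
        return spam_ratio / (spam_ratio + ham_ratio)


db = TinyCounts()
for _ in range(5):
    db.train(["ink-cartridge"], is_spam=True)
for _ in range(4):
    db.train(["sex"], is_spam=True)
db.train(["other"], is_spam=True)
db.train(["ink-cartridge", "sex"], is_spam=False)
for _ in range(9):
    db.train(["hello"], is_spam=False)

before = {t: db.spamprob(t) for t in ("sex", "oral", "ink-cartridge")}
db.train(["sex", "oral"], is_spam=True)  # one new spam arrives
after = {t: db.spamprob(t) for t in ("sex", "oral", "ink-cartridge")}

for token in before:
    print(token, round(before[token], 3), "->", round(after[token], 3))
```

In this toy model "sex" and "oral" move up, while "ink-cartridge", which does
not appear in the new spam at all, moves down slightly because the total spam
count in its denominator grew.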
>Then there are words that occur both in ham and in spam messages. There
>it is important to get the right "balance". If you train only on
>"non-obvious" cases, this will almost certainly result in an imbalance.
>All of this determines, like Tim1 explained, only the difference between
>excellent and superb separation of classes.
>Rob W.W. Hooft || rob at hooft.net || http://www.hooft.net/people/rob/
c'est moi - TimS