[Spambayes] spambayes fronting a mailing list?

Tim Peters tim.one at comcast.net
Thu Jan 16 13:55:48 EST 2003


[Tim Stone - Four Stones Expressions]
> I think I'm hearing something on this thread that doesn't make
> much sense to me.  If we always train as spam stuff that's been
> classified as spam, always train as ham stuff that's been
> classified as ham, then we're kinda reinforcing the obvious, and
> increasing the spaminess of words in that spam... isn't it
> more realistic (and ultimately actually better) to train on a
> random sample rather than always?  - TimS

Testing failed to find any way of training that didn't work well, ranging
from purely mistake-based training to letting a classifier self-train on
its own decisions.  My real-life experience on my own email is that pure
mistake-based training is unsatisfactory in practice because it keeps the
Unsure rate higher for longer than need be (formal tests showed this too),
and especially because the *kinds* of spam that remained Unsure were
maddeningly "obvious" spam (something I don't know how to test formally).
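
To make the training regimes above concrete, here is a minimal sketch.
ToyClassifier is a hypothetical stand-in written only for this
illustration; it is not the Spambayes API, and its scoring rule (a plain
average of per-word spam fractions) and its cutoffs are assumptions chosen
for brevity.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical word-count classifier; NOT the Spambayes API."""

    def __init__(self, ham_cutoff=0.2, spam_cutoff=0.9):
        # Cutoff values are illustrative assumptions.
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.ham_cutoff = ham_cutoff
        self.spam_cutoff = spam_cutoff

    def train(self, words, is_spam):
        # Count each distinct word once per message.
        target = self.spam_words if is_spam else self.ham_words
        target.update(set(words))

    def score(self, words):
        # Average per-word spam fraction; unseen words count as 0.5.
        probs = []
        for w in set(words):
            s, h = self.spam_words[w], self.ham_words[w]
            probs.append(0.5 if s + h == 0 else s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5

    def classify(self, words):
        p = self.score(words)
        if p >= self.spam_cutoff:
            return "spam"
        if p <= self.ham_cutoff:
            return "ham"
        return "unsure"


def train_on_mistakes(clf, stream):
    """Train only on msgs the classifier got wrong or called Unsure."""
    trained = 0
    for words, is_spam in stream:
        if clf.classify(words) != ("spam" if is_spam else "ham"):
            clf.train(words, is_spam)
            trained += 1
    return trained


def train_on_everything(clf, stream):
    """Train on every msg, using its true label."""
    trained = 0
    for words, is_spam in stream:
        clf.train(words, is_spam)
        trained += 1
    return trained


def self_train(clf, stream):
    """Train on the classifier's own verdicts, skipping Unsure msgs."""
    trained = 0
    for words, _true_label in stream:
        verdict = clf.classify(words)
        if verdict != "unsure":
            clf.train(words, verdict == "spam")
            trained += 1
    return trained


# Tiny made-up stream of (words, is_spam) pairs.
stream = [
    (["cheap", "pills"], True),
    (["lunch", "meeting"], False),
    (["cheap", "pills"], True),
    (["lunch", "meeting"], False),
]
```

In this toy, mistake-based training stops touching a msg type as soon as it
scores past a cutoff, and self-training does nothing until the classifier
has been seeded some other way, since an untrained classifier calls
everything Unsure.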

OTOH, in real life now I started with a few hundred random msgs, and since
then have done *almost* purely mistake-based training.  This may not be
optimal (I believe it isn't), but it leaves so little manual
classification for me to do that I don't care.  When error rates get below
1%, the difference between, say, 0.5% and 0.2% is more than a factor of two,
but isn't actually noticeable unless you've got many thousands of msgs to
dig thru.  This *is* the case for the mailing list run via
comp.lang.python's news<->mail gateway, and more-careful training there may
more than repay the cost.  But most Mailman lists have much lower volume,
and "excellent" results with little training effort may be more attractive
to list admins than "superb" results requiring substantially more training
effort.
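
The arithmetic behind that point can be sketched as follows.  The two
message volumes are made-up assumptions standing in for a low-volume list
and a high-volume gateway; only the 0.5% and 0.2% rates come from the text
above.

```python
def expected_errors(n_msgs, error_rate):
    """Expected number of misclassified msgs at a given error rate."""
    return n_msgs * error_rate


# Hypothetical volumes: a quiet list vs. a busy news<->mail gateway.
for n_msgs in (200, 10_000):
    worse = expected_errors(n_msgs, 0.005)   # 0.5% error rate
    better = expected_errors(n_msgs, 0.002)  # 0.2% error rate
    print(f"{n_msgs} msgs: {worse:.1f} vs {better:.1f} errors, "
          f"a difference of {worse - better:.1f}")
```

The ratio between the two rates is the same at any volume, but the absolute
number of msgs saved only becomes large enough to notice when the volume
is high.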

The important thing now is just that Barry get off his ass and start <wink>.



