Incremental training (was RE: [Spambayes] maybe a procmail question...)

Tim Peters tim.one at comcast.net
Fri May 23 00:56:27 EDT 2003


[Meyer, Tony]
> I'm surprised you didn't push Mark to get the Outlook plugin to self
> train :)  What method did you & Rob use to test this?  Is it something
> that others could easily duplicate?

It's in the archives ... somewhere <wink>.  It's dead easy in principle, and
not much harder in practice ... ah, TimS explained it more briefly than I
could (thanks, Tim!).

I think we lost interest in training strategies when real-life deployment
showed excellent results in a few days, and several long-time users stopped
paying any attention to training anymore.  My 3 DBs each have about 1000
msgs, and that's all.  I rarely train on anything anymore.  Every now and
again I blow a database away and start over, just to clear the boredom.
Every training strategy I've tried works fine, *except* for purely
mistake-based training from the very start.  In the two+ weeks I stuck to
that, I didn't get above 150 trained msgs total (against about 600 emails
per day), and the Unsures started and remained maddening (mostly blatant
spam).  That was highly hapax-driven, and hapaxes are brittle (they catch
near-duplicates of msgs you've trained on, but don't seem ever to
generalize).
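
To make "hapax-driven" concrete, here is a minimal sketch of counting
hapaxes (tokens that appear in exactly one trained message) in a tiny
training set.  Everything in it is an assumption for illustration: ToyStore,
tokenize, and the sample messages are hypothetical, and this is not the
SpamBayes classifier or tokenizer.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, and keep each
    # token once per message. The real SpamBayes tokenizer is far richer.
    return set(text.lower().split())


class ToyStore:
    """Toy per-token message counts; an illustration only."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.trained = 0

    def train(self, text, is_spam):
        # Each token counts at most once per trained message.
        counts = self.spam if is_spam else self.ham
        counts.update(tokenize(text))
        self.trained += 1

    def hapaxes(self):
        # Tokens seen in exactly one trained message, spam or ham.
        total = self.spam + self.ham
        return {tok for tok, n in total.items() if n == 1}

    def hapax_fraction(self):
        # Share of distinct tokens that are hapaxes.
        total = self.spam + self.ham
        return len(self.hapaxes()) / len(total) if total else 0.0


store = ToyStore()
store.train("cheap pills buy now", True)
store.train("meeting agenda for monday", False)
store.train("buy cheap watches now", True)
print(sorted(store.hapaxes()), store.hapax_fraction())
```

With so few trained messages, most distinct tokens are hapaxes, so a score
that leans on them can only recognize near-copies of what was trained.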

So I can have a tiny database that's more than good enough to make me very
happy, or push Mark toward training on 600 emails a day and have a gigantic
database that wouldn't make me any happier.  This wasn't a hard choice
<wink>.



