Incremental training (was RE: [Spambayes] maybe a procmail question...)
Tim Peters
tim.one at comcast.net
Fri May 23 00:56:27 EDT 2003
[Meyer, Tony]
> I'm surprised you didn't push Mark to get the Outlook plugin to self
> train :) What method did you & Rob use to test this? Is it something
> that others could easily duplicate?
It's in the archives ... somewhere <wink>. It's dead easy in principle, and
not much harder in practice ... ah, TimS explained it more briefly than I
could (thanks, Tim!).
I think we lost interest in training strategies when real-life deployment
showed excellent results in a few days, and several long-time users stopped
paying any attention to training anymore. My 3 DBs each have about 1000
msgs, and that's all. I rarely train on anything anymore. Every now and
again I blow a database away and start over, just to clear the boredom.
Every training strategy I've tried works fine, *except* for purely
mistake-based training from the very start. In the two+ weeks I stuck to
that, I didn't get above 150 trained msgs total (against about 600 emails
per day), and the Unsures started and remained maddening (mostly blatant
spam). That was highly hapax-driven, and hapaxes are brittle (they catch
near-duplicates of msgs you've trained on, but don't seem ever to
generalize).
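
For anyone who wants to see what "purely mistake-based training" amounts to,
here is a minimal sketch. It is not the SpamBayes code or its API:
ToyClassifier, its crude averaging score, train_on_mistakes, and the 0.2/0.9
cutoffs are all assumptions made up for illustration. The point is only the
shape of the loop: a message is trained on only when the current database gets
it wrong or is unsure about it.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-based classifier (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of trained spam msgs containing it
        self.ham = Counter()   # token -> number of trained ham msgs containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        """Train on one message; each distinct token counts once per message."""
        if is_spam:
            self.spam.update(set(tokens))
            self.nspam += 1
        else:
            self.ham.update(set(tokens))
            self.nham += 1

    def score(self, tokens):
        """Crude spam score in [0, 1]; 0.5 means no evidence either way."""
        probs = []
        for tok in set(tokens):
            s = self.spam[tok] / self.nspam if self.nspam else 0.0
            h = self.ham[tok] / self.nham if self.nham else 0.0
            if s + h > 0:  # skip tokens never seen in training
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5

    def hapaxes(self):
        """Tokens that appear in exactly one trained message."""
        total = self.spam + self.ham
        return [tok for tok, n in total.items() if n == 1]


def train_on_mistakes(clf, stream, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on messages that are misclassified or land in Unsure.

    `stream` yields (tokens, is_spam) pairs. Returns the number of
    messages actually trained on.
    """
    trained = 0
    for tokens, is_spam in stream:
        score = clf.score(tokens)
        correct = score >= spam_cutoff if is_spam else score <= ham_cutoff
        if not correct:
            clf.learn(tokens, is_spam)
            trained += 1
    return trained


stream = [
    (["cheap", "pills"], True),
    (["cheap", "pills"], True),      # near-duplicate: already caught, not trained
    (["lunch", "tomorrow"], False),
    (["lunch", "tomorrow"], False),  # near-duplicate: already caught, not trained
]
clf = ToyClassifier()
trained = train_on_mistakes(clf, stream)
```

On this toy stream only the first message of each kind is trained on, so the
database stays tiny and every token in it is a hapax. That is the regime
described above: exact repeats get caught, but nothing in the database has
been seen often enough to say much about a message that merely resembles them.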
So I can have a tiny database that's more than good enough to make me very
happy, or push Mark toward training on 600 emails a day and have a gigantic
database that wouldn't make me any happier. This wasn't a hard choice
<wink>.