[spambayes-dev] Re: Idea to re-energize corpus learning
Martin Stone Davis
m0davis at pacbell.net
Mon Nov 17 09:25:08 EST 2003
Skip Montanaro wrote:
> Martin> I recently started this thread on the POPFile forum, but it
> Martin> applies just as well to SpamBayes.
> Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
> See my note from Sunday on spambayes-dev:
> Just because you train on a gazillion spams and hams doesn't mean the best
> course once you've screwed something up isn't to start over. Like I said in
> the above message, I think there's a certain psychological barrier you have
> to overcome before you throw out a massive training database. I suspect
> POPfile learns about as quickly as SpamBayes, so without proof I assert that
> starting over there is often going to be the right course as well.
> For example, it's rather easy for me to scan my current training database
> for mistakes, either in a semi-automated fashion using sb_filter.py or
> manually, because it only contains about 250 messages. This was extremely
> difficult using my previous monster database (15k-20k messages).
Wouldn't it be nice if there were some middle ground between continuing
to train the huge immovable database and starting over fresh? After
all, it's more than just a psychological barrier. Having to train 100%
of incoming messages after starting over is real work, and especially
frustrating when you *know* that 80-90% would have been correctly
classified anyway if only you hadn't started over.
So why not soften the blow? That's what my proposal amounts to:
achieving some sort of middle ground between the status quo and starting
over. After performing a "Soften training SEVERELY" (where the counts
are all set to their square roots), messages would still be classified
in more or less the same way. However, further training would then be
far more effective, since the counts would be lower and each newly
trained message would carry more weight relative to them.
Doesn't that sound like a good idea?
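To make the square-root idea concrete, here is a minimal sketch. It is not SpamBayes or POPFile code: the dict of per-token (spam, ham) counts, the message totals nspam and nham, and the simplified spam_ratio() formula are all assumptions made for illustration, and the example numbers are made up. It only shows what "set the counts to their square roots" would do to one token, and how much one further training message moves that token before and after softening.

```python
import math


def soften(token_counts, nspam, nham):
    """Return square-rooted copies of the per-token counts and message totals.

    token_counts maps token -> (spam_count, ham_count). This layout is a
    hypothetical stand-in, not the real on-disk format of any filter.
    """
    softened = {
        token: (math.sqrt(spam), math.sqrt(ham))
        for token, (spam, ham) in token_counts.items()
    }
    return softened, math.sqrt(nspam), math.sqrt(nham)


def spam_ratio(spam, ham, nspam, nham):
    """Simplified per-token spam score: relative spam frequency over the sum
    of relative spam and ham frequencies (an assumed, stripped-down formula)."""
    spam_freq = spam / nspam
    ham_freq = ham / nham
    return spam_freq / (spam_freq + ham_freq)


# Made-up counts for a large, long-trained database.
counts = {"viagra": (900, 4), "meeting": (9, 400)}
nspam, nham = 10000, 10000

softened, soft_nspam, soft_nham = soften(counts, nspam, nham)

# Score of one token before and after softening.
spam, ham = counts["viagra"]
soft_spam, soft_ham = softened["viagra"]
before = spam_ratio(spam, ham, nspam, nham)
after = spam_ratio(soft_spam, soft_ham, soft_nspam, soft_nham)

# Effect of training one more ham message that contains the token.
before_shift = before - spam_ratio(spam, ham + 1, nspam, nham + 1)
after_shift = after - spam_ratio(soft_spam, soft_ham + 1, soft_nspam, soft_nham + 1)

print("score before/after softening:", before, after)
print("shift from one extra ham, before/after:", before_shift, after_shift)
```

In this toy case the token stays on the spammy side of 0.5 after softening, while a single extra ham message moves its score by a larger amount than it did with the original counts. Whether that carries over to whole-message classification on a real corpus would need testing.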
P.S. I'm also sure that POPFile learns just as quickly as SpamBayes,
since they are based on the same principle.