[spambayes-dev] Re: Idea to re-energize corpus learning

Martin Stone Davis m0davis at pacbell.net
Mon Nov 17 09:25:08 EST 2003

Skip Montanaro wrote:

>     Martin> I recently started this thread on the POPFile forum, but it
>     Martin> applies just as well to SpamBayes.
>     Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
> See my note from Sunday on spambayes-dev:
>     http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html
> Just because you train on a gazillion spams and hams doesn't mean the best
> course once you've screwed something up isn't to start over.  Like I said in
> the above message, I think there's a certain psychological barrier you have
> to overcome before you throw out a massive training database.  I suspect
> POPfile learns about as quickly as SpamBayes, so without proof I assert that
> starting over there is often going to be the right course as well.
> For example, it's rather easy for me to scan my current training database
> for mistakes, either in a semi-automated fashion using sb_filter.py or
> manually, because it only contains about 250 messages.  This was extremely
> difficult using my previous monster database (15k-20k messages).
> Skip

Wouldn't it be nice if there were some middle ground between continuing 
to train the huge immovable database and starting over fresh?  After 
all, it's more than just a psychological barrier.  Having to train 100% 
of incoming messages after starting over is real work, and especially 
frustrating when you *know* that 80-90% would have been correctly 
classified anyway if only you hadn't started over.

So why not soften the blow?  That's what my proposal amounts to: 
a middle ground between the status quo and starting over.  After 
performing a "Soften training SEVERELY" operation (where the counts 
are all set to their square roots), messages would still be classified 
in more-or-less the same way.  However, further training would then be 
far more effective, since each newly trained message would carry more 
weight against the lower counts.
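
To make the idea concrete, here is a minimal sketch of square-root 
softening.  It assumes a toy database of per-token (spam, ham) counts 
plus total message counts, and a simplified count-ratio probability.  
The names soften, spam_prob and shift_from_one_ham are hypothetical; 
this is not the actual SpamBayes or POPFile storage format or scoring 
code.

```python
import math

def spam_prob(spam_count, ham_count, nspam, nham):
    """Simplified per-token spam estimate from count ratios.

    Assumes nonzero totals and at least one nonzero count.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    return spam_ratio / (spam_ratio + ham_ratio)

def soften(token_counts, nspam, nham):
    """Set every count (per-token and totals) to its integer square root."""
    softened = {tok: (math.isqrt(s), math.isqrt(h))
                for tok, (s, h) in token_counts.items()}
    return softened, math.isqrt(nspam), math.isqrt(nham)

def shift_from_one_ham(spam_count, ham_count, nspam, nham):
    """How far one extra ham message containing the token lowers its estimate."""
    before = spam_prob(spam_count, ham_count, nspam, nham)
    after = spam_prob(spam_count, ham_count + 1, nspam, nham + 1)
    return before - after

if __name__ == "__main__":
    counts = {"offer": (900, 100)}      # made-up illustrative counts
    nspam, nham = 10000, 10000
    soft, soft_nspam, soft_nham = soften(counts, nspam, nham)

    s, h = counts["offer"]
    ss, sh = soft["offer"]
    print("estimate before softening:", spam_prob(s, h, nspam, nham))
    print("estimate after softening: ", spam_prob(ss, sh, soft_nspam, soft_nham))
    print("shift from one new ham, before:", shift_from_one_ham(s, h, nspam, nham))
    print("shift from one new ham, after: ",
          shift_from_one_ham(ss, sh, soft_nspam, soft_nham))
```

With these made-up numbers the token's estimate goes from 0.9 to about 
0.75: it stays on the spam side, but is pulled toward 0.5.  A single 
newly trained ham message then moves the estimate noticeably further 
than it did before softening.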

Doesn't that sound like a good idea?


P.S. I'm also sure that POPfile learns just as quickly as SpamBayes, 
since they are based on the same principle.
